Trying to scrape JSON data from http://www.firstgiving.com using LWP::UserAgent
As some of you may have heard, several subreddits are having a charity drive at the moment, notably r/atheism. In the interests of helping/encouraging fundraising, I've started writing a little web utility to provide real-time information about these donations (basically, mashing up data from Reddit with data from FirstGiving). You can see what I have so far here; it just shows the totals and average figures for each subreddit, and it's very preliminary (also not pretty).
A feature I'd like to add is something which FirstGiving doesn't seem to offer, the ability to search for or link to a specific donation. There were a lot of posts last week in which people tried to offer donation matching and similar, but there were also a lot of fake/troll posts, and no good way to verify whether someone was "delivering" (we all know screenshots are easily faked.) I plan to cache data from FirstGiving to allow someone to link to
Having examined the FirstGiving page, there seems to be an undocumented JSON API call (used when scrolling to the bottom of the page to display more donations) which will return a list of donation amounts, messages and nicknames as an HTML table. Here's what it looks like when I access it in my browser (Opera), according to Opera Dragonfly:
URL: http://www.firstgiving.com/ProfileWebApi/Donations
Method: POST
Status: 200 OK
Duration: 1220 ms
Request details
POST /ProfileWebApi/Donations HTTP/1.1
User-Agent: Opera/9.80 (Windows NT 6.1; U; Edition United Kingdom Local; en) Presto/2.10.229 Version/11.60
Host: www.firstgiving.com
Accept-Language: en-GB,en;q=0.9
Accept-Encoding: gzip, deflate
Referer: http://www.firstgiving.com/fundraiser/r-atheism/ratheism
Cookie: ASP.NET_SessionId=rmsl4b45jdxwykanpoqkb255
Connection: Keep-Alive
Content-Length: 111
Content-Type: application/json;
Accept: application/json, text/javascript, */*; q=0.01
X-Requested-With: XMLHttpRequest
Content-Transfer-Encoding: binary
Request body
{"EventGivingGroupId":1476950,"TotalRaised":"190776.020000","PageIsExpired":false,"PageNumber":4,"PageSize":50}
Response details
HTTP/1.1 200 OK
Cache-Control: private
Content-Length: 62979
Content-Type: application/json; charset=utf-8
Server: Microsoft-IIS/7.5
X-AspNetMvc-Version: 2.0
X-AspNet-Version: 2.0.50727
X-Powered-By: ASP.NET
Date: Tue, 13 Dec 2011 19:13:28 GMT
Body
{"Data":"\u0009\u000d\u000a\u0009\u0009\u0009\u0009\u000d\u000a <table class=\"donationTable collapsed\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\" style='height:0px; overflow:hidden;' >\u000d\u000a <thead class=\"visuallyhidden\">\u000d\u000a\u0009\u0009 <tr>\u000d\u000a <th scope=\"col\">Comment<\/th>\u000d\u000a <th scope=\"col\" class=\"amount\">Donation<\/th>\u000d\u000a <\/tr>\u000d\u000a <\/thead>\u000d\u000a\u0009\u0009\u0009 \u000d\u000a <tr> \u000d\u000a <td class=\"comment\">\u000d\u000a \u000d\u000a <strong>Dear Regan Layman<\/strong>\u000d\u000a Happy holidays :)<br \/>\u000d\u000a \u000d\u000a <time datetime=\"2011-12-10T21:55:35.0000000\">\u000d\u000a 12\/10\/2011\u000d\u000a <\/time>\u000d\u000a \u000d\u000a <\/td>\u000d\u000a \u000d\u000a <td class=\"amount\">\u000d\u000a $20.00<sup style=\"font-size:10px;\" title=\"Offline donation\"><\/sup> \u000d\u000a \u000d\u000a <\/td>\u000d\u000a <\/tr>\u000d\u000a\u0009 \u000d\u000a <tr> \u000d\u000a <td class=\"comment\">\u000d\u000a \u000d\u000a <strong>Frodo Baggins<\/strong>\u000d\u000a Due to the fact that doctors heal people, not God!<br \/>\u000d\u000a \u000d\u000a <time datetime=\"2011-12-10T21:52:11.0000000\">\u000d\u000a 12\/10\/2011\u000d\u000a <\/time>\u000d\u000a \u000d\u000a <\/td>\u000d\u000a \u000d\u000a <td class=\"amount\">\u000d\u000a $4.64<sup style=\"font-size:10px;\" title=\"Offline donation\"><\/sup> \u000d\u000a \u000d\u000a <\/td>\u000d\u000a <\/tr>\u000d\u000a\u0009 \u000d\u000a
(I snipped the rest of the response body. Also, there are usually more cookies, but I manually deleted everything except the ASP.NET_SessionId cookie and the request still worked, so the others don't appear to be relevant to anything except analytics etc.)
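As a rough illustration of what could be done with a response like the one above, here is a minimal sketch that decodes the JSON wrapper and pulls the name, timestamp and amount out of the embedded HTML table. The sample body is a shortened stand-in built from the two rows shown above, and the regular expression is an assumption about the markup, not anything FirstGiving documents; a real HTML parser would be more robust.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Shortened stand-in for the response body shown above (two rows only).
my $body = <<'JSON';
{"Data":"<table><tr><td class=\"comment\"><strong>Dear Regan Layman<\/strong> Happy holidays :)<br \/><time datetime=\"2011-12-10T21:55:35.0000000\">12\/10\/2011<\/time><\/td><td class=\"amount\">$20.00<sup><\/sup><\/td><\/tr><tr><td class=\"comment\"><strong>Frodo Baggins<\/strong> Due to the fact that doctors heal people, not God!<br \/><time datetime=\"2011-12-10T21:52:11.0000000\">12\/10\/2011<\/time><\/td><td class=\"amount\">$4.64<sup><\/sup><\/td><\/tr><\/table>"}
JSON

# The HTML table is carried as a string in the "Data" key.
my $html = decode_json($body)->{Data};

# Assumed row layout: <strong>name</strong> ... datetime="..." ... class="amount">$N
my @donations;
while ($html =~ m{<strong>(.*?)</strong>.*?datetime="([^"]+)".*?class="amount">\s*\$([\d.,]+)}gs) {
    push @donations, { name => $1, when => $2, amount => $3 };
}

printf "%s  %-20s  \$%s\n", $_->{when}, $_->{name}, $_->{amount} for @donations;
```

JSON::PP ships with Perl, so this runs without the JSON module the script below loads; either module's decode_json would do here.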
However, when I try to do the same thing from a Perl script, I don't get this useful output. Here is my script:
#!/usr/bin/perl -w
use LWP::Simple;
use JSON;
use HTTP::Cookies;
use LWP::UserAgent;
use Data::Dumper;
my $cookie_jar = HTTP::Cookies->new;
my $ua = LWP::UserAgent->new(cookie_jar => $cookie_jar);
#push @{ $ua->requests_redirectable }, 'POST';
$ua->get('http://www.firstgiving.com/fundraiser/r-atheism/ratheism');
print Dumper $cookie_jar;
my $req = HTTP::Request->new(
'POST',
'http://www.firstgiving.com/ProfileWebApi/Donations');
$req->header('Accept-Encoding' => 'gzip, deflate');
$req->header('Referer' => 'http://www.firstgiving.com/fundraiser/r-atheism/ratheism');
$req->header('X-Requested-With' => 'XMLHttpRequest');
$req->header('Content-Transfer-Encoding' => 'binary');
$req->header('Content-type:' => 'application/json');
$req->header('User-Agent' => 'Opera/9.80 (Windows NT 6.1; U; Edition United Kingdom Local; en) Presto/2.10.229 Version/11.60');
$req->content('{"EventGivingGroupId":1476950,"TotalRaised":"190776.020000","PageIsExpired":true,"PageNumber":2,"PageSize":50}');
#$req->content('{"EventGivingGroupId":1476950,"PageNumber":1,"PageSize":50}');
my $post_request = $ua->request($req);
print Dumper( ($post_request) );
and here is the output:
$VAR1 = bless( {
'COOKIES' => {
'www.firstgiving.com' => {
'/' => {
'ASP.NET_SessionId' => [
0,
'yynhqi2udtz4y055fakdvjiu',
undef,
1,
undef,
undef,
1,
{
'HttpOnly' => undef
}
]
}
}
}
}, 'HTTP::Cookies' );
$VAR1 = bless( {
'_protocol' => 'HTTP/1.1',
'_content' => '<html><head><title>Object moved</title></head><body>
<h2>Object moved to <a href="%2ferror%2f404">here</a>.</h2>
</body></html>
',
'_rc' => '302',
'_headers' => bless( {
'x-powered-by' => 'ASP.NET',
'client-response-num' => 1,
'location' => '/error/404',
'cache-control' => 'private',
'date' => 'Tue, 13 Dec 2011 19:43:56 GMT',
'client-peer' => '204.12.127.197:80',
'x-aspnet-version' => '2.0.50727',
'client-date' => 'Tue, 13 Dec 2011 19:36:45 GMT',
'x-aspnetmvc-version' => '2.0',
'content-type' => 'text/html; charset=utf-8',
'title' => 'Object moved',
'client-transfer-encoding' => [
'chunked'
],
'server' => 'Microsoft-IIS/7.5'
}, 'HTTP::Headers' ),
'_msg' => 'Found',
'_request' => bless( {
'_content' => '{"EventGivingGroupId":1476950,"TotalRaised":"190776.020000","PageIsExpired":true,"PageNumber":2,"PageSize":50}',
'_uri' => bless( do{\(my $o = 'http://www.firstgiving.com/ProfileWebApi/Donations')}, 'URI::http' ),
'_headers' => bless( {
'cookie2' => '$Version="1"',
'user-agent' => 'Opera/9.80 (Windows NT 6.1; U; Edition United Kingdom Local; en) Presto/2.10.229 Version/11.60',
'cookie' => 'ASP.NET_SessionId=yynhqi2udtz4y055fakdvjiu',
'x-requested-with' => 'XMLHttpRequest',
'accept-encoding' => 'gzip, deflate',
'content-transfer-encoding' => 'binary',
'content-type:' => 'application/json',
'referer' => 'http://www.firstgiving.com/fundraiser/r-atheism/ratheism'
}, 'HTTP::Headers' ),
'_method' => 'POST',
'_uri_canonical' => $VAR1->{'_request'}{'_uri'}
}, 'HTTP::Request' )
}, 'HTTP::Response' );
If I enable the line push @{ $ua->requests_redirectable }, 'POST'; (i.e., allow redirection for POST), it redirects to a 404 error page.
If this is some intentional attempt by FirstGiving to keep out non-human clients, I'll of course give up, but their robots.txt doesn't seem to prohibit what I'm doing.
Comments (1)
Add the
Accept: application/json, text/javascript, */*; q=0.01
header. Not a header I'd normally expect to be critical, but in this case it seems to be. I did a quick little test using curl: the request that sent this header worked, and the one that left it out gave me the redirect.
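For the Perl script in the question, the same change might look like the sketch below. It only builds and prints the request, so it can be inspected without touching the network. The Accept value and the request body are copied from the browser capture in the question. Writing the field name as Content-Type, without the trailing colon used in the question's script, is an extra tidy-up on my part; the answer attributes the fix to the Accept header alone.

```perl
use strict;
use warnings;
use HTTP::Request;

my $page = 'http://www.firstgiving.com/fundraiser/r-atheism/ratheism';

# Headers are passed as an array ref of name/value pairs.
my $req = HTTP::Request->new(
    POST => 'http://www.firstgiving.com/ProfileWebApi/Donations',
    [
        'Accept'           => 'application/json, text/javascript, */*; q=0.01',
        'Content-Type'     => 'application/json',   # field name without a trailing colon
        'X-Requested-With' => 'XMLHttpRequest',
        'Referer'          => $page,
    ],
    '{"EventGivingGroupId":1476950,"TotalRaised":"190776.020000","PageIsExpired":false,"PageNumber":4,"PageSize":50}',
);

# Show exactly what would go over the wire.
print $req->as_string;

# To send it, reuse the cookie-carrying $ua from the question:
#   my $res = $ua->request($req);
#   print $res->status_line, "\n";
```

Sending it through the same LWP::UserAgent that fetched the fundraiser page keeps the ASP.NET_SessionId cookie attached, as in the original script.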