Why does WWW::Mechanize get some pages but not others?
I'm new to Perl/HTML. I'm trying to use $mech->get($url) to get something from the periodic table at http://en.wikipedia.org/wiki/Periodic_table, but it kept returning an error message like this:
Error GETing
http://en.wikipedia.org/wiki/Periodic_table:
Forbidden at PeriodicTable.pl line 13
But $mech->get($url) works fine if $url is http://search.cpan.org/.
Any help will be much appreciated!
Here is my code:
#!/usr/bin/perl -w
use strict;
use warnings;
use WWW::Mechanize;
use HTML::TreeBuilder;
my $mech = WWW::Mechanize->new( autocheck => 1 );
$mech = WWW::Mechanize->new();
my $table_url = "http://en.wikipedia.org/wiki/Periodic_table/";
$mech->get( $table_url );
2 Answers
It's because Wikipedia denies access to some programs based on the User-Agent supplied with the request.
You can alias yourself so that you appear to be a 'normal' web browser by setting the agent after instantiation and before the get(), as in the sketch below. That worked for me with the URL in your posting; shorter strings will probably work too.
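A minimal sketch of that fix; the browser-style User-Agent string here is only illustrative, and agent() is inherited from LWP::UserAgent:

#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new( autocheck => 1 );

# Pose as an ordinary browser; this particular string is just an example.
$mech->agent('Mozilla/5.0 (Windows NT 6.1; rv:10.0) Gecko/20100101 Firefox/10.0');

# Note: no trailing slash on the article URL.
$mech->get('http://en.wikipedia.org/wiki/Periodic_table');
print $mech->status, "\n";    # 200 once the request is accepted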
(I think you should also remove the trailing slash from the URL.)
WWW::Mechanize is a subclass of LWP::UserAgent; see the docs there for more information, including on the agent() method. You should limit your use of this method of access, though. Wikipedia explicitly denies access to some spiders in its robots.txt file, and the default user agent for LWP::UserAgent (which starts with libwww) is on that list.
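If you want to check what the site's robots.txt actually permits before fetching, WWW::RobotRules (from the same libwww family) can parse it. A minimal sketch, assuming the made-up agent name 'MyBot/1.0':

#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(get);
use WWW::RobotRules;

# 'MyBot/1.0' is a hypothetical agent name used only for illustration.
my $rules = WWW::RobotRules->new('MyBot/1.0');

# Fetch and parse the site's robots.txt.
my $robots_url = 'http://en.wikipedia.org/robots.txt';
my $robots_txt = get($robots_url);
$rules->parse($robots_url, $robots_txt) if defined $robots_txt;

# Ask whether this agent may fetch the page.
my $page = 'http://en.wikipedia.org/wiki/Periodic_table';
print $rules->allowed($page) ? "allowed\n" : "disallowed\n";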
When you have these sorts of problems, you need to watch the HTTP transactions so you can see what the webserver is sending back to you. In this case, you'd see that Mech connects and gets a response, but Wikipedia declines to serve your bot (hence the 403 Forbidden). I like HTTP Scoop on the Mac.
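If you'd rather stay inside Perl than use an external sniffer, LWP's handler hooks (which WWW::Mechanize inherits) can dump each transaction. A minimal sketch using add_handler(); the dump format comes from HTTP::Message:

#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

# autocheck => 0 so a 403 doesn't die before we see the dumps.
my $mech = WWW::Mechanize->new( autocheck => 0 );

# Print every outgoing request and incoming response, so you can see
# exactly what the server sends back (e.g. the 403 Forbidden here).
$mech->add_handler( request_send  => sub { print shift->dump, "\n"; return } );
$mech->add_handler( response_done => sub { print shift->dump, "\n"; return } );

$mech->get('http://en.wikipedia.org/wiki/Periodic_table');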