Why does WWW::Mechanize get some pages but not others?
I'm new to Perl/HTML. I'm trying to use $mech->get($url) to get something from the periodic table at http://en.wikipedia.org/wiki/Periodic_table, but it kept returning an error message like this:
Error GETing
http://en.wikipedia.org/wiki/Periodic_table:
Forbidden at PeriodicTable.pl line 13
But $mech->get($url) works fine if $url is http://search.cpan.org/.
Any help will be much appreciated!
Here is my code:
#!/usr/bin/perl -w
use strict;
use warnings;
use WWW::Mechanize;
use HTML::TreeBuilder;
my $mech = WWW::Mechanize->new( autocheck => 1 );
$mech = WWW::Mechanize->new();
my $table_url = "http://en.wikipedia.org/wiki/Periodic_table/";
$mech->get( $table_url );
2 Answers
It's because Wikipedia denies access to some programs based on the User-Agent supplied with the request.
You can alias yourself so that you appear to be a 'normal' web browser by setting the agent after instantiation and before the get(), as in the sketch below. That worked for me with the URL in your posting; shorter strings will probably work too.
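A minimal sketch of that fix; the browser-style User-Agent string here is only illustrative, and agent() is inherited from LWP::UserAgent:

#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new( autocheck => 1 );

# Pose as an ordinary browser; this particular string is just an example.
$mech->agent('Mozilla/5.0 (Windows NT 6.1; rv:10.0) Gecko/20100101 Firefox/10.0');

# Note: no trailing slash on the article URL.
$mech->get('http://en.wikipedia.org/wiki/Periodic_table');
print $mech->status, "\n";    # 200 once the request is accepted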
(I think you should also remove the trailing slash from the URL.)
WWW::Mechanize is a subclass of LWP::UserAgent; see the docs there for more information, including on the agent() method. You should limit your use of this method of access, though. Wikipedia explicitly denies access to some spiders in its robots.txt file, and the default user agent for LWP::UserAgent (which starts with libwww) is on that list.
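If you want to check what the site's robots.txt actually permits before fetching, WWW::RobotRules (from the same libwww family) can parse it. A minimal sketch, assuming the made-up agent name 'MyBot/1.0':

#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(get);
use WWW::RobotRules;

# 'MyBot/1.0' is a hypothetical agent name used only for illustration.
my $rules = WWW::RobotRules->new('MyBot/1.0');

# Fetch and parse the site's robots.txt.
my $robots_url = 'http://en.wikipedia.org/robots.txt';
my $robots_txt = get($robots_url);
$rules->parse($robots_url, $robots_txt) if defined $robots_txt;

# Ask whether this agent may fetch the page.
my $page = 'http://en.wikipedia.org/wiki/Periodic_table';
print $rules->allowed($page) ? "allowed\n" : "disallowed\n";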
When you have these sorts of problems, you need to watch the HTTP transactions so you can see what the webserver is sending back to you. In this case, you'd see that Mech connects and gets a response, but Wikipedia declines to serve your bot (hence the 403 Forbidden). I like HTTP Scoop on the Mac.
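If you'd rather stay inside Perl than use an external sniffer, LWP's handler hooks (which WWW::Mechanize inherits) can dump each transaction. A minimal sketch using add_handler(); the dump format comes from HTTP::Message:

#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

# autocheck => 0 so a 403 doesn't die before we see the dumps.
my $mech = WWW::Mechanize->new( autocheck => 0 );

# Print every outgoing request and incoming response, so you can see
# exactly what the server sends back (e.g. the 403 Forbidden here).
$mech->add_handler( request_send  => sub { print shift->dump, "\n"; return } );
$mech->add_handler( response_done => sub { print shift->dump, "\n"; return } );

$mech->get('http://en.wikipedia.org/wiki/Periodic_table');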