Downloading XML results with LWP::UserAgent in Perl


I'm hoping for some assistance with a Perl issue.

I need to download an XML file that is the result of a query, parse the results, grab the next link from the XML file, download & repeat.

I have been able to download and parse the first result set fine.

I grab the next URL, but it seems that returned result never changes. I.e.: the second time through the loop, $res->content is the same as the first time. Therefore, the value of $url never changes after the first download.

I'm suspecting it is a scope problem, but I just cannot seem to get a handle on this.

use LWP::UserAgent;
use HTTP::Cookies;
use Data::Dumper;
use XML::LibXML;
use strict;

my $url = "http://quod.lib.umich.edu/cgi/f/findaid/findaid-idx?c=bhlead&cc=bhlead&type=simple&rgn=Entire+Finding+Aid&q1=civil+war&Submit=Search;debug=xml";

while ($url ne ""){

    my $ua = LWP::UserAgent->new();    
    $ua->agent('Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)');
    $ua->timeout(30);
    $ua->default_header('pragma' => "no-cache", 'max-age' => '0');

    print "Download URL:\n$url\n\n";

    my $res = $ua->get($url);

    if ($res->is_error) {
        print STDERR __LINE__, " Error: ", $res->status_line, " ", $res;
        exit;
    } 

    my $parser = XML::LibXML->new(); 
    my $doc = $parser->load_xml(string=>$res->content);

    #grab the url of the next result set
    $url = $doc->findvalue('//ResultsLinks/SliceNavigationLinks/NextHitsLink');

    print "NEXT URL:\n$url\n\n";

}
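
One quick way to check whether the second fetch really returns identical content is to fingerprint each response body and print the extracted link verbatim; a minimal diagnostic sketch, assuming the same $res and $doc as in the loop above:

use Digest::MD5 qw(md5_hex);   # core module, used here only for a body fingerprint

    # drop these lines into the loop after $doc is built:

    # if this length/MD5 pair is identical on every pass, the server really is
    # returning the same document each time, not just a parsing problem
    printf "body: length=%d md5=%s\n",
        length $res->content, md5_hex($res->content);

    # show the extracted link exactly as found; an empty string or HTML-escaped
    # &amp; entities would both explain a loop that never advances
    printf "raw NextHitsLink: [%s]\n",
        $doc->findvalue('//ResultsLinks/SliceNavigationLinks/NextHitsLink');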


Comments (2)

只是我以为 2024-10-24 10:06:02


I suspect the doc you're getting isn't what you expect. It looks like you're fetching some kind of search page and then trying to crawl the resulting pages. Make sure JavaScript isn't responsible for your fetch not returning the content you expect, as in this other question.

Also, you might try dumping the headers to see if you can find another clue:

use Data::Dumper;
print Dumper($res->headers), "\n";
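
The status line and headers can also be printed in a more readable form (same $res as above):

print $res->status_line, "\n";
print $res->headers->as_string, "\n";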

As an aside, you should probably get into the habit of adding "use warnings", if you haven't already.

在你怀里撒娇 2024-10-24 10:06:02


The server may be giving you only default results without an HTTP_REFERER. I've seen some setups do this deliberately to discourage scraping.

Try this:

Before the while loop, add in:

my $referer;

Right before you have:

# grab the result of...

Add in:

$referer = $url;

That way you save the previous URL before resetting it to the next one.

Then, in your UserAgent header settings, add that in:

    $ua->default_header('pragma' => "no-cache", 'max-age' => '0', 'Referer' => $referer);
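
Put together, the modified loop might look roughly like this (an untested sketch; it keeps the original URL, XPath, and headers from the question and only adds the Referer handling):

use strict;
use warnings;
use LWP::UserAgent;
use XML::LibXML;

my $url = "http://quod.lib.umich.edu/cgi/f/findaid/findaid-idx?c=bhlead&cc=bhlead&type=simple&rgn=Entire+Finding+Aid&q1=civil+war&Submit=Search;debug=xml";
my $referer = "";    # previous URL; empty on the first request

while ($url ne ""){

    my $ua = LWP::UserAgent->new();
    $ua->agent('Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)');
    $ua->timeout(30);
    # send the previous page as the Referer so the request looks browser-like
    $ua->default_header('pragma' => "no-cache", 'max-age' => '0', 'Referer' => $referer);

    my $res = $ua->get($url);
    if ($res->is_error) {
        print STDERR "Error: ", $res->status_line, "\n";
        exit;
    }

    my $doc = XML::LibXML->load_xml(string => $res->content);

    # remember the current URL before overwriting it with the next one
    $referer = $url;
    $url     = $doc->findvalue('//ResultsLinks/SliceNavigationLinks/NextHitsLink');

    print "NEXT URL:\n$url\n\n";
}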

I won't say for sure that this is the problem, but in my experience that's where I'd start.
Another option is to try it outside of LWP. Log all of your URLs to a file and try wget-ting them or lynx --source-ing them from the command line to see whether you get different results than LWP gives you. If not, it's certainly something the server is doing, and the trick is to find a way to work around it... and that usually means duplicating what a regular web browser does more closely (so comparing the headers you send against the headers sent by Firebug in Firefox or the Inspector in Safari can help a lot).
