DOM processing with Perl and WWW::Mechanize: finishing a small program

Posted 2024-11-08 17:31:21

I'm currently working on a little harvester, using this dataset of 2700 foundations. All the data are free to use, with no limitations or copyright issues.

What I have so far: the harvesting task should be no problem with WWW::Mechanize, particularly for running the form-based search and selecting the individual entries. I guess the algorithm is basically two nested loops: the outer loop runs the form-based search, and the inner loop processes the search results.

The outer loop would use the select() and submit_form() functions on the second search form on the page. Can we use DOM processing here? In other words, how can we get the selection values?
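
One possible way to read the selection values is through HTML::Form, the class WWW::Mechanize uses for its form objects. The sketch below parses an inline page instead of the real site; the markup, the field name region and the helper select_values are assumptions made for illustration only.

```perl
use strict;
use warnings;
use HTML::Form;

# Hypothetical stand-in for the search page: two forms, the second with a <select>.
my $html = <<'HTML';
<html><body>
<form action="/quick" method="get"><input type="text" name="q"></form>
<form action="/search" method="post">
  <select name="region">
    <option value="">any</option>
    <option value="north">North</option>
    <option value="south">South</option>
  </select>
  <input type="submit" value="Search">
</form>
</body></html>
HTML

# Return the non-empty option values of a named <select> in the given form.
sub select_values {
    my ($form, $field) = @_;
    my $input = $form->find_input($field)
        or die "no input named '$field' in form\n";
    return grep { defined && length } $input->possible_values;
}

my @forms  = HTML::Form->parse($html, 'http://example.com/');
my @values = select_values($forms[1], 'region');   # second form on the page
print "$_\n" for @values;

# With a live page the same kind of object comes from $mech->form_number(2),
# and each value could then be submitted with:
#   $mech->submit_form(form_number => 2, fields => { region => $value });
```

The outer loop would then iterate over @values and submit the form once per value.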

The inner loop through the results would use the follow_link() function to get to the actual entries, using the following call:

$mech->follow_link( url_regex => qr{webgrab_path=http://evs2000.*\?Id=\d+$}, n => $result_nbr );

This would forward our Mechanize browser to the entry page. Basically, the url_regex matches links whose URL contains webgrab_path followed by an Id parameter, which is unique for each database entry. The $result_nbr variable tells Mechanize which of the results it should follow next.
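
To see what the pattern accepts before handing it to follow_link(), it can be tried on a few sample hrefs in plain Perl. This is only a sketch: the hrefs and the host evs2000.example are invented, and the real links may look different.

```perl
use strict;
use warnings;

# Same pattern as in the follow_link() call, written with qr{} to avoid escaping slashes.
my $entry_re = qr{webgrab_path=http://evs2000.*\?Id=\d+$};

# Hypothetical hrefs of the kind a result page might contain.
my @hrefs = (
    '/search?page=2',
    '/show?webgrab_path=http://evs2000.example/detail.cfm?Id=101',
    '/show?webgrab_path=http://evs2000.example/detail.cfm?Id=102',
    '/show?webgrab_path=http://evs2000.example/detail.cfm?Id=abc',
);

my @entries = grep { $_ =~ $entry_re } @hrefs;

# follow_link() counts matches from 1, so n => 2 would pick the second entry.
my $result_nbr = 2;
my $picked     = $entries[ $result_nbr - 1 ];

printf "%d matching links, n=%d is %s\n", scalar @entries, $result_nbr, $picked;
```

Only the two links ending in a numeric Id pass the filter; the paging link and the non-numeric Id are skipped.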

If we have several result pages, we would use the same trick to traverse them. For the semantic extraction of the entry information, we could parse the content of the actual entries with XML::LibXML's HTML parser (which works fine on this page), because it provides powerful DOM selection methods using XPath.
The actual looping through the pages should be doable in a few lines of Perl (20 lines at most, likely fewer).
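
The extraction step might look roughly like the following. The entry markup, the class name entry and the helper extract_entry are assumptions; the real pages will need their own XPath expressions.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical entry page; the real markup will differ.
my $html = <<'HTML';
<html><body>
<table class="entry">
  <tr><td>Name</td><td>Example Foundation</td></tr>
  <tr><td>City</td><td>Berlin</td></tr>
</table>
</body></html>
HTML

# Parse (possibly sloppy) HTML and collect label/value pairs from two-column rows.
sub extract_entry {
    my ($content) = @_;
    my $doc = XML::LibXML->load_html(
        string          => $content,
        recover         => 1,   # keep going on malformed markup
        suppress_errors => 1,
    );
    my %entry;
    for my $row ( $doc->findnodes('//table[@class="entry"]//tr') ) {
        my $label = $row->findvalue('normalize-space(td[1])');
        my $value = $row->findvalue('normalize-space(td[2])');
        $entry{$label} = $value if length $label;
    }
    return \%entry;
}

my $entry = extract_entry($html);
print "$_: $entry->{$_}\n" for sort keys %$entry;

# With WWW::Mechanize the input would be extract_entry( $mech->content ).
```

Returning a hash reference per entry keeps the extraction separate from the crawling loop, so either part can be changed without touching the other.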

But wait: the processing of the entry pages will then be the most complex part
of the script.

Approaches: In principle we could do the same algorithm with a single while loop
if we use the back() function smartly.
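
A minimal sketch of the single-loop idea, run against throwaway local files so it needs no network access. The file names and markup are made up, and the file:// URL assumes a Unix-style path; on the real site the pattern would be the webgrab_path one.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use WWW::Mechanize;

# Build a tiny local "site"; names and markup are invented for the illustration.
my $dir   = tempdir( CLEANUP => 1 );
my %pages = (
    'results.html' => '<html><head><title>Results</title></head><body>'
                    . '<a href="entry1.html">one</a> <a href="entry2.html">two</a>'
                    . '</body></html>',
    'entry1.html'  => '<html><head><title>Entry 1</title></head><body>first</body></html>',
    'entry2.html'  => '<html><head><title>Entry 2</title></head><body>second</body></html>',
);
for my $name ( keys %pages ) {
    open my $fh, '>', "$dir/$name" or die "cannot write $name: $!";
    print {$fh} $pages{$name};
    close $fh;
}

my $mech = WWW::Mechanize->new;
$mech->get("file://$dir/results.html");   # Unix-style absolute path assumed

# Count the matching links first so the loop knows when to stop.
my $entry_re = qr/entry\d+\.html$/;
my @links    = $mech->find_all_links( url_regex => $entry_re );
my $total    = @links;

my @titles;
my $result_nbr = 1;
while ( $result_nbr <= $total ) {
    $mech->follow_link( url_regex => $entry_re, n => $result_nbr );
    push @titles, $mech->title;   # entry page processing would go here
    $mech->back;                  # return to the result list
    $result_nbr++;
}
print "$_\n" for @titles;
```

Since find_all_links() already returns every matching link, another option is to fetch each link's absolute URL directly and skip back() altogether.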

Can you give me a hint on where to begin, namely the processing of the entry pages, when doing this with WWW::Mechanize?

Here's what I have:

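use strict;
use warnings;
use WWW::Mechanize;

# The starting URL is taken from the command line; it is not fixed here.
my $start_url = shift @ARGV or die "usage: $0 <starting url>\n";
GetThePage($start_url);

sub GetThePage {
    my @pages = @_;                  # queue of pages still to visit
    my $mech  = WWW::Mechanize->new;
    my %seen;                        # guard against fetching a page twice
    while (@pages) {
        my $page = shift @pages;
        next if $seen{$page}++;
        $mech->get($page);
        push @pages, GetMorePages($mech);   # enqueue further result/entry pages
        SomethingImportant($mech);
        SomethingXPATH($mech);              # extract fields from the current page
    }
}

# Placeholders for the parts that still have to be written.
sub GetMorePages       { return () }
sub SomethingImportant { return }
sub SomethingXPATH     { return }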

The question is how to find the DOM-paths.
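
One way to discover candidate paths is to search the parsed page for a value that is known to appear on it and ask XML::LibXML for the node's path via nodePath(). The markup and the helper paths_for_text below are assumptions; the positional paths it reports are a starting point and may break if the page layout changes.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical entry markup; replace with $mech->content from a real entry page.
my $html = <<'HTML';
<html><body>
<div id="main">
  <table>
    <tr><td>Name</td><td>Example Foundation</td></tr>
    <tr><td>City</td><td>Berlin</td></tr>
  </table>
</div>
</body></html>
HTML

my $doc = XML::LibXML->load_html(
    string          => $html,
    recover         => 1,
    suppress_errors => 1,
);

# Return the document paths of all elements whose own text contains $needle.
# Note: $needle must not contain a double quote in this simple version.
sub paths_for_text {
    my ($dom, $needle) = @_;
    my $xpath = sprintf '//*[text()[contains(., "%s")]]', $needle;
    return map { $_->nodePath } $dom->findnodes($xpath);
}

my @paths = paths_for_text($doc, 'Berlin');
print "$_\n" for @paths;

# Each reported path can be fed straight back into findvalue().
my $city = $doc->findvalue( $paths[0] );
```

Once a path is known, it is usually worth rewriting it by hand into something less positional, for example by anchoring on an id or a label cell.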
