Would using Mechanize make this easier?
In this post I learned that Mechanize in Ruby/Perl is easier to use than HTML::TreeBuilder 3 in that particular example.
Is Mechanize superior to HTML::TokeParser?
Would the code below also have been easier to write in Ruby using Mechanize?
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;
use HTML::TokeParser;

# Return a hash whose keys are the image-page URLs linked from the thumbnails.
sub get_img_page_urls {
    my $url = shift;

    my $ua = LWP::UserAgent->new;
    $ua->agent("Mozilla/8.0");

    my $req = HTTP::Request->new(GET => $url);
    $req->header('Accept' => 'text/html');

    my $response_u = $ua->request($req);    # send request
    die "Error: ", $response_u->status_line unless $response_u->is_success;

    my $content = $response_u->content;
    my $stream  = HTML::TokeParser->new(\$content);
    my %urls;
    my $found_thumbnails = 0;
    my $found_thumb      = 0;

    while (my $token = $stream->get_token) {
        next unless $token->[0] eq 'S';       # only start tags matter here
        my ($tag, $attr) = @{$token}[1, 2];
        my $class = $attr->{class} // '';     # avoid "uninitialized" warnings

        # <div class="thumb-box" ... >
        $found_thumbnails = 1 if $tag eq 'div' and $class eq 'thumb-box';

        # <div class="thumb" ... >
        $found_thumb = 1 if $tag eq 'div' and $class eq 'thumb';

        # <a ... >
        if ($found_thumbnails and $found_thumb and $tag eq 'a') {
            $urls{'http://example.com' . $attr->{href}} = 1;

            # One URL has been found. Now start all over.
            $found_thumb      = 0;
            $found_thumbnails = 0;
        }
    }
    return %urls;
}
Comments (3)
As far as the interface goes, anything is better than HTML::TokeParser. WWW::Mechanize shines with forms, but it also lacks a declarative way to find certain elements. I'm fond of Web::Query and HTML::Query, which model their interface on jQuery, the library that, as far as I know, made this style of programming popular.
The program from the question becomes shorter, as follows. It raises exceptions automatically, so there is no need for explicit error handling.
Previously posted as a comment: https://stackoverflow.com/q/8274221#comment-10196381
Mechanize is more than a parser. It adds an emulated browser, which lets you navigate a site, fill out forms, and so on. But it also includes a parser, which makes web scraping very simple. Here's your method rewritten using Ruby Mechanize:
I'm not sure you need Mechanize at all, as I think Nokogiri would suffice. I don't know Perl, so I am not entirely sure how the HTML in your example is laid out, but I am assuming it's like this:
Here's the code with Nokogiri: