Would this be easier with Mechanize?

Posted on 2024-12-20 11:52:36

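In this post I learned that Mechanize (Ruby/Perl) is easier to use than HTML::TreeBuilder 3 for that particular example.

Is Mechanize also superior to HTML::TokeParser?

Would the code below have been easier to write in Ruby using Mechanize?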

use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;
use HTML::TokeParser;

sub get_img_page_urls {
    my $url = shift;

    my $ua = LWP::UserAgent->new;
    $ua->agent("Mozilla/8.0");    # browser-like User-Agent string

    my $req = HTTP::Request->new(GET => $url);
    $req->header('Accept' => 'text/html');

    my $response_u = $ua->request($req);  # send request

    die "Error: ", $response_u->status_line unless $response_u->is_success;

    my $stream = HTML::TokeParser->new(\$response_u->content);

    my %urls = ();

    my $found_thumbnails = 0;
    my $found_thumb = 0;

    while (my $token = $stream->get_token) {

        # <div class="thumb-box" ... >
        if ($token->[0] eq 'S' and $token->[1] eq 'div' and ($token->[2]{class} // '') eq 'thumb-box') {
            $found_thumbnails = 1;
        }

        # <div class="thumb" ... >
        if ($token->[0] eq 'S' and $token->[1] eq 'div' and ($token->[2]{class} // '') eq 'thumb') {
            $found_thumb = 1;
        }

        # first <a href="..."> after both divs have been seen
        if ($found_thumbnails and $found_thumb and $token->[0] eq 'S' and $token->[1] eq 'a'
            and defined $token->[2]{href}) {
            $urls{'http://example.com' . $token->[2]{href}} = 1;

            # One URL has been found; start over.
            $found_thumb = 0;
            $found_thumbnails = 0;
        }

    }

    return %urls;
}

Comments (3)

祁梦 2024-12-27 11:52:36

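As far as the interface goes, anything is better than HTML::TokeParser. WWW::Mechanize shines with forms, but it lacks a declarative way to find specific elements. I'm fond of Web::Query and HTML::Query, which modeled their interfaces after jQuery, the library that, as far as I know, made this style of programming popular.

The program from the question becomes shorter, as follows. It raises exceptions automatically, so no explicit error handling is needed.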

use URI;
use Web::Query 'wq';

sub get_img_page_urls {
    my ($url) = @_;
    $Web::Query::UserAgent = LWP::UserAgent->new(agent => 'Mozilla/8.0');

    return map {
        URI->new($_)->abs('http://example.com')->as_string   # hash key
        => 1                                                 # hash value
    } wq($url)->find('div.thumb-box div.thumb a')->attr('href');
}

Previously posted as a comment: https://stackoverflow.com/q/8274221#comment-10196381

归属感 2024-12-27 11:52:36

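Mechanize is more than a parser. It adds an emulated browser, which lets you navigate a site, fill out forms, and so on. It also includes a parser, which makes web scraping very simple. Here is your method rewritten with Ruby's Mechanize: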

require 'mechanize'

def get_img_page_urls(url)
  agent = Mechanize.new
  agent.user_agent_alias = "Windows Mozilla"
  # search returns attribute nodes; map them to plain href strings
  agent.get(url).search("//div[@class='thumb-box']/div[@class='thumb']/a/@href").map(&:value)
end

哆兒滾 2024-12-27 11:52:36

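I'm not sure you need Mechanize at all; I think Nokogiri would suffice. I don't know Perl, so I'm not entirely sure how the HTML in your example is laid out, but I'm assuming it looks like this: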

<div class="thumb-box">
  ...
  <div class="thumb">
    ...
    <a href="http://example.com/img/5.jpg">...
  </div>
</div>

Here's the code with Nokogiri:

require 'nokogiri'
require 'open-uri'

def get_img_page_urls(url)
  urls = []
  # Fetch the URL passed in; URI.open is required on Ruby 3.0+, where Kernel#open no longer opens URLs
  doc = Nokogiri::HTML(URI.open(url, 'User-Agent' => 'Mozilla/8.0'))
  doc.css('div.thumb-box div.thumb a').each do |link|
    urls << link.attr("href")
  end

  urls
end
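
The Perl code in the question also prefixes every href with `http://example.com` and stores the results as hash keys, which removes duplicates. If the hrefs on the real page are relative, a similar post-processing step might look like the sketch below. It uses only the standard library; the helper name `absolutize_urls`, the default base URL, and the sample paths are assumptions for illustration.

```ruby
require 'uri'

# Hypothetical helper: resolve each href against a base URL and drop duplicates.
def absolutize_urls(hrefs, base = 'http://example.com')
  hrefs.map { |href| URI.join(base, href).to_s }.uniq
end

# Made-up sample input mixing relative and absolute links.
sample = ['/img/5.html', 'http://example.com/img/5.html', '/img/6.html']
puts absolutize_urls(sample)
```

With this in place, the method above could end with `absolutize_urls(urls)` instead of `urls`. `URI.join` leaves hrefs that are already absolute unchanged, so it is safe to apply to a mixed list.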
