Extract all links from an HTML page, excluding links in a specific table

Posted on 2024-09-19 06:07:36

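I'm pretty new to Perl/HTML. Here is what I'm trying to do with WWW::Mechanize and HTML::TreeBuilder:

For each chemical element's page on Wikipedia, I need to extract all hyperlinks that point to other chemical elements' Wikipedia pages and print each unique pair in this format: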

Atomic_Number1 (Chemical Element Title1) -> Atomic_Number2 (Chemical Element Title2)

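The only problem is that every chemical element's page contains a mini periodic table (at the top right of the page). That table links to every other element, so the result comes out the same for every element. I'm having trouble extracting all links from the page except the ones inside that table.

[Note: for ease of debugging, the code only looks at $elem == 6 (carbon); see the if($elem == 6) check in the second loop.]

Here is my code: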

#!/usr/bin/perl

use strict;
use warnings;
use WWW::Mechanize;
use HTML::TreeBuilder;

# autocheck => 1 makes a failed request die with an error message
my $mech = WWW::Mechanize->new( autocheck => 1 );

my $table_url = "http://en.wikipedia.org/wiki/Periodic_table";

# Build the user-agent string by concatenation so that it contains
# no stray slashes or embedded newlines
$mech->agent('Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_5; en-us) '
           . 'AppleWebKit/533.17.8 (KHTML, like Gecko) Version/5.0.1 '
           . 'Safari/533.17.8');

$mech->get($table_url);

my $tree = HTML::TreeBuilder->new_from_content($mech->content);
my %elem_set;

## build a hash of atomic number => [title, link] for each element
foreach my $td ($tree->look_down(_tag => 'td')) {

  # If there's no <a> in this <td>, then skip it:
  my $a = $td->look_down(_tag => 'a') or next;

  my $tdText = $td->as_text;

  # keep cells whose text is a number followed by non-whitespace characters
  if($tdText =~ m/^(\d+)\S+$/){
    next if $1 > 114;  # only investigate up to the 114th element
    $elem_set{$1} = [$a->attr('title'), $a->attr('href')];
  }
}

## On each element's page, look for links to other elements in the set
foreach my $elem (keys %elem_set) {
  if($elem == 6){
    # rebuild the element URL so that only English pages are fetched
    my $elem_url = "http://en.wikipedia.org" . $elem_set{$elem}[1];
    $mech->get($elem_url);

    #####################################################################
    ### need help here to exclude links from that mini periodic table ###
    #####################################################################

    my @target_links = $mech->links();
    for my $link ( @target_links ) {
      if( $link->url =~ m/^\/(wiki)\/.+$/ && $link->text =~ m/^\w+$/ ){
        printf("%s, %s\n", $link->text, $link->url);
      }
    }

  }
}
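
One possible way to fill in the marked section, offered as a rough sketch and not a tested solution for the live pages: parse the page with HTML::TreeBuilder, delete the unwanted table from the tree, and only then collect the remaining <a> elements. The sample HTML, the class name mini-table, and the helper element_pairs below are all made up for illustration; on the real page you would have to inspect the source to find an attribute that identifies the mini periodic table.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical stand-in for an element page: a "mini periodic table"
# plus body text. The class name "mini-table" is an assumption made
# for this sketch, not the markup Wikipedia actually uses.
my $html = <<'HTML';
<html><body>
<table class="mini-table">
  <tr><td><a href="/wiki/Hydrogen" title="Hydrogen">H</a></td>
      <td><a href="/wiki/Helium" title="Helium">He</a></td></tr>
</table>
<p>Carbon bonds with <a href="/wiki/Hydrogen" title="Hydrogen">hydrogen</a>
and <a href="/wiki/Oxygen" title="Oxygen">oxygen</a>; see also
<a href="/wiki/Hydrogen" title="Hydrogen">hydrogen</a> and
<a href="/wiki/Chemistry" title="Chemistry">chemistry</a>.</p>
</body></html>
HTML

# Same shape as %elem_set in the question: atomic number => [title, href]
my %elem_set = (
  1 => ['Hydrogen', '/wiki/Hydrogen'],
  2 => ['Helium',   '/wiki/Helium'],
  6 => ['Carbon',   '/wiki/Carbon'],
  8 => ['Oxygen',   '/wiki/Oxygen'],
);

# Reverse lookup: href => atomic number
my %num_for = map { $elem_set{$_}[1] => $_ } keys %elem_set;

# Hypothetical helper: returns the formatted pairs for one element page
sub element_pairs {
  my ($from, $content) = @_;
  my $tree = HTML::TreeBuilder->new_from_content($content);

  # Remove the unwanted table(s) from the tree before collecting links
  $_->delete for $tree->look_down(_tag => 'table', class => 'mini-table');

  my (%seen, @pairs);
  for my $anchor ($tree->look_down(_tag => 'a')) {
    my $href = $anchor->attr('href') or next;
    my $to   = $num_for{$href};
    next unless defined $to;   # link does not point to a known element
    next if $to == $from;      # skip links back to the same element
    next if $seen{$to}++;      # report each pair only once
    push @pairs, sprintf('%d (%s) -> %d (%s)',
      $from, $elem_set{$from}[0], $to, $elem_set{$to}[0]);
  }
  $tree->delete;               # free the parse tree
  return @pairs;
}

print "$_\n" for element_pairs(6, $html);
```

With the sample HTML above this should print 6 (Carbon) -> 1 (Hydrogen) and 6 (Carbon) -> 8 (Oxygen); the Helium link disappears together with the table. In the original script the same helper could presumably be called as element_pairs($elem, $mech->content) right after $mech->get($elem_url). If the real table carries several class names, passing a regex such as class => qr/\bmini-table\b/ to look_down may be a better match than an exact string.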

