LWP, HTML::TableExtract and Text::CSV output - how to add attributes here?

Posted on 2024-10-19 02:41:33

I have a small parser that scrapes a site with 6150 records, but I need the result in CSV format.

The target site is: http://192.68.214.70/km/asps/schulsuche.asp?q=a&a=50&s=1750

I need all the data, separated into the following fields:

number
schoolnumber
school-name
Address
Street
Postal Code
phone
fax
School-type
website

Here is my script; I would be very interested in what you think of it. It does not capture all the fields yet, and I need more of them.

#!/usr/bin/perl
use strict;
use HTML::TableExtract;
use LWP::Simple;
use Cwd;
use POSIX qw(strftime);

my $total_records = 0;
my $alpha = "x";
my $results = 50;
my $range = 0;
my $url_to_process = "http://192.68.214.70/km/asps/schulsuche.asp?q=";
my $processdir = "processing";
my $counter = 50;
my $percent = 0;

workDir();
chdir $processdir;
processURL();
print "\nPress <enter> to continue\n";
<>;
my $displaydate = strftime('%Y%m%d%H%M%S', localtime);
open my $outfile, '>', "webdata_for_$alpha\_$displaydate.txt" or die 'Unable to create file';
processData();
close $outfile;
print "Finished processing $total_records records...\n";
print "Processed data saved to $ENV{HOME}/$processdir/webdata_for_$alpha\_$displaydate.txt\n";
unlink 'processing.html';

sub processURL() {
print "\nProcessing $url_to_process$alpha&a=$results&s=$range\n";
getstore("$url_to_process$alpha&a=$results&s=$range", 'tempfile.html') or die 'Unable to get page';

   while( <tempfile.html> ) {
      open( FH, "$_" ) or die;
      while( <FH> ) {
         if( $_ =~ /^.*?(Treffer \<b\>)(\d+)( - )(\d+)(<\/b> \w+ \w+ \<b\>)(\d+).*/ ) {
            $total_records = $6;
            print "Total records to process is $total_records\n";
            }
         }
         close FH;
   }
   unlink 'tempfile.html';
}

sub processData() {
   while ( $range <= $total_records) {
      my $te = HTML::TableExtract->new(headers => [qw(lfd Schul Schulname Telefon Schulart Webseite)]);
      getstore("$url_to_process$alpha&a=$results&s=$range", 'processing.html') or die 'Unable to get page';
      $te->parse_file('processing.html');
      my ($table) = $te->tables;
      foreach my $ts ($te->table_states) {
         foreach my $row ($ts->rows) {
            cleanup(@$row);
        # Add a table column delimiter in this case ||
            print $outfile join("||", @$row)."\n";
            }
         }
      $| = 1; 
      print "Processed records $range to $counter";
      print "\r";
      $counter = $counter + 50;
      $range = $range + 50;
   }
}

sub cleanup() {
   for ( @_ ) {
      s/\s+/ /g;
   }
}

sub workDir() {
# Use home directory to process data
chdir or die "$!";
if ( ! -d $processdir ) {
   mkdir ("$ENV{HOME}/$processdir", 0755) or die "Cannot make directory $processdir: $!";
   }
}

It produces the following output:

1||9752||Deutsche Schule Alamogordo  USA  Alamogorde - New Mexico  || ||Deutschsprachige Auslandsschule||
2||9931||Deutsche Schule der Borromäerinnen Alexandrien ET  Alexandrien - Ägypten  || ||Begegnungsschule (Auslandsschuldienst)||
3||1940||Max-Keller-Schule, Berufsfachschule f.Musik Alt- ötting d.Berufsfachschule für Musik Altötting e.V. Kapellplatz 36 84503  Altötting  ||08671/1735 08671/84363||Berufsfachschulen f. Musik|| www.max-keller-schule.de
4||0006||Max-Reger-Gymnasium Amberg  Kaiser-Wilhelm-Ring 7 92224  Amberg  ||09621/4718-0 09621/4718-47||Gymnasien|| www.mrg-amberg.de

Here || is the delimiter.

My problem is that I need more fields. The data should be divided as in this example:

name: Volksschule Abenberg (Grundschule)
street: Güssübelstr. 2
postal-code and town: 91183 Abenberg
fax and telephone: 09178/215 09178/905060
type of school: Volksschulen
website: home.t-online.de/home/vs-abenberg

How can I add more fields? Presumably this has to be done in this line:

 my $te = HTML::TableExtract->new(headers => [qw(lfd Schul Schulname Telefon Schulart Webseite)]);

But how? I tried several things and always got bad results.
I then tried another approach, which gives me good CSV data but unfortunately has no spider logic:

#!/usr/bin/perl
use warnings;
use strict;
use LWP::Simple;
use HTML::TableExtract;
use Text::CSV;

my $html= get 'http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=50';
$html =~ tr/\r//d;     # strip the carriage returns
$html =~ s/&nbsp;/ /g; # expand the spaces

my $te = HTML::TableExtract->new();
$te->parse($html);

my @cols = qw(
    rownum
    number
    name
    phone
    type
    website
);

my @fields = qw(
    rownum
    number
    name
    street
    postal
    town
    phone
    fax
    type
    website
);

my $csv = Text::CSV->new({ binary => 1 });

foreach my $ts ($te->table_states) {
    foreach my $row ($ts->rows) {

        #  trim leading/trailing whitespace from base fields
        s/^\s+//, s/\s+$// for @$row;

        # load the fields into the hash using a "hash slice"
        my %h;
        @h{@cols} = @$row;

        # derive some fields from base fields, again using a hash slice
        @h{qw/name street postal town/} = split /\n+/, $h{name};
        @h{qw/phone fax/} = split /\n+/, $h{phone};

        #  trim leading/trailing whitespace from derived fields
        s/^\s+//, s/\s+$// for @h{qw/name street postal town/};

        $csv->combine(@h{@fields});
        print $csv->string, "\n";
    }
} 

So the second script gives me good CSV data, but it has no spider logic to walk through the result pages.

How can I add the spider logic here?

I need some help with either the first or the second script.

Comments (1)

深白境迁sunset 2024-10-26 02:41:33

The website uses br tags to separate the sub-fields within each cell, much the way you want to divide the data. In your first program, HTML::TableExtract turns these into newlines by default, but your cleanup routine throws this information away.

In your first program, add something like s/\n/||/sg; (assuming the same separator) before you flatten the rest of the whitespace.
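
As a rough sketch of that suggestion, the cleanup routine from the first program could look something like this. The sample row is made up to resemble what $ts->rows might return for this site (sub-fields separated by newlines), and the substitution is a slight variation on s/\n/||/sg that also swallows whitespace around each newline; neither has been checked against the live page.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Turn the newlines left behind by <br> tags into the output delimiter
# BEFORE collapsing the remaining whitespace, so the sub-fields survive.
sub cleanup {
    for (@_) {
        s/^\s+|\s+$//g;     # trim the whole cell
        s/\s*\n\s*/||/g;    # one delimiter per run of newlines
        s/\s+/ /g;          # flatten whatever whitespace is left
    }
}

# Hypothetical row (assumption): cells as HTML::TableExtract might
# return them, with "\n" where the page had <br> tags.
my @row = (
    "4",
    "0006",
    "Max-Reger-Gymnasium Amberg\nKaiser-Wilhelm-Ring 7\n92224 Amberg",
    "09621/4718-0\n09621/4718-47",
    "Gymnasien",
    "www.mrg-amberg.de",
);

cleanup(@row);                  # modifies @row in place via @_ aliasing
my $line = join("||", @row);
print "$line\n";
```

Note that cells with a missing sub-field (for example no fax number) would produce fewer columns, so the rows may not line up without extra handling.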
