如何将 CSV 格式添加到 LWP::Simple

发布于 2024-10-18 07:31:02 字数 7112 浏览 1 评论 0原文

关于此 Target-Url 德国学校数据库 http://192.68.214.70/km/asps/schulsuche.asp?q=a&a=100&s=1750

这个目标网址有一个网格-(或一个表格)-我有一个代码运行良好 - 但它无法处理网格...

它会吐出如下所示的数据...

lfd. Nr. Schul- nummer Schulname Straße PLZ Ort Telefon Fax Schulart Webseite

1 0401 Mädchenrealschule Marienburg, Abenberg, der Diözese Eichstätt Marienburg 1 91183  Abenberg   09178/509210  Realschulen  mrs-marienburg.homepage.t-online.de 
2 6581 Volksschule Abenberg (Grundschule) Güssübelstr. 2 91183  Abenberg   09178/215 09178/905060 Volksschulen  home.t-online.de/home/vs-abenberg 
3 6913 Mittelschule Abenberg  Güssübelstr. 2 91183  Abenberg   09178/215 09178/905060 Volksschulen  home.t-online.de/home/vs-abenberg 
4 0402 Johann-Turmair-Realschule Staatliche Realschule Abensberg Stadionstraße 46 93326  Abensberg   09443/9143-0,12,13 09443/914330 Realschulen  www.rs-abensberg.de 

这很糟糕 - 我需要有一个分隔符 - 我怎样才能在结果中添加一些分隔符 - 或者 逗号或制表符...?

这里是代码:

  #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::TableExtract;
    use LWP::Simple;
    use Cwd;
    use POSIX qw(strftime);
    my $te = HTML::TableExtract->new;
    my $total_records = 0;
    my $suchbegriffe = "e";
    my $treffer = 50;
    my $range = 0;
    my $url_to_process = "http://192.68.214.70/km/asps/schulsuche.asp?q=";
    my $processdir = "processing";
    my $counter = 50;
    my $displaydate = "";
    my $percent = 0;

    &workDir();
    chdir $processdir;
    &processURL();
    print "\nPress <enter> to continue\n";
    <>;
    $displaydate = strftime('%Y%m%d%H%M%S', localtime);
    open OUTFILE, ">webdata_for_$suchbegriffe\_$displaydate.txt";
    &processData();
    close OUTFILE;
    print "Finished processing $total_records records...\n";
    print "Processed data saved to $ENV{HOME}/$processdir/webdata_for_$suchbegriffe\_$displaydate.txt\n";
    unlink 'processing.html';
    die "\n";

    sub processURL() {
    print "\nProcessing $url_to_process$suchbegriffe&a=$treffer&s=$range\n";
    getstore("$url_to_process$suchbegriffe&a=$treffer&s=$range", 'tempfile.html') or die 'Unable to get page';

       while( <tempfile.html> ) {
          open( FH, "$_" ) or die;
          while( <FH> ) {
             if( $_ =~ /^.*?(Treffer <b>)(d+)( - )(d+)(</b> w+ w+ <b>)(d+).*/ ) {
                $total_records = $6;
                print "Total records to process is $total_records\n";
                }
             }
             close FH;
       }
       unlink 'tempfile.html';
    }

    sub processData() {
       while ( $range <= $total_records) {
          getstore("$url_to_process$suchbegriffe&a=$treffer&s=$range", 'processing.html') or die 'Unable to get page';
          $te->parse_file('processing.html');
          my ($table) = $te->tables;
          for my $row ( $table->rows ) {
             cleanup(@$row);
             print OUTFILE "@$row\n";
          }
          $| = 1; 
          print "Processed records $range to $counter";
          print "\r";
          $counter = $counter + 50;
          $range = $range + 50;
          $te = HTML::TableExtract->new;
       }
    }

    sub cleanup() {
       for ( @_ ) {
          s/s+/ /g;
       }
    }

    sub workDir() {
    # Use home directory to process data
    chdir or die "$!";
    if ( ! -d $processdir ) {
       mkdir ("$ENV{HOME}/$processdir", 0755) or die "Cannot make directory $processdir: $!";
       }
    }  

顺便说一句 - 请参阅分隔符的一个示例...:

my @cols = qw(
    rownum
    number
    name
    phone
    type
    website
);

my @fields = qw(
    rownum
    number
    name
    street
    postal
    town
    phone
    fax
    type
    website
);

注意:这来自一位用户,他在该线程上给了我一些很好的提示。看 HTML::TableExtract:如何运行正确的参数 [查看实例]

嗯 - 我想将这些想法迁移到上面提到的蜘蛛和解析器中。这可能吗..您能给我一些如何添加一些代码行的提示 - 以便结果以 csv-formate 格式输出...!?

任何和所有的帮助将不胜感激..

更新:根据muu的想法太短了...:

  #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::TableExtract;
    use LWP::Simple;
    use Text::CSV
    use Cwd;
    use POSIX qw(strftime);
    my $te = HTML::TableExtract->new;
    my $total_records = 0;
    my $suchbegriffe = "e";
    my $treffer = 50;
    my $range = 0;
    my $url_to_process = "http://192.68.214.70/km/asps/schulsuche.asp?q=";
    my $processdir = "processing";
    my $counter = 50;
    my $displaydate = "";
    my $percent = 0;

    &workDir();
    chdir $processdir;
    &processURL();
    print "\nPress <enter> to continue\n";
    <>;
    $displaydate = strftime('%Y%m%d%H%M%S', localtime);
    open OUTFILE, ">webdata_for_$suchbegriffe\_$displaydate.txt";
    &processData();
    close OUTFILE;
    print "Finished processing $total_records records...\n";
    print "Processed data saved to $ENV{HOME}/$processdir/webdata_for_$suchbegriffe\_$displaydate.txt\n";
    unlink 'processing.html';
    die "\n";

    sub processURL() {
    print "\nProcessing $url_to_process$suchbegriffe&a=$treffer&s=$range\n";
    getstore("$url_to_process$suchbegriffe&a=$treffer&s=$range", 'tempfile.html') or die 'Unable to get page';

       while( <tempfile.html> ) {
          open( FH, "$_" ) or die;
          while( <FH> ) {
             if( $_ =~ /^.*?(Treffer <b>)(d+)( - )(d+)(</b> w+ w+ <b>)(d+).*/ ) {
                $total_records = $6;
                print "Total records to process is $total_records\n";
                }
             }
             close FH;
       }
       unlink 'tempfile.html';
    }

    sub processData() {
       while ( $range <= $total_records) {
          getstore("$url_to_process$suchbegriffe&a=$treffer&s=$range", 'processing.html') or die 'Unable to get page';
          $te->parse_file('processing.html');
          my ($table) = $te->tables;
          for my $row ( $table->rows ) {
             cleanup(@$row);

            $csv->combine(@$row);
      print OUTFILE $csv->string();

          }
          $| = 1; 
          print "Processed records $range to $counter";
          print "\r";
          $counter = $counter + 50;
          $range = $range + 50;
          $te = HTML::TableExtract->new;
       }
    }


sub cleanup {
   for ( @_ ) {
      s/\s+/ /g;
   }
}
    sub workDir() {
    # Use home directory to process data
    chdir or die "$!";
    if ( ! -d $processdir ) {
       mkdir ("$ENV{HOME}/$processdir", 0755) or die "Cannot make directory $processdir: $!";
       }
    }  

更新:我做了一些试验:

martin@suse-linux:~/perl>珀尔 perl_bayern_newstack.pl 不匹配(在 正则表达式;标记为 <-- HERE 在 m/^.*?(特雷弗)(d+)( - )(d+)( <-- 这里

什么问题......?

Regarding this Target-Url a germany schoolary-database http://192.68.214.70/km/asps/schulsuche.asp?q=a&a=100&s=1750

this target url has got a grid- (or a table) - I have a code that runs nicely - but it cannot handle the grid...

it spits out the data like the following ...

lfd. Nr. Schul- nummer Schulname Straße PLZ Ort Telefon Fax Schulart Webseite

1 0401 Mädchenrealschule Marienburg, Abenberg, der Diözese Eichstätt Marienburg 1 91183  Abenberg   09178/509210  Realschulen  mrs-marienburg.homepage.t-online.de 
2 6581 Volksschule Abenberg (Grundschule) Güssübelstr. 2 91183  Abenberg   09178/215 09178/905060 Volksschulen  home.t-online.de/home/vs-abenberg 
3 6913 Mittelschule Abenberg  Güssübelstr. 2 91183  Abenberg   09178/215 09178/905060 Volksschulen  home.t-online.de/home/vs-abenberg 
4 0402 Johann-Turmair-Realschule Staatliche Realschule Abensberg Stadionstraße 46 93326  Abensberg   09443/9143-0,12,13 09443/914330 Realschulen  www.rs-abensberg.de 

That is bad - I need to have a separator - how can I get some separators into the results - either
commas or tabs.... ?

Here the code:

  #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::TableExtract;
    use LWP::Simple;
    use Cwd;
    use POSIX qw(strftime);
    my $te = HTML::TableExtract->new;
    my $total_records = 0;
    my $suchbegriffe = "e";
    my $treffer = 50;
    my $range = 0;
    my $url_to_process = "http://192.68.214.70/km/asps/schulsuche.asp?q=";
    my $processdir = "processing";
    my $counter = 50;
    my $displaydate = "";
    my $percent = 0;

    &workDir();
    chdir $processdir;
    &processURL();
    print "\nPress <enter> to continue\n";
    <>;
    $displaydate = strftime('%Y%m%d%H%M%S', localtime);
    open OUTFILE, ">webdata_for_$suchbegriffe\_$displaydate.txt";
    &processData();
    close OUTFILE;
    print "Finished processing $total_records records...\n";
    print "Processed data saved to $ENV{HOME}/$processdir/webdata_for_$suchbegriffe\_$displaydate.txt\n";
    unlink 'processing.html';
    die "\n";

    sub processURL() {
    print "\nProcessing $url_to_process$suchbegriffe&a=$treffer&s=$range\n";
    getstore("$url_to_process$suchbegriffe&a=$treffer&s=$range", 'tempfile.html') or die 'Unable to get page';

       while( <tempfile.html> ) {
          open( FH, "$_" ) or die;
          while( <FH> ) {
             if( $_ =~ /^.*?(Treffer <b>)(d+)( - )(d+)(</b> w+ w+ <b>)(d+).*/ ) {
                $total_records = $6;
                print "Total records to process is $total_records\n";
                }
             }
             close FH;
       }
       unlink 'tempfile.html';
    }

    sub processData() {
       while ( $range <= $total_records) {
          getstore("$url_to_process$suchbegriffe&a=$treffer&s=$range", 'processing.html') or die 'Unable to get page';
          $te->parse_file('processing.html');
          my ($table) = $te->tables;
          for my $row ( $table->rows ) {
             cleanup(@$row);
             print OUTFILE "@$row\n";
          }
          $| = 1; 
          print "Processed records $range to $counter";
          print "\r";
          $counter = $counter + 50;
          $range = $range + 50;
          $te = HTML::TableExtract->new;
       }
    }

    sub cleanup() {
       for ( @_ ) {
          s/s+/ /g;
       }
    }

    sub workDir() {
    # Use home directory to process data
    chdir or die "$!";
    if ( ! -d $processdir ) {
       mkdir ("$ENV{HOME}/$processdir", 0755) or die "Cannot make directory $processdir: $!";
       }
    }  

btw - see one example of the separators... :

my @cols = qw(
    rownum
    number
    name
    phone
    type
    website
);

my @fields = qw(
    rownum
    number
    name
    street
    postal
    town
    phone
    fax
    type
    website
);

note: this steems from a user that gave me some great hints on this thread. See
HTML::TableExtract: how to run the right argument [see live example]

Well - i want to migrate the ideas into the above mentioned spider&parser. Is this possible..Can you give me some hints how to add some lines of code - so that the results are spit out in csv-formate...!?

any and all help will be greatly appreciated..

zero

Update: according the ideas of muu is too short...:

  #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::TableExtract;
    use LWP::Simple;
    use Text::CSV
    use Cwd;
    use POSIX qw(strftime);
    my $te = HTML::TableExtract->new;
    my $total_records = 0;
    my $suchbegriffe = "e";
    my $treffer = 50;
    my $range = 0;
    my $url_to_process = "http://192.68.214.70/km/asps/schulsuche.asp?q=";
    my $processdir = "processing";
    my $counter = 50;
    my $displaydate = "";
    my $percent = 0;

    &workDir();
    chdir $processdir;
    &processURL();
    print "\nPress <enter> to continue\n";
    <>;
    $displaydate = strftime('%Y%m%d%H%M%S', localtime);
    open OUTFILE, ">webdata_for_$suchbegriffe\_$displaydate.txt";
    &processData();
    close OUTFILE;
    print "Finished processing $total_records records...\n";
    print "Processed data saved to $ENV{HOME}/$processdir/webdata_for_$suchbegriffe\_$displaydate.txt\n";
    unlink 'processing.html';
    die "\n";

    sub processURL() {
    print "\nProcessing $url_to_process$suchbegriffe&a=$treffer&s=$range\n";
    getstore("$url_to_process$suchbegriffe&a=$treffer&s=$range", 'tempfile.html') or die 'Unable to get page';

       while( <tempfile.html> ) {
          open( FH, "$_" ) or die;
          while( <FH> ) {
             if( $_ =~ /^.*?(Treffer <b>)(d+)( - )(d+)(</b> w+ w+ <b>)(d+).*/ ) {
                $total_records = $6;
                print "Total records to process is $total_records\n";
                }
             }
             close FH;
       }
       unlink 'tempfile.html';
    }

    sub processData() {
       while ( $range <= $total_records) {
          getstore("$url_to_process$suchbegriffe&a=$treffer&s=$range", 'processing.html') or die 'Unable to get page';
          $te->parse_file('processing.html');
          my ($table) = $te->tables;
          for my $row ( $table->rows ) {
             cleanup(@$row);

            $csv->combine(@$row);
      print OUTFILE $csv->string();

          }
          $| = 1; 
          print "Processed records $range to $counter";
          print "\r";
          $counter = $counter + 50;
          $range = $range + 50;
          $te = HTML::TableExtract->new;
       }
    }


sub cleanup {
   for ( @_ ) {
      s/\s+/ /g;
   }
}
    sub workDir() {
    # Use home directory to process data
    chdir or die "$!";
    if ( ! -d $processdir ) {
       mkdir ("$ENV{HOME}/$processdir", 0755) or die "Cannot make directory $processdir: $!";
       }
    }  

Update: I did some trials:

martin@suse-linux:~/perl> perl
perl_bayern_newstack.pl Unmatched ( in
regex; marked by <-- HERE in
m/^.*?(Treffer )(d+)( - )(d+)( <--
HERE

what is wrong here ...?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

七婞 2024-10-25 07:31:02

您应该使用 Text::CSV 来构建输出,这将为您省去麻烦尝试手动处理嵌入的引号和分隔符。因此,使用您所需的选项创建一个 Text::CSV 实例,并将其替换为:

print OUTFILE "@$row\n";

其中

$csv->combine(@$row);
print OUTFILE $csv->string();

$csv 是您的 Text::CSV 实例。

另外,您的 cleanup 函数有点损坏,它应该如下所示:

sub cleanup {
   for ( @_ ) {
      s/\s+/ /g;
   }
}

缺少的 \ 可能只是您粘贴代码时的拼写错误,但也可能不是。

并且不要使用原型,它们不会做你可能认为的事情。你说的地方:

sub cleanup() {

只是说

sub cleanup {

不要调用带有前导符号的函数(即不要 &workDir();,只需 workDir(); 即可)除非你知道它的副作用是什么。

You should use Text::CSV to build your output, that will save you the hassle of trying to handle embedded quotes and separators by hand. So, create an instance of Text::CSV with your desired options and replace this:

print OUTFILE "@$row\n";

with

$csv->combine(@$row);
print OUTFILE $csv->string();

Where $csv is your Text::CSV instance.

Also, your cleanup function is a bit broken, it should look like this:

sub cleanup {
   for ( @_ ) {
      s/\s+/ /g;
   }
}

The missing \ might be just a typo when you pasted your code in but maybe not.

And don't use prototypes, they don't do what you probably think they do. Where you say:

sub cleanup() {

just say

sub cleanup {

And don't call functions with the leading sigil (i.e. don't &workDir();, just workDir(); will do) unless you know what its side effects are.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文