LWP::Simple works fine: how to store 6000+ records in a file and do some cleanup?

Posted 2024-10-19 15:08:09

Good evening dear community!

I want to process multiple webpages, kind of like a web spider/crawler might. I have some bits, but now I need some improved spider logic. See the target URL http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50

Update:

Thanks to two great comments I have gained a lot, and the code now runs very nicely. Last question: how do I store the data in a file, i.e. how do I force the parser to write the results to a file? That is much more convenient than receiving more than 6000 records on the command line. And once the output is in a file, I need to do some final cleanup. See the output below: if we compare it with the target URL, it clearly still needs some cleanup, don't you think?! Again, see the target URL http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50

6114,7754,"Volksschule Zeil a.Mai",/Sa,"d a.Mai",(Gru,"09524/94992 09524/94997",,Volksschulen,
6115,7757,"Mittelschule Zeil - Sa","d a.Mai",Schulri,"g 
97475       Zeil","09524/94995
09524/94997",,Volksschulen,"      www.hauptschule-zeil-sand.de"
6116,3890,"Volksschule Zeilar",(Gru,"dschule)
Bgm.-Stallbauer-Str. 8
84367       Zeilar",,"08572/439
08572/920001",,Volksschulen,"      www.gs-zeilarn.de"
6117,4664,"Volksschule Zeitlar",(Gru,"dschule)
Schulstraße 5
93197       Zeitlar",,"0941/63528
0941/68945",,Volksschulen,"      www.vs-zeitlarn.de"
6118,4818,"Mittelschule Zeitlar","Schulstraße 5
93197       Zeitlar",,,"0941/63528
0941/68945",,Volksschulen,"      www.vs-zeitlarn.de"
6119,7684,"Volksschule Zeitlofs (Gru","dschule)
Raiffeise","Str. 36
97799       Zeitlofs",,"09746/347
09746/347",,Volksschulen,"      grundschule-zeitlofs.de"

Thanks for any and all info! zero!

Here is the old question: it seems to work fine as part of a one-shot function, but as soon as I include the function as part of a loop, it doesn't return anything... What's the deal?

To begin at the beginning: see the target http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50. This page has more than 6000 results! So how do I get all of them? I use the module LWP::Simple, and I need some improved arguments to fetch all 6150 records... I have code that stems from the very supportive member tadmic (see this forum), and it basically runs very nicely. But after adding some lines, it (at the moment) spits out some errors.

Attempt: here are the first 5 page URLs:

http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=0 
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=50 
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=100 
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=150 
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=200

We can see that the "s" attribute in the URL starts at 0 for page 1, then increases by 50 for each subsequent page. We can use this information to create a loop:

#!/usr/bin/perl  
use warnings;  
use strict;  
use LWP::Simple;  
use HTML::TableExtract;  
use Text::CSV;  

my @cols = qw(  
    rownum  
    number  
    name  
    phone  
    type  
    website  
);  

my @fields = qw(  
    rownum  
    number  
    name  
    street  
    postal  
    town  
    phone  
    fax  
    type  
    website  
);  

my $i_first = "0";   
my $i_last = "6100";   
my $i_interval = "50";   

for (my $i = $i_first; $i <= $i_last; $i += $i_interval) {   
    my $html = get("http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=$i");   
    $html =~ tr/r//d;     # strip the carriage returns  
    $html =~ s/&nbsp;/ /g; # expand the spaces  

    my $te = new HTML::TableExtract();  
    $te->parse($html);  

    my $csv = Text::CSV->new({ binary => 1 });  

    foreach my $ts ($te->table_states) {  
        foreach my $row ($ts->rows) {  
            #trim leading/trailing whitespace from base fields  
            s/^s+//, s/\s+$// for @$row;  

            #load the fields into the hash using a "hash slice"  
            my %h;  
            @h{@cols} = @$row;  

            #derive some fields from base fields, again using a hash slice  
            @h{qw/name street postal town/} = split /n+/, $h{name};  
            @h{qw/phone fax/} = split /n+/, $h{phone};  

            #trim leading/trailing whitespace from derived fields  
            s/^s+//, s/\s+$// for @h{qw/name street postal town/};  

            $csv->combine(@h{@fields});  
            print $csv->string, "\n";  
        }  
    } 
}

I tested the code and got the following results:

By the way, here are lines 57 and 58; the command line tells me I have errors here:

    #trim leading/trailing whitespace from derived fields  
        s/^s+//, s/\s+$// for @h{qw/name street postal town/};  

What do you think? Are some backslashes missing!? How can I fix and test-run the code so that the results come out correct!?

Looking forward to hearing from you,
zero

See the errors that I get:

    Ot",,,Telefo,Fax,Schulat,Webseite                                                          Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.                                                                                        Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.                                                                                        Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.                                                                                        Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.                                                                                        "lfd. N.",Schul-numme,Schul,"ame                                                                           
    Sta�e
    PLZ 
    Ot",,,Telefo,Fax,Schulat,Webseite
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
"lfd. N.",Schul-numme,Schul,"ame
    Sta�e
    PLZ 
    Ot",,,Telefo,Fax,Schulat,Webseite
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
"lfd. N.",Schul-numme,Schul,"ame



Comments (3)

柠檬心 2024-10-26 15:08:09

This line won't remove carriage returns, as its comment says it should:

$html =~ tr/r//d;     # strip the carriage returns  

You will need:

$html =~ tr/\r//d;     # strip the carriage returns

And maybe even:

$html =~ tr/\r\n//d;     # strip the carriage returns  
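
To make the difference concrete, here is a tiny self-contained demonstration; the sample string is made up, but it matches the mangled "Schul-numme" header visible in the error output above:

#!/usr/bin/perl
use strict;
use warnings;

my $s = "Schul-nummer\r\n";      # made-up sample ending in CRLF

(my $wrong = $s) =~ tr/r//d;     # deletes the letter "r", giving "Schul-numme\r\n"
(my $right = $s) =~ tr/\r//d;    # deletes the carriage return, giving "Schul-nummer\n"

print $wrong;                    # "Schul-numme" plus the CR that survived
print $right;                    # "Schul-nummer"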

明媚如初 2024-10-26 15:08:09

These warnings appear whenever $_ is undef and a substitution involving it occurs; the s/// construct implicitly works on $_. The solution is to check that the value is defined before attempting a substitution.
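
For context, here is a minimal illustration of why $_ ends up undefined here (using a made-up cell value and the intended /\n+/ pattern): the table's header rows do not contain the newline-separated name/street/postal/town block, so split returns fewer values than the four-key hash slice expects, leaving the remaining slots undef:

#!/usr/bin/perl
use strict;
use warnings;

my %h;
# a header row's "name" cell contains no embedded newlines (made-up value):
@h{qw/name street postal town/} = split /\n+/, "Schulname";

# $h{name} is "Schulname"; street, postal and town are all undef, which is
# what triggers "Use of uninitialized value $_ in substitution (s///)"
for my $k (qw/name street postal town/) {
    printf "%-6s => %s\n", $k, defined $h{$k} ? $h{$k} : 'undef';
}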

Apart from that, and though it is not related to the warnings, you have a logical error in your regular expression:

s/^s+//, s/\s+$// for @h{qw/name street postal town/};

Note the missing \ in the first construct.

Removing the error and simplifying:

defined and s{^ \s+ | \s+ $}{}gx for @h{qw/name street postal town/};

In order to output to a file, add the following before the for loop:

open my $fh, '>', '/path/to/output/file' or die $!;

Replace:

print $csv->string, "\n";

with:

print $fh $csv->string, "\n";

That is a syntactic change from print LIST to print FILEHANDLE LIST.
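
Putting the pieces together, here is a minimal, self-contained sketch of the pattern (the file name schulen.csv is made up; any writable path works):

#!/usr/bin/perl
use strict;
use warnings;

# open the output file once, before the loop
open my $fh, '>', 'schulen.csv' or die "Cannot open schulen.csv: $!";

# stand-in for the per-row CSV strings produced inside the paging loop
for my $line ('row 1', 'row 2') {
    print $fh $line, "\n";    # print FILEHANDLE LIST: note no comma after $fh
}

close $fh or die "Cannot close schulen.csv: $!";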



终遇你 2024-10-26 15:08:09

If you're trying to extract links from the pages, use WWW::Mechanize, which is a wrapper around LWP that properly parses the HTML to get the links for you, along with a zillion other conveniences for people scraping web pages.
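
For example, a minimal sketch of that approach (assuming you just want every link from the first results page) could look like this:

#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
$mech->get('http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50');

# links() returns WWW::Mechanize::Link objects for every link on the page
for my $link ($mech->links) {
    print $link->url_abs, "\n";    # absolute URL of each link
}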

