LWP::Simple runs very well: how to store 6000+ records in a file and do some cleanup?
Good evening, dear community!
I want to process multiple web pages, much like a web spider/crawler might. I have some of the pieces, but now I need some improved spider logic. See the target URL http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50
Update:
Thanks to two great comments I have learned a lot. Now the code runs very nicely.
Last question: how do I store the data in a file? How do I make the parser write the results to a file? That is much more convenient than getting more than 6000 records on the command line.
And once the output is in a file, I need to do some final cleanup. See the output:
If we compare the output with the target URL, it clearly needs some cleanup. What do you think? Again, see the target URL http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50
6114,7754,"Volksschule Zeil a.Mai",/Sa,"d a.Mai",(Gru,"09524/94992 09524/94997",,Volksschulen,
6115,7757,"Mittelschule Zeil - Sa","d a.Mai",Schulri,"g
97475 Zeil","09524/94995
09524/94997",,Volksschulen," www.hauptschule-zeil-sand.de"
6116,3890,"Volksschule Zeilar",(Gru,"dschule)
Bgm.-Stallbauer-Str. 8
84367 Zeilar",,"08572/439
08572/920001",,Volksschulen," www.gs-zeilarn.de"
6117,4664,"Volksschule Zeitlar",(Gru,"dschule)
Schulstra�e 5
93197 Zeitlar",,"0941/63528
0941/68945",,Volksschulen," www.vs-zeitlarn.de"
6118,4818,"Mittelschule Zeitlar","Schulstra�e 5
93197 Zeitlar",,,"0941/63528
0941/68945",,Volksschulen," www.vs-zeitlarn.de"
6119,7684,"Volksschule Zeitlofs (Gru","dschule)
Raiffeise","Str. 36
97799 Zeitlofs",,"09746/347
09746/347",,Volksschulen," grundschule-zeitlofs.de"
Thanks for any and all info!
zero!
Here is the old question: it seems to work fine as part of a one-shot function, but as soon as I include the function in a loop, it doesn't return anything. What's the deal?
To start at the beginning: see the target http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50
This page has more than 6000 results! How do I get all of them? I use the module LWP::Simple and I need some improved arguments that I can use to get all 6150 records. I have code that stems from the very supportive member tadmic (see this forum), and it basically runs very nicely. But after adding some lines it (at the moment) spits out some errors.
Attempt: here are the first 5 page URLs:
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=0
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=50
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=100
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=150
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=200
We can see that the "s" parameter in the URL starts at 0 for page 1, then increases by 50 for each page thereafter. We can use this information to create a loop:
#!/usr/bin/perl
use warnings;
use strict;
use LWP::Simple;
use HTML::TableExtract;
use Text::CSV;
my @cols = qw(
rownum
number
name
phone
type
website
);
my @fields = qw(
rownum
number
name
street
postal
town
phone
fax
type
website
);
my $i_first = "0";
my $i_last = "6100";
my $i_interval = "50";
for (my $i = $i_first; $i <= $i_last; $i += $i_interval) {
    my $html = get("http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=$i");
    $html =~ tr/r//d; # strip the carriage returns
    $html =~ s/&nbsp;/ /g; # expand the spaces
    my $te = new HTML::TableExtract();
    $te->parse($html);
    my $csv = Text::CSV->new({ binary => 1 });
    foreach my $ts ($te->table_states) {
        foreach my $row ($ts->rows) {
            #trim leading/trailing whitespace from base fields
            s/^s+//, s/\s+$// for @$row;
            #load the fields into the hash using a "hash slice"
            my %h;
            @h{@cols} = @$row;
            #derive some fields from base fields, again using a hash slice
            @h{qw/name street postal town/} = split /n+/, $h{name};
            @h{qw/phone fax/} = split /n+/, $h{phone};
            #trim leading/trailing whitespace from derived fields
            s/^s+//, s/\s+$// for @h{qw/name street postal town/};
            $csv->combine(@h{@fields});
            print $csv->string, "\n";
        }
    }
}
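As a side note, the paging scheme described above can be checked on its own, without any network access. The following is only a minimal sketch: page_urls() is a hypothetical helper that is not part of the original script, and the base URL, offsets, and interval are simply the values already used in the loop.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Base URL and paging values are the ones used in the question's loop.
my $base     = 'http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50';
my $first    = 0;
my $last     = 6100;
my $interval = 50;

# page_urls() is a hypothetical helper: it only builds the list of page
# URLs and performs no network access.
sub page_urls {
    my ($base, $first, $last, $interval) = @_;
    my @urls;
    for (my $s = $first; $s <= $last; $s += $interval) {
        push @urls, "$base&s=$s";
    }
    return @urls;
}

my @urls = page_urls($base, $first, $last, $interval);
printf "%d pages, first: %s\n", scalar @urls, $urls[0];
print "last: $urls[-1]\n";
```

Building the list first makes it easy to confirm how many pages the loop will request before any of them is fetched.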
I tested the code and got the following results:
By the way, here are lines 57 and 58; the command line tells me that I have errors here:
#trim leading/trailing whitespace from derived fields
s/^s+//, s/\s+$// for @h{qw/name street postal town/};
What do you think? Are there some backslashes missing?
How do I fix and test-run the code so that the results are correct?
I look forward to hearing from you.
zero
See the errors that I get:
Ot",,,Telefo,Fax,Schulat,Webseite
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
"lfd. N.",Schul-numme,Schul,"ame
Sta�e
PLZ
Ot",,,Telefo,Fax,Schulat,Webseite
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
"lfd. N.",Schul-numme,Schul,"ame
Sta�e
PLZ
Ot",,,Telefo,Fax,Schulat,Webseite
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
"lfd. N.",Schul-numme,Schul,"ame
Comments (3)
This line won't remove carriage returns as you say:
$html =~ tr/r//d; # strip the carriage returns
You will need:
$html =~ tr/\r//d;
And maybe even:
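To see the difference between the two forms, here is a minimal sketch. The sample string is made up for illustration, and the "Strasse" spelling is used only to keep the example ASCII. Without the backslash, tr deletes the letter "r" itself, which would be consistent with fragments such as "Schul-numme" and "Schulat" in the posted output.

```perl
use strict;
use warnings;

# Sample input with Windows-style line endings (made up for illustration).
my $html = "Volksschule Zeitlarn\r\nSchulstrasse 5\r\n";

# Without the backslash, tr/r//d deletes every lowercase letter "r".
(my $wrong = $html) =~ tr/r//d;

# With the backslash, tr/\r//d deletes only carriage returns.
(my $right = $html) =~ tr/\r//d;

# Count what is left in each result.
my $cr_left_wrong = () = $wrong =~ /\r/g;
my $r_left_wrong  = () = $wrong =~ /r/g;
my $cr_left_right = () = $right =~ /\r/g;

print "tr/r//d  : $cr_left_wrong carriage returns left, $r_left_wrong letters r left\n";
print "tr/\\r//d : $cr_left_right carriage returns left\n";
```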
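Whenever $_ is undef and a substitution involving it occurs, these warnings are emitted. The s/// construct implicitly works on $_. The solution is to check defined before trying a substitution.
Apart from that, though not related to the warnings, you have a logical error in your regular expression:
s/^s+//, s/\s+$// for @h{qw/name street postal town/};
Note the missing \ in the first construct: s/^s+// strips a leading run of the letter "s", whereas s/^\s+// strips leading whitespace. The fix is to restore the backslash; the statement can then be simplified as well.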
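One way this could look is sketched below. It is not the answerer's original snippet: trim_defined() is a hypothetical helper and the sample cell is made up. The sketch also uses /\n+/ in the split, because the question's split /n+/ has the same missing backslash and splits on the letter "n", which would explain fragments such as "Gru","dschule" in the posted output.

```perl
use strict;
use warnings;

# trim_defined() is a hypothetical helper: it trims leading/trailing
# whitespace in place and skips undefined elements, so s/// never runs
# on undef and no "uninitialized value" warning is emitted.
sub trim_defined {
    for (@_) {
        next unless defined;
        s/^\s+//;
        s/\s+$//;
    }
    return;
}

# A made-up cell: name, street and "postal town" separated by newlines.
my %h = (name => "  Volksschule Zeitlarn\n Schulstrasse 5 \n93197 Zeitlarn  ");

# Split on /\n+/ (newlines), not /n+/ (the letter "n").
@h{qw/name street postal town/} = split /\n+/, $h{name};

# town is undef here because the split produced only three pieces.
trim_defined(@h{qw/name street postal town/});

print join('|', map { defined $_ ? $_ : '(undef)' } @h{qw/name street postal town/}), "\n";
```

Passing the hash slice to the helper works because the elements of @_ are aliases to the hash values, so the substitutions modify %h directly.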
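In order to output to a file, open a filehandle for writing before the for loop. Then replace:
print $csv->string, "\n";
with a print to that filehandle. That is a syntactic change from print LIST to print FILEHANDLE LIST.
Refer to the Perl documentation for defined, open, and print.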
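A minimal sketch of the file output, under some assumptions: the rows are made-up stand-ins for what $csv->string would return, the name $out is arbitrary, and a temporary file is used only so the example leaves nothing behind. In the real script you would open a path of your choosing.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Made-up rows standing in for the lines produced by $csv->string.
my @lines = (
    '6117,4664,"Volksschule Zeitlarn"',
    '6118,4818,"Mittelschule Zeitlarn"',
);

# Temporary path for illustration only; removed automatically at exit.
my (undef, $path) = tempfile(UNLINK => 1);

# Open the output file once, before the loop (three-argument open).
open(my $out, '>', $path) or die "Cannot open $path: $!";

for my $line (@lines) {
    # print FILEHANDLE LIST: there is no comma after the filehandle.
    print {$out} $line, "\n";
}

# Closing flushes the buffer; check the result to catch write errors.
close($out) or die "Cannot close $path: $!";
```

Opening the file once outside the loop avoids truncating it on every page, which would happen if the '>' open were placed inside the loop.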
If you're trying to extract links from the pages, use WWW::Mechanize, which is a wrapper around LWP and properly parses the HTML to get the links for you, as well as a zillion other convenience things for people scraping web pages.