如何使用 Text::CSV 使 LWP 和 HTML::TableExtract 吐出 CSV
我目前正在开发一个小型解析器。
我的第一个脚本取得了非常好的结果!这能跑得很好! 它从页面获取数据: http://192.68.214.70 /km/asps/schulsuche.asp?q=n&a=20 (注意 6142 条记录) - 但请注意 - 数据没有分离,因此后续处理数据有点困难。因此我有第二个脚本 - 见下文!
注意 - 朋友帮助我完成了这两个脚本。我需要介绍自己,作为一名真正的新手,需要二合一迁移帮助。所以,你看,我的 Perl 知识还没有那么丰富,我无法自己迁移到 Perl 知识!任何和所有的帮助都会很棒!
第一个脚本:蜘蛛和解析器:它吐出这样的数据:
lfd. Nr. Schul- nummer Schulname Straße PLZ Ort Telefon Fax Schulart Webseite
1 0401 Mädchenrealschule Marienburg, Abenberg, der Diözese Eichstätt Marienburg 1 91183 Abenberg 09178/509210 Realschulen mrs-marienburg.homepage.t-online.de
2 6581 Volksschule Abenberg (Grundschule) Güssübelstr. 2 91183 Abenberg 09178/215 09178/905060 Volksschulen home.t-online.de/home/vs-abenberg
3 6913 Mittelschule Abenberg Güssübelstr. 2 91183 Abenberg 09178/215 09178/905060 Volksschulen home.t-online.de/home/vs-abenberg
4 0402 Johann-Turmair-Realschule Staatliche Realschule Abensberg Stadionstraße 46 93326 Abensberg 09443/9143-0,12,13 09443/914330 Realschulen www.rs-abensberg.de
但我需要分隔数据:用逗号或类似的东西!
我还有第二个脚本。这部分可以做CSV格式。我想将它与蜘蛛逻辑结合起来。但首先让我们看一下第一个脚本:具有出色的蜘蛛逻辑。
查看适当的代码:
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;
use LWP::Simple;
use Cwd;
use POSIX qw(strftime);
my $te = HTML::TableExtract->new;
my $total_records = 0;
my $suchbegriffe = "e";
my $treffer = 50;
my $range = 0;
my $url_to_process = "http://192.68.214.70/km/asps/schulsuche.asp?q=";
my $processdir = "processing";
my $counter = 50;
my $displaydate = "";
my $percent = 0;
&workDir();
chdir $processdir;
&processURL();
print "\nPress <enter> to continue\n";
<>;
$displaydate = strftime('%Y%m%d%H%M%S', localtime);
open OUTFILE, ">webdata_for_$suchbegriffe\_$displaydate.txt";
&processData();
close OUTFILE;
print "Finished processing $total_records records...\n";
print "Processed data saved to $ENV{HOME}/$processdir/webdata_for_$suchbegriffe\_$displaydate.txt\n";
unlink 'processing.html';
die "\n";
sub processURL() {
print "\nProcessing $url_to_process$suchbegriffe&a=$treffer&s=$range\n";
getstore("$url_to_process$suchbegriffe&a=$treffer&s=$range", 'tempfile.html') or die 'Unable to get page';
while( <tempfile.html> ) {
open( FH, "$_" ) or die;
while( <FH> ) {
if( $_ =~ /^.*?(Treffer <b>)(d+)( - )(d+)(</b> w+ w+ <b>)(d+).*/ ) {
$total_records = $6;
print "Total records to process is $total_records\n";
}
}
close FH;
}
unlink 'tempfile.html';
}
sub processData() {
while ( $range <= $total_records) {
getstore("$url_to_process$suchbegriffe&a=$treffer&s=$range", 'processing.html') or die 'Unable to get page';
$te->parse_file('processing.html');
my ($table) = $te->tables;
for my $row ( $table->rows ) {
cleanup(@$row);
print OUTFILE "@$row\n";
}
$| = 1;
print "Processed records $range to $counter";
print "\r";
$counter = $counter + 50;
$range = $range + 50;
$te = HTML::TableExtract->new;
}
}
sub cleanup() {
for ( @_ ) {
s/s+/ /g;
}
}
sub workDir() {
# Use home directory to process data
chdir or die "$!";
if ( ! -d $processdir ) {
mkdir ("$ENV{HOME}/$processdir", 0755) or die "Cannot make directory $processdir: $!";
}
}
但是由于上面的脚本不幸地不关心分隔符,我不得不关心一个方法,该方法确实会查找分隔符。为了得到分离的数据(输出)。
因此,通过分离,我可以处理数据 - 并将其存储在 mysql-table 中..或者做其他事情...所以这里[下面]是这些位 - 计算出 csv-formate 注意 - 我想要将下面的代码放入上面的代码中 - 将上述代码的蜘蛛逻辑与以 CSV 格式输出数据的逻辑结合起来。 在代码中的哪里设置问题:我们可以确定这一点以将一个点迁移到另一个点...!? 那真是太棒了...我希望我能明确我的想法...!?我们是否能够利用这两个部分(/scripts)的优势将它们迁移到一个部分?
所以问题是:在脚本中将 CSV 脚本设置到哪里(上图)
#!/usr/bin/perl
use warnings;
use strict;
use LWP::Simple;
use HTML::TableExtract;
use Text::CSV;
my $html= get 'http://192.68.214.70/km/asps/schulsuche.asp?q=a&a=20';
$html =~ tr/\r//d; # strip carriage returns
$html =~ s/ / /g; # expand spaces
my $te = new HTML::TableExtract();
$te->parse($html);
my @cols = qw(
rownum
number
name
phone
type
website
);
my @fields = qw(
rownum
number
name
street
postal
town
phone
fax
type
website
);
my $csv = Text::CSV->new({ binary => 1 });
foreach my $ts ($te->table_states) {
foreach my $row ($ts->rows) {
# trim leading/trailing whitespace from base fields
s/^\s+//, s/\s+$// for @$row;
# load the fields into the hash using a "hash slice"
my %h;
@h{@cols} = @$row;
# derive some fields from base fields, again using a hash slice
@h{qw/name street postal town/} = split /\n+/, $h{name};
@h{qw/phone fax/} = split /\n+/, $h{phone};
# trim leading/trailing whitespace from derived fields
s/^\s+//, s/\s+$// for @h{qw/name street postal town/};
$csv->combine(@h{@fields});
print $csv->string, "\n";
}
}
问题是我使用第一个脚本取得了非常好的结果!它从页面获取数据: http://192.68.214.70 /km/asps/schulsuche.asp?q=n&a=20 (注意 6142 条记录) - 但注意 - 数据没有分开......!
我还有第二个脚本。这部分可以做CSV格式。我想将它与蜘蛛逻辑结合起来。
要插入的部分在哪里?我期待任何和所有的帮助。
如果我必须更精确 - 请告诉我......
I am currently working on a little parser.
i have had very good results with the first script! This was able to run great!
It fetches the data from the page: http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=20
(note 6142 records) - But note - the data are not separated, so the subequent work with the data is a bit difficult. Therefore i have a second script - see below!
Note - friends helped me with the both scripts. I need to introduce myself as a true novice who needs help in migration two in one. So, you see, my Perl-knowlgedge is not so elaborated that i am able to do the migration into one on my own! Any and all help would be great!
The first script: a spider and parser: it spits out the data like this:
lfd. Nr. Schul- nummer Schulname Straße PLZ Ort Telefon Fax Schulart Webseite
1 0401 Mädchenrealschule Marienburg, Abenberg, der Diözese Eichstätt Marienburg 1 91183 Abenberg 09178/509210 Realschulen mrs-marienburg.homepage.t-online.de
2 6581 Volksschule Abenberg (Grundschule) Güssübelstr. 2 91183 Abenberg 09178/215 09178/905060 Volksschulen home.t-online.de/home/vs-abenberg
3 6913 Mittelschule Abenberg Güssübelstr. 2 91183 Abenberg 09178/215 09178/905060 Volksschulen home.t-online.de/home/vs-abenberg
4 0402 Johann-Turmair-Realschule Staatliche Realschule Abensberg Stadionstraße 46 93326 Abensberg 09443/9143-0,12,13 09443/914330 Realschulen www.rs-abensberg.de
But i need to separate the data: with commas or someting like that!
And i have a second script. This part can do the CSV-formate. i want to ombine it with the spider-logic. But first lets have a look at the first script: with the great spider-logic.
see the code that is appropiate:
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;
use LWP::Simple;
use Cwd;
use POSIX qw(strftime);
my $te = HTML::TableExtract->new;
my $total_records = 0;
my $suchbegriffe = "e";
my $treffer = 50;
my $range = 0;
my $url_to_process = "http://192.68.214.70/km/asps/schulsuche.asp?q=";
my $processdir = "processing";
my $counter = 50;
my $displaydate = "";
my $percent = 0;
&workDir();
chdir $processdir;
&processURL();
print "\nPress <enter> to continue\n";
<>;
$displaydate = strftime('%Y%m%d%H%M%S', localtime);
open OUTFILE, ">webdata_for_$suchbegriffe\_$displaydate.txt";
&processData();
close OUTFILE;
print "Finished processing $total_records records...\n";
print "Processed data saved to $ENV{HOME}/$processdir/webdata_for_$suchbegriffe\_$displaydate.txt\n";
unlink 'processing.html';
die "\n";
sub processURL() {
print "\nProcessing $url_to_process$suchbegriffe&a=$treffer&s=$range\n";
getstore("$url_to_process$suchbegriffe&a=$treffer&s=$range", 'tempfile.html') or die 'Unable to get page';
while( <tempfile.html> ) {
open( FH, "$_" ) or die;
while( <FH> ) {
if( $_ =~ /^.*?(Treffer <b>)(d+)( - )(d+)(</b> w+ w+ <b>)(d+).*/ ) {
$total_records = $6;
print "Total records to process is $total_records\n";
}
}
close FH;
}
unlink 'tempfile.html';
}
sub processData() {
while ( $range <= $total_records) {
getstore("$url_to_process$suchbegriffe&a=$treffer&s=$range", 'processing.html') or die 'Unable to get page';
$te->parse_file('processing.html');
my ($table) = $te->tables;
for my $row ( $table->rows ) {
cleanup(@$row);
print OUTFILE "@$row\n";
}
$| = 1;
print "Processed records $range to $counter";
print "\r";
$counter = $counter + 50;
$range = $range + 50;
$te = HTML::TableExtract->new;
}
}
sub cleanup() {
for ( @_ ) {
s/s+/ /g;
}
}
sub workDir() {
# Use home directory to process data
chdir or die "$!";
if ( ! -d $processdir ) {
mkdir ("$ENV{HOME}/$processdir", 0755) or die "Cannot make directory $processdir: $!";
}
}
But as this-above script-unfortunatley does not take care for the separators i have had to take care for a method, that does look for separators. In order to get the data (output) separated.
So with the separation i am able to work with the data - and store it in a mysql-table.. or do something else...So here [below] are the bits - that work out the csv-formate Note - i want to put the code below into the code above - to combine the spider-logic of the above mentioned code with the logic of outputting the data in CSV-formate.
where to set in the code Question: can we identify this point to migrate the one into the other... !?
That would be amazing... I hope i could make clear what i have in mind...!? Are we able to use the benefits of the both parts (/scripts ) migrating them into one?
So the question is: where to set in with the CSV-Script into the script (above)
#!/usr/bin/perl
use warnings;
use strict;
use LWP::Simple;
use HTML::TableExtract;
use Text::CSV;
my $html= get 'http://192.68.214.70/km/asps/schulsuche.asp?q=a&a=20';
$html =~ tr/\r//d; # strip carriage returns
$html =~ s/ / /g; # expand spaces
my $te = new HTML::TableExtract();
$te->parse($html);
my @cols = qw(
rownum
number
name
phone
type
website
);
my @fields = qw(
rownum
number
name
street
postal
town
phone
fax
type
website
);
my $csv = Text::CSV->new({ binary => 1 });
foreach my $ts ($te->table_states) {
foreach my $row ($ts->rows) {
# trim leading/trailing whitespace from base fields
s/^\s+//, s/\s+$// for @$row;
# load the fields into the hash using a "hash slice"
my %h;
@h{@cols} = @$row;
# derive some fields from base fields, again using a hash slice
@h{qw/name street postal town/} = split /\n+/, $h{name};
@h{qw/phone fax/} = split /\n+/, $h{phone};
# trim leading/trailing whitespace from derived fields
s/^\s+//, s/\s+$// for @h{qw/name street postal town/};
$csv->combine(@h{@fields});
print $csv->string, "\n";
}
}
The thing is that i have had very good results with the first script! It fetches the data from the page: http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=20
(note 6142 records) - But note - the data are not separated...!
And i have a second script. This part can do the CSV-formate. i want to combine it with the spider-logic.
where is the part to insert? I look forward to any and all help.
if i have to be more precice - just let me know...
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
既然您已经输入了完整的脚本,我假设您想要对整个事情进行批评。
既然您只在一个块中使用
$te
,为什么要在这个外部作用域中声明和初始化它呢?同样的问题适用于大多数变量——尝试在尽可能的最内层范围内声明它们。一般来说,英语变量名称将使您能够与比德语名称更多的人进行协作。我懂德语,所以我理解你的代码的意图,但大多数代码都不懂。
不要使用
&
来调用 subs。只需使用workDir;
调用它们即可。自 1994 年以来就没有必要使用&
,并且它可能会导致令人讨厌的陷阱,因为&callMySub;
是一种特殊情况,它不执行以下操作您可能会认为,虽然callMySub;
做了正确的事情。现在一般首选词法文件句柄:
open my $outfile, ">file";
另外,您应该检查 open 中的错误或use autodie;
来使 open die失败时。如果您想用逗号分隔数据,则需要更改此行。看看 join 函数,它可以做你想做的事情。
在循环末尾而不是开头初始化
$te
是非常奇怪的。在循环顶部声明和初始化$te
更为惯用。您是指
s/\s+/ /g;
吗?我还没有评论你的第二个剧本;也许你应该把它作为一个单独的问题来问。
Since you have entered a complete script, I'll assume you want critique of the whole thing.
Since you only use
$te
in one block, why are you declaring and initializing it in this outer scope? The same question applies to most of your variables -- try to declare them in the innermost scope possible.In general, english variable names will enable you to collaborate with far more people than german names. I understand german, so I understand the intent of your code, but most of SO doesn't.
Don't use
&
to call subs. Just call them withworkDir;
. It hasn't been necessary to use&
since 1994, and it can lead to a nasty gotcha because&callMySub;
is a special case which doesn't do what you might think, whilecallMySub;
does the Right Thing.Generally lexical filehandles are preferred these days:
open my $outfile, ">file";
Also, you should check for errors from open oruse autodie;
to make open die on failure.This is the line to change if you want to put commas in separating your data. Look at the join function, it can do what you want.
It's very strange to initialize
$te
at the end of the loop instead of the beginning. It's much more idiomatic to declare and initialize$te
at the top of the loop.Did you mean
s/\s+/ /g;
?I haven't commented on your second script; perhaps you should ask it as a separate question.