HTML::TableExtract: how to run it with the right arguments [see example]
A question about a parser. Is there any way to get separators between the table columns in the output? The parser script already runs nicely. Note: I want to store the data in a MySQL database, so it would be great to have some separators (commas, tabs, or something else); tab-separated values or comma-separated values are handy formats to work with.
(Here is the data from the following site: http://192.68.214.70/km/asps/schulsuche.asp?q=a&a=20 )
lfd. Nr. Schul- nummer Schulname Straße PLZ Ort Telefon Fax Schulart Webseite
1 0401 Mädchenrealschule Marienburg, Abenberg, der Diözese Eichstätt Marienburg 1 91183 Abenberg 09178/509210 Realschulen mrs-marienburg.homepage.t-online.de
2 6581 Volksschule Abenberg (Grundschule) Güssübelstr. 2 91183 Abenberg 09178/215 09178/905060 Volksschulen home.t-online.de/home/vs-abenberg
6 3074 Private Berufsschule zur sonderpäd. Förderung, Förderschwerpunkt Lernen, Abensberg Regensburger Straße 60 93326 Abensberg 09443/709191 09443/709193 Berufsschulen zur sonderpädog. Förderung www.berufsschule-abensberg.de
Well, I need those lines divided into at least three columns. Take one record as an example:
name: Volksschule Abenberg (Grundschule)
street: Güssübelstr. 2
postal-code and town: 91183 Abenberg
fax and telephone: 09178/215 09178/905060
type of school: Volksschulen
website: home.t-online.de/home/vs-abenberg
Or, even better, with the postal code and town split into two separate columns.
Question: is this possible?
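Splitting the combined cell looks feasible as long as every value starts with the postal code. A minimal sketch, assuming the cell always consists of a five-digit German postal code followed by the town name; `split_plz_ort` is a hypothetical helper name, not part of the posted script:

```perl
use strict;
use warnings;

# Hypothetical helper: split a combined "PLZ Ort" cell into its two parts.
# Assumes a five-digit postal code, then whitespace, then the town name.
sub split_plz_ort {
    my ($cell) = @_;
    return unless defined $cell;
    if ( $cell =~ /^\s*(\d{5})\s+(\S.*?)\s*$/ ) {
        return ( $1, $2 );
    }
    return;    # empty list: the cell did not match the assumed layout
}

my ( $plz, $ort ) = split_plz_ort('91183 Abenberg');
print "$plz\t$ort\n";
```

Cells that do not match the assumed layout yield an empty list, so the caller can decide whether to keep the original value in a single column instead.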
By the way, look at these records (here I only show the names of the schools):
1 0401 Mädchenrealschule Marienburg, Abenberg,
6 3074 Private Berufsschule zur sonderpäd. Förderung, Förderschwerpunkt Lernen, Abensberg
These names contain commas; does that make it difficult to write a parser that produces CSV format?
Any idea how to do this in Perl? If it is possible, that would be great!
Many thanks for any hint on this little issue. Apart from this, everything is great and fascinating!
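Commas inside a field are not a problem for CSV as long as such fields are wrapped in double quotes and embedded double quotes are doubled. A small core-only sketch of that quoting rule follows; `csv_line` is a hypothetical helper name, and the Text::CSV module from CPAN is probably the more robust choice where installing it is an option:

```perl
use strict;
use warnings;

# Hypothetical helper: turn a list of cell values into one CSV line.
# Fields containing a comma, a double quote, or a line break are quoted;
# embedded double quotes are doubled. Undefined cells become empty fields.
sub csv_line {
    my @fields = map { defined $_ ? $_ : '' } @_;
    for my $field (@fields) {
        if ( $field =~ /[",\r\n]/ ) {
            $field =~ s/"/""/g;
            $field = qq{"$field"};
        }
    }
    return join ',', @fields;
}

print csv_line(
    '0401',
    'Mädchenrealschule Marienburg, Abenberg, der Diözese Eichstätt',
    'Marienburg 1',
), "\n";
```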
zero
BTW: if you want, I can add the code. No problem here.
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;
use LWP::Simple;
use Cwd;
use POSIX qw(strftime);
my $te = HTML::TableExtract->new;
my $total_records = 0;
my $suchbegriffe = "e";
my $treffer = 50;
my $range = 0;
my $url_to_process = "http://192.68.214.70/km/asps/schulsuche.asp?q=";
my $processdir = "processing";
my $counter = 50;
my $displaydate = "";
my $percent = 0;
&workDir();
chdir $processdir or die "Cannot change to directory $processdir: $!";
&processURL();
print "\nPress <enter> to continue\n";
<STDIN>;    # wait for the user; plain <> would also read files named in @ARGV
$displaydate = strftime('%Y%m%d%H%M%S', localtime);
my $outfile = "webdata_for_${suchbegriffe}_$displaydate.txt";
open( OUTFILE, '>', $outfile ) or die "Cannot open $outfile: $!";
&processData();
close OUTFILE;
print "Finished processing $total_records records...\n";
print "Processed data saved to $ENV{HOME}/$processdir/$outfile\n";
unlink 'processing.html';
exit 0;
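sub processURL {
print "\nProcessing $url_to_process$suchbegriffe&a=$treffer&s=$range\n";
# getstore returns the HTTP status code, so it has to be checked with is_success
my $status = getstore("$url_to_process$suchbegriffe&a=$treffer&s=$range", 'tempfile.html');
die "Unable to get page (HTTP status $status)" unless is_success($status);
open( my $fh, '<', 'tempfile.html' ) or die "Cannot open tempfile.html: $!";
while ( my $line = <$fh> ) {
# The hit counter has the form: Treffer <b>N - M</b> word word <b>TOTAL</b>
if ( $line =~ m{Treffer <b>\d+ - \d+</b> \w+ \w+ <b>(\d+)} ) {
$total_records = $1;
print "Total records to process is $total_records\n";
}
}
close $fh;
unlink 'tempfile.html';
}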
sub processData {
$| = 1;    # unbuffer STDOUT so the progress line is updated immediately
while ( $range <= $total_records) {
# getstore returns the HTTP status code, so it has to be checked with is_success
my $status = getstore("$url_to_process$suchbegriffe&a=$treffer&s=$range", 'processing.html');
die "Unable to get page (HTTP status $status)" unless is_success($status);
$te->parse_file('processing.html');
my ($table) = $te->tables;
for my $row ( $table->rows ) {
cleanup(@$row);
print OUTFILE "@$row\n";
}
print "Processed records $range to $counter\r";
$counter += 50;
$range += 50;
$te = HTML::TableExtract->new;    # fresh parser for the next page
}
}
sub cleanup {
# Collapse runs of whitespace in each cell; @_ aliases the caller's values
for ( @_ ) {
next unless defined;
s/\s+/ /g;
}
}
sub workDir {
# Use home directory to process data
chdir or die "$!";
if ( ! -d $processdir ) {
mkdir ("$ENV{HOME}/$processdir", 0755) or die "Cannot make directory $processdir: $!";
}
}
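One way to get the separators asked for is to join the cells of each row with a tab instead of interpolating the array, which joins them with single spaces. A minimal sketch, assuming each `$row` is an array reference of cell strings as returned by `$table->rows`; `row_to_tsv` is a hypothetical helper name:

```perl
use strict;
use warnings;

# Hypothetical helper: build one tab-separated line from a row (array ref).
# Cells may be undefined, so they are replaced with empty strings, and any
# run of whitespace (including tabs and newlines inside a cell) is collapsed
# to a single space so it cannot be mistaken for a separator.
sub row_to_tsv {
    my ($row) = @_;
    my @cells = map { defined $_ ? $_ : '' } @$row;
    for (@cells) {
        s/\s+/ /g;
        s/^ | $//g;
    }
    return join "\t", @cells;
}

my $row = [ '2', '6581', "Volksschule\nAbenberg (Grundschule)", 'Güssübelstr. 2', undef ];
print row_to_tsv($row), "\n";
```

In `processData`, the line `print OUTFILE "@$row\n";` would then become `print OUTFILE row_to_tsv($row), "\n";`. A tab-separated file of this kind can be imported with MySQL's `LOAD DATA INFILE`, whose default field terminator is a tab.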
Comments (2)
I recommend the HTML::Parser module, which you can adapt to extract the values of the table cells. See the documentation:
http://search.cpan.org/perldoc?HTML::Parser
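A rough sketch of that suggestion, assuming the data sits in plain, non-nested `<tr>`/`<td>` markup; `extract_rows` is a hypothetical helper name, and the handlers are registered through the version 3 API of HTML::Parser:

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: return a list of rows, each an array ref of cell texts.
# Nested tables are not handled; this only illustrates the handler approach.
sub extract_rows {
    my ($html) = @_;
    my ( @rows, $row, $in_cell );

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h     => [ sub {
            my ($tag) = @_;
            if ( $tag eq 'tr' ) {
                push @rows, $row if $row && @$row;   # previous row had no </tr>
                $row = [];
            }
            elsif ( $tag eq 'td' || $tag eq 'th' ) {
                $row ||= [];
                push @$row, '';                      # start a new, empty cell
                $in_cell = 1;
            }
        }, 'tagname' ],
        text_h => [ sub {
            my ($text) = @_;
            $row->[-1] .= $text if $in_cell;         # text may arrive in chunks
        }, 'dtext' ],
        end_h => [ sub {
            my ($tag) = @_;
            if ( $tag eq 'td' || $tag eq 'th' ) {
                $in_cell = 0;
            }
            elsif ( $tag eq 'tr' && $row ) {
                push @rows, $row if @$row;
                $row = undef;
            }
        }, 'tagname' ],
    );
    $parser->parse($html);
    $parser->eof;
    push @rows, $row if $row && @$row;               # unterminated last row

    # Collapse whitespace and trim each cell
    for my $r (@rows) {
        for (@$r) { s/\s+/ /g; s/^ | $//g }
    }
    return @rows;
}

my @rows = extract_rows('<table><tr><td>6581</td><td>Volksschule Abenberg</td></tr></table>');
print join( "\t", @$_ ), "\n" for @rows;
```

Because every cell ends up as a separate array element, the rows can be written out with any separator afterwards.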