Using HTML::TableExtract [Perl] to parse a document and get only some labels and values [row by row]

Posted on 2024-10-08 08:02:23

Good evening, dear community,

First of all, I am very happy that I have found this great place. I like this forum very much, since it has a great and supportive community! I learn a lot from you folks here. Each question gets some great reviewers, and each thread is a valuable learning asset.

I am fairly new to Perl, and fairly new to this board. I am currently working on a little parser: I want to parse a table.

Click here to see the target URL, with the very simple table (some rows only)

This page has a table of labels and values. We need to provide something that uniquely identifies the table in question. This can be the content of its headers or its HTML attributes. In this case there is only one table in the document, so we do not even need to do that. But if I had to provide anything to the constructor, I would provide the class of the table.
We do not want the columns of the table. The first column consists of labels and the second column consists of values. To get the labels and values at the same time, we should process the table row by row. Can this be done like so:

#!/usr/bin/perl

use strict; use warnings;
use HTML::TableExtract;
use YAML;

my $te = HTML::TableExtract->new(
    attribs => { class => 'bp_ergebnis_tab_info' },
);

$te->parse_file('t.html');
# here the file with the captured site is stored 

foreach my $table ( $te->tables ) { 
    foreach my $row ($table->rows) {
        print "   ", join(',', @$row), "\n";
    }
}

See the results:

 martin@suse-linux:~/perl> perl  parser_perl_nrw2.pl
 Use of uninitialized value $row in join or string at parser_perl_nrw2.pl line 17.
  Schuldaten,
  Schule hat Schulbetrieb
  Schulnummer,143960

   Amtliche Bezeichnung,Franziskusschule Kath. Hauptschule Ahaus - Sekundarstufe I -

   Strasse,Hof zum Ahaus 6

   Plz und Ort,48683 Ahaus

   Telefon,02561 4291990

   Fax,02561 42919920

   E-Mail-Adresse,[email protected]


   Internet,http://www.franziskusschule.de
  ,Schule in öffentlicher Trägerschaft

Well, I want to get the data shown above. But as you can see below, the output also contains more lines, including warnings like these (I want to get rid of the following lines!):

Use of uninitialized value $row in join or string at parser_perl_nrw2.pl line 17.
,Schülergesamtzahl,648
Use of uninitialized value $row in join or string at parser_perl_nrw2.pl line 17.
,Ganztagsunterricht,Ja (erweiterter Ganztagsbetrieb)
Sonstiges,Teilnahme am Projekt 'Betrieb und Schule (BUS)'
Use of uninitialized value $row in join or string at parser_perl_nrw2.pl line 17.
Unterrichtsangebote,
Use of uninitialized value $row in join or string at parser_perl_nrw2.pl line 17.
Schule erteilt Unterricht in Fremdsprache(n)...,
,Englisch

Question: how do I get rid of the unsanitized data? Everything else is fine, but these extra lines are very ugly, and since I want to store the data in a database, I do not need them.

As always, any and all help will be greatly appreciated. Many thanks in advance!

Regards,
zero

Comments (1)

清引 2024-10-15 08:02:23

You want to get rid of the uninitialized value warnings?

Some of the table cells are empty so you may want to test for them or filter them out. Like this for example:

foreach my $table ( $te->tables ) {
    foreach my $row ( $table->rows ) {
        # keep only the defined cells (empty cells come back as undef)
        my @values = grep { defined } @$row;
        print "   ", join(',', @values), "\n";
    }
}
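
Since the goal is to store the data in a database, the same filter could be extended to trim whitespace and skip rows that end up empty. The following is only a minimal sketch, assuming each row is an array reference of the shape returned by `$table->rows`; the helper name `clean_rows` and the sample rows are hypothetical and not part of HTML::TableExtract:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: takes a list of row array refs and returns
# only the non-empty cells of the rows that still have content.
sub clean_rows {
    my @rows = @_;
    my @cleaned;
    for my $row (@rows) {
        my @cells;
        for my $cell (@$row) {
            next unless defined $cell;             # skip undefined (empty) cells
            (my $text = $cell) =~ s/^\s+|\s+$//g;  # trim surrounding whitespace
            push @cells, $text if length $text;    # skip whitespace-only cells
        }
        push @cleaned, \@cells if @cells;          # drop rows with nothing left
    }
    return @cleaned;
}

# Made-up sample rows imitating the output shown in the question.
my @sample = (
    [ undef, "Schuldaten", undef ],
    [ "Schulnummer", "143960\n" ],
    [ undef, undef ],
    [ "  Telefon", "02561 4291990" ],
);

print join(',', @$_), "\n" for clean_rows(@sample);
```

With the parser from the question, the call would presumably look like `clean_rows($table->rows)` inside the loop over `$te->tables`.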

You could also outright disable warnings for that particular block with no warnings 'uninitialized', but that is generally not good practice.
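
For completeness, here is a sketch of what the scoped variant might look like. The `no warnings` pragma is lexically scoped, so placing it inside a bare block silences the warning only there. The sample row is made up for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up row whose first cell is undefined, standing in for an
# empty table cell.
my $row = [ undef, 'Englisch' ];

my $line;
{
    # Lexically scoped: warnings stay enabled outside this block.
    no warnings 'uninitialized';
    $line = join(',', @$row);    # undef is treated as an empty string
}

print "$line\n";    # prints ",Englisch"
```

Note that the empty cell still shows up as an empty field in the output, which is one reason filtering the cells is usually the better option.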
