Single-line regex needed for a pattern in Perl

Posted 2024-10-26 18:53:58

I need to read many HTML files that share a similar structure, using Perl.

The structure is STRRRR...E, where:

  • S = the HTML header just before the table begins
  • T = the unique table-start structure in the HTML file (I can identify it)
  • R = a group of HTML elements (these are tr's; I can identify them too)
  • E = everything remaining, which signifies the end of the R's

I want to extract all the R's into an array using a single-line "m" match operator (see perlop).

I'm looking for something like this:

@all_Rs = $htmlfile=~m{ST(R)*E}gs;

But it has never worked.

Until now I have been doing this in a roundabout way: deleting the unwanted text, using a for loop, and so on.
I want to extract all rows from this page: http://www.trainenquiry.com/StaticContent/Railway_Amnities/Enquiry%20-%20North/STATIONS.aspx
and there are many such pages.
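
One likely reason the attempt above falls short: a capture group under a quantifier, as in `(R)*`, keeps only its final iteration, so the match returns a single R rather than all of them. The sketch below shows that behaviour and a two-step alternative in core Perl. The markup and the `$T`, `$R`, `$E` patterns are hypothetical stand-ins for the pieces described in the question, and the approach assumes very regular HTML; the answers that follow recommend a real parser for anything else.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the S, T, R and E pieces described in the question.
my $html = 'HEAD<table id="t"><tr>a</tr><tr>b</tr><tr>c</tr></table>TAIL';

my $T = qr{<table id="t">};    # unique table start (assumed)
my $R = qr{<tr>.*?</tr>}s;     # one row (assumed to contain no nested rows)
my $E = qr{</table>};          # start of whatever follows the rows (assumed)

# A quantified capture group keeps only its last iteration,
# so this list holds one element, not three.
my @last_only = $html =~ m{$T($R)*$E}s;

# Step 1: capture the whole run of rows as one string.
# Step 2: split that string into rows with m//g in list context.
sub extract_rows {
    my ($text) = @_;
    my ($block) = $text =~ m{$T((?:$R)*)$E}s or return;
    return $block =~ m{($R)}g;
}

my @rows = extract_rows($html);
print "$_\n" for @rows;
```

The two steps can be squeezed onto one line, but keeping them apart makes it easier to see which part fails when a page deviates from the expected layout.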

Comments (3)

小嗲 2024-11-02 18:53:58

Regex is the wrong tool. Use an HTML parser.

use HTML::TreeBuilder::XPath;
my $tree= HTML::TreeBuilder::XPath->new_from_content(<<'END_OF_HTML');
<html>
    <table>
        <tr>1
        <tr>2
        <tr>3
        <tr>4
        <tr>5
    </table>
</html>
END_OF_HTML

print $_->as_text, "\n" for $tree->findnodes('//tr');   # one row per line

HTML::TreeBuilder::XPath inherits from HTML::TreeBuilder.
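
To connect this back to the question, which asks for all rows in an array, here is a minimal sketch along the same lines. The sample markup and the `id="stations"` attribute are assumptions made for illustration; on a real page the XPath expression would need to match whatever identifies the table.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical markup; the id "stations" is an assumption made for this sketch.
my $html = <<'END_OF_HTML';
<html><body>
<table id="stations">
  <tr><td>1</td></tr>
  <tr><td>2</td></tr>
  <tr><td>3</td></tr>
</table>
</body></html>
END_OF_HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# One serialized <tr> element per array entry, in document order.
my @all_Rs = map { $_->as_HTML } $tree->findnodes('//table[@id="stations"]//tr');

$tree->delete;    # free the tree once the strings have been copied out

print scalar(@all_Rs), " rows\n";
```

Using `as_text` instead of `as_HTML` in the `map` would give the text content of each row rather than its markup.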

世界如花海般美丽 2024-11-02 18:53:58

daxim is right about using a real parser. My personal choice is XML::LibXML.

use XML::LibXML;
my $parser = XML::LibXML->new();
$parser->recover(1);                 # don't fail on parsing errors
my $doc = do { 
    local $SIG{__WARN__} = sub {};   # silence warning about parsing errors
    $parser->parse_html_file('http://www.trainenquiry.com/StaticContent/Railway_Amnities/Enquiry%20-%20North/STATIONS.aspx');
};

print $_->toString() for $doc->findnodes('//tr[td[1][@class="td_background"]]');

This gets me each station row from that page.

For a bit more work we can have a nice data structure to hold the text in each cell.

use Data::Dumper;
my @data = map {
    my $row = $_;
    [ map {
        $_->findvalue('normalize-space(text())');
    } $row->findnodes('td') ]
} $doc->findnodes('//tr[td[1][@class="td_background"]]');
print Dumper \@data;
晨敛清荷 2024-11-02 18:53:58

If you want to process an HTML table, consider using a module that knows how to process HTML tables!

#!/usr/bin/perl
use warnings;
use strict;
use LWP::Simple;
use HTML::TableExtract;


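my $url  = 'http://www.trainenquiry.com/StaticContent/Railway_Amnities/Enquiry%20-%20North/STATIONS.aspx';
my $html = get($url) or die "Could not fetch $url\n";
$html =~ s/&nbsp;/ /g;    # turn non-breaking-space entities into plain spaces

my $te = HTML::TableExtract->new( depth => 1, count => 2 );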
$te->parse($html);
foreach my $ts ($te->table_states) {
   foreach my $row ($ts->rows) {
      next if $row->[0] =~ /^\s*(Next|Station)/;
      next if $row->[4] =~ /^\s*(ARR\/DEP|RESERVATION)/;
      foreach my $cell (@$row) {
          $cell =~ s/^\s+//;
          $cell =~ s/\s+$//;
          print "$cell\n";
      }
      print "\n";
   }
}
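
The filtering and trimming in that loop can also be tried without fetching anything. The sketch below uses core Perl only and made-up rows shaped like the array references the loop iterates over; the station data and the `clean_rows` helper are hypothetical. It also guards against undefined or missing cells, on the assumption that empty table cells may come back as undef, which would otherwise trigger warnings in the pattern matches.

```perl
use strict;
use warnings;

# Hypothetical rows: one array reference per table row.
my @raw_rows = (
    [ 'Station Name', 'Code', 'Phone', 'Fax', 'RESERVATION' ],
    [ "  Example Junction \n", ' EXJ ', '', undef, ' 0000 ' ],
    [ 'Next', '', '', '', '' ],
);

sub clean_rows {
    my @rows = @_;
    my @clean;
    for my $row (@rows) {
        # Treat undefined cells as empty strings so the matches below do not warn.
        my @cells = map { defined $_ ? $_ : '' } @$row;

        # Skip header and navigation rows, as in the loop above.
        next if $cells[0] =~ /^\s*(?:Next|Station)/;
        next if @cells > 4 && $cells[4] =~ /^\s*(?:ARR\/DEP|RESERVATION)/;

        # Strip leading and trailing whitespace from every cell in place.
        s/^\s+|\s+$//g for @cells;
        push @clean, \@cells;
    }
    return @clean;
}

my @clean = clean_rows(@raw_rows);
print join( ' | ', @$_ ), "\n" for @clean;
```

Returning the cleaned rows instead of printing them inside the loop keeps the data available for later steps, such as writing one record per station.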