How do I use Perl regular expressions to extract information from an HTML file?

Posted on 2024-12-10 17:25:28

I have two files, one XML and one HTML, and need to extract data from them that matches certain patterns.

My XML file is well formatted, so I can use readline to read it one line at a time and search for data between tags:

if ($line =~ /<tag1>$varvalue<\/tag1>/)
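As a minimal sketch of this line-by-line approach, the match can also capture whatever sits between the tags instead of only testing for a known value. The tag name tag1 comes from the question; the helper name extract_tag1 and the sample lines are made up for illustration, and the pattern only works when the opening and closing tags are on the same line.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: return the text between <tag1> and </tag1> on one
# line, or undef when the line holds no such pair.
sub extract_tag1 {
    my ($line) = @_;
    return $line =~ m{<tag1>(.*?)</tag1>} ? $1 : undef;
}

# Made-up sample input standing in for lines read with readline.
my @lines = (
    "<tag1>first value</tag1>\n",
    "<other>ignored</other>\n",
    "  <tag1>second value</tag1>\n",
);

# Keep only the lines that actually contained a <tag1> pair.
my @values = grep { defined } map { extract_tag1($_) } @lines;
print "$_\n" for @values;
```

Using m{...} as the delimiter avoids having to escape the slash in the closing tag.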

However, my HTML file has some of the worst markup I have seen. It looks like this:

<div class="theater">
    <h2>
    <a href="/showtimes/university-village-3" >**University Village 3**</a></h2>
    <div class="address">
        <i>**3323 South Hoover Street, Los Angeles CA 90007 | (213) 748-6321**</i>
    </div>
</div>

<div class="mtitle">
    <a href="/movie/dream-house-2011"  title="Dream House" onmouseover="mB(event, 771204354);"  >**Dream House**</a>
    <span>**(PG-13 , 1 hr. 31 min.)**</span>
</div>

<div class="times">

    **1:00 PM,**
</div>

From this file I need to pick out the data that is shown in bold.

I can use Perl regular expressions to search for data in this file.

朕就是辣么酷 2024-12-17 17:25:28

RegEx match open tags except XHTML self-contained tags

http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html

Using regular expressions to parse HTML: why not?

When you are done reading those, come back :)

Edit: to actually solve your problem, take a look at this module:

http://perlmeme.org/tutorials/html_parser.html

A sample that parses an HTML file:

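#!/usr/local/bin/perl

use strict;
use warnings;
use HTML::TreeBuilder;

# Build a parse tree from the HTML file
my $tree = HTML::TreeBuilder->new;
$tree->parse_file('C:\Users\Stefanos\workspace\HTML_Parser_Test\test.html');

# Collect every <div> element in the document
my @divs = $tree->find('div');

# Free the tree explicitly once it is no longer needed
$tree->delete;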

In this example I just used your tags as the body of an .html file. The divs are stored in the @divs array. I have no idea which text you want to find, because ** is not an element, so I can't help you further.

P.S. I had never used this module before, but I put this together in 5 minutes, so it is not hard to parse the HTML file and find whatever you want.
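To go one step beyond collecting every div, here is a rough sketch that picks individual pieces of text out of the tree with look_down and as_trimmed_text from HTML::Element. It assumes the ** markers in the question only indicate emphasis and are not part of the real page, so the sample markup below leaves them out. The helper name text_in and the hash keys are invented for this example.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample markup from the question, with the ** emphasis markers left out
# (assumption: they are not part of the real page).
my $html = <<'HTML';
<div class="theater">
    <h2>
    <a href="/showtimes/university-village-3" >University Village 3</a></h2>
    <div class="address">
        <i>3323 South Hoover Street, Los Angeles CA 90007 | (213) 748-6321</i>
    </div>
</div>
<div class="mtitle">
    <a href="/movie/dream-house-2011" title="Dream House">Dream House</a>
    <span>(PG-13 , 1 hr. 31 min.)</span>
</div>
<div class="times">
    1:00 PM,
</div>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# Hypothetical helper: trimmed text of the first $tag inside the first
# <div class="$class"> (or of the div itself when no tag is given).
# Returns undef if nothing matches.
sub text_in {
    my ($class, $tag) = @_;
    my $div = $tree->look_down(_tag => 'div', class => $class)
        or return undef;
    my $el = defined $tag ? $div->look_down(_tag => $tag) : $div;
    return $el ? $el->as_trimmed_text : undef;
}

my %found = (
    theater => text_in('theater', 'a'),
    address => text_in('address', 'i'),
    movie   => text_in('mtitle',  'a'),
    rating  => text_in('mtitle',  'span'),
    times   => text_in('times'),
);
print "$_: $found{$_}\n" for sort keys %found;

$tree->delete;
```

This only looks at the first div of each class. A page with several theaters would need look_down in list context and a loop over the results.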

A regex to match a specific tag and store its contents in $1:

if ($subject =~ m!<tagname[^>]*>(.*?)</tagname>!s) {
    # Successful match
}

However, you will soon run into the limitations of this approach when you have nested elements.

Replace tagname with the actual tag, e.g. in your case i, a, span, or div. For div, though, the match ends at the first closing </div>, so with nested divs you get a truncated capture, which is not what you want.
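A small sketch of that limitation, using a shortened and made-up version of the theater block from the question: the pattern behaves for the flat a and i elements but cuts the nested div short.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Shortened, made-up version of the theater block from the question.
my $html = '<div class="theater"><a href="/x">University Village 3</a>'
         . '<div class="address"><i>3323 South Hoover Street</i></div></div>';

# Flat elements: the non-greedy capture returns just the element's text.
my ($link_text) = $html =~ m!<a[^>]*>(.*?)</a>!s;
my ($italic)    = $html =~ m!<i[^>]*>(.*?)</i>!s;

# Nested elements: the match starts at the outer <div> but stops at the
# first </div>, which closes the inner "address" div. The capture is cut
# short and still contains markup.
my ($div_body)  = $html =~ m!<div[^>]*>(.*?)</div>!s;

print "a:   $link_text\n";
print "i:   $italic\n";
print "div: $div_body\n";
```

Note also that tagname[^>]* matches any tag whose name merely starts with tagname, so the pattern for i would also match an img tag if one came first. Writing the opening tag as <i\b[^>]*> avoids that particular problem, but not the nesting one.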

够运 2024-12-17 17:25:28

Parsing XML and HTML using regular expressions is a fool's errand. There are many easy-to-use Perl modules for parsing HTML. Here is an example using HTML::TokeParser::Simple. I've omitted the code to associate movies and showtimes with theaters (because I have no intention of building an appropriate input file):

#!/usr/bin/env perl

use strict; use warnings;
use HTML::TokeParser::Simple;

my $parser = HTML::TokeParser::Simple->new(handle => \*DATA);

my @theaters;

while (my $div = $parser->get_tag('div')) {
    my $class = $div->get_attr('class');
    next unless defined($class) and $class eq 'theater';

    my %record;

    $record{theater} = $parser->get_text('/a');
    $record{address} = $parser->get_text('/i');

    # Trim leading and trailing whitespace (/g so both ends are handled)
    s{(?:^\s+)|(?:\s+\z)}{}g for values %record;

    push @theaters, \%record;
}

use YAML;
print Dump \@theaters;

__DATA__
<div class="theater">
    <h2>
    <a href="/showtimes/university-village-3" >**University Village 3**</a></h2>
    <div class="address">
        <i>**3323 South Hoover Street, Los Angeles CA 90007 | (213) 748-6321**</i>
    </div>
</div>

<div class="mtitle">
    <a href="/movie/dream-house-2011"  title="Dream House" onmouseover="mB(event, 771204354);"  >**Dream House**</a>
    <span>**(PG-13 , 1 hr. 31 min.)**</span>
</div>

<div class="times">

    **1:00 PM,**
</div>

<div class="theater">
    <h2>
    <a href="/showtimes/university-village-3" >**Some other theater*</a></h2>
    <div class="address">
        <i>**1234 South Hoover Street, St Paul, MN 99999 | (999) 748-6321**</i>
    </div>
</div>

Output:

[sinan@macardy]:~/tmp> ./tt.pl
---
- address: '**3323 South Hoover Street, Los Angeles CA 90007 | (213) 748-6321**'
  theater: '**University Village 3**'
- address: '**1234 South Hoover Street, St Paul, MN 99999 | (999) 748-6321**'
  theater: '**Some other theater*'
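One way the omitted association step might look, as a sketch only: it assumes each mtitle div belongs to the most recent theater div and each times div to the most recent mtitle div, which is the order in the sample from the question. The record layout (movies, title, info, times) is invented for this example, the ** markers are left out of the sample markup, and Data::Dumper stands in for YAML to keep the dependencies down.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TokeParser::Simple;
use Data::Dumper;

# Sample markup from the question without the ** emphasis markers.
my $html = <<'HTML';
<div class="theater">
    <h2>
    <a href="/showtimes/university-village-3" >University Village 3</a></h2>
    <div class="address">
        <i>3323 South Hoover Street, Los Angeles CA 90007 | (213) 748-6321</i>
    </div>
</div>
<div class="mtitle">
    <a href="/movie/dream-house-2011" title="Dream House">Dream House</a>
    <span>(PG-13 , 1 hr. 31 min.)</span>
</div>
<div class="times">
    1:00 PM,
</div>
HTML

my $parser = HTML::TokeParser::Simple->new(string => $html);
my @theaters;

while (my $div = $parser->get_tag('div')) {
    my $class = $div->get_attr('class') // '';

    if ($class eq 'theater') {
        # Start a new theater; later movies attach to it.
        push @theaters, {
            theater => $parser->get_trimmed_text('/a'),
            address => $parser->get_trimmed_text('/i'),
            movies  => [],
        };
    }
    elsif ($class eq 'mtitle' and @theaters) {
        push @{ $theaters[-1]{movies} }, {
            title => $parser->get_trimmed_text('/a'),
            info  => $parser->get_trimmed_text('/span'),
            times => [],
        };
    }
    elsif ($class eq 'times' and @theaters and @{ $theaters[-1]{movies} }) {
        # Showtimes belong to the most recent movie of the current theater.
        my $text = $parser->get_trimmed_text('/div');
        $theaters[-1]{movies}[-1]{times} =
            [ grep { /\S/ } split /\s*,\s*/, $text ];
    }
}

$Data::Dumper::Sortkeys = 1;
print Dumper(\@theaters);
```

If the real page orders or nests these blocks differently, the "most recent" bookkeeping would have to change accordingly.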
