Which Perl modules are good for data munging?

Posted on 2024-09-24 20:42:13

Nine years ago, when I started parsing HTML and free text with Perl, I read the classic Data Munging with Perl. Does anyone know whether David is planning to update the book, or whether there are similar books or web pages that explain newer parsing modules such as XML-Twig, Regexp-Grammars, etc.?

I assume that over the last nine years some modules have remained as good as they were, some have been kept up to date and gained interesting new methods, and some now have better replacements. For example, is Parse-RecDescent still the only option for free-text parsing, or will the Perl 6-influenced Regexp-Grammars replace it in many scenarios?

I have not done any active HTML, XML or free-text data mining with Perl for four years, so my toolkit in this area is probably a bit outdated. Any feedback on HTML and DOM manipulation, link extraction/verification, web testing (e.g. Mechanize), XML manipulation and free-text parsing from people who are up to date with the current CPAN modules in this area would be more than welcome.

Some new additions to my toolkit:

Still in my toolkit:

Comments (2)

梨涡少年 2024-10-01 20:42:13

It's unlikely that there will ever be a second edition of "Data Munging with Perl". I'm afraid that the economics just don't stack up.

But you're right that technology has moved on a long way since 2001, and there are plenty of new and improved modules that cover much of the same ground as the modules discussed in the book. For example, I can't remember the last time I used XML::Parser or XML::DOM. I seem to use XML::LibXML for the majority of my XML work these days. Also, of course, my discussion of databases is incomplete because it doesn't mention DBIx::Class.
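For readers who have not used XML::LibXML, here is a minimal sketch of what everyday usage can look like. The XML document, its element names and the second book entry are invented purely for illustration; only load_xml, findnodes, findvalue and getAttribute are the module's own calls.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample data, made up for this sketch.
my $xml = <<'XML';
<library>
  <book year="2001"><title>Data Munging with Perl</title></book>
  <book year="1999"><title>Some Other Book</title></book>
</library>
XML

# Parse the string into a DOM document.
my $doc = XML::LibXML->load_xml(string => $xml);

# Select nodes with XPath, then read text and attributes from each one.
for my $book ($doc->findnodes('/library/book')) {
    my $title = $book->findvalue('title');      # text content of <title>
    my $year  = $book->getAttribute('year');    # value of the year attribute
    print "$title ($year)\n";
}
```

The XPath-based interface is the main practical difference from stream-style parsing with XML::Parser: you query the tree rather than writing handlers.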

Perhaps it would be an interesting idea to update some of this information through some posts on my Perl blog. I'll give it some thought. Thanks for the idea.

木槿暧夏七纪年 2024-10-01 20:42:13

re: Parse::RecDescent <=> Regexp::Grammars

Damian Conway has been quoted as saying that Regexp::Grammars is the successor to Parse::RecDescent. Even so, if Parse::RecDescent still gets the job done for you, then continue to use it. The tool you know well is better than the tool you don't know!

However, if performance is a key issue and you are running perl 5.10+, then do consider Regexp::Grammars.

I hope Dave doesn't mind, but here is his first Parse::RecDescent example from Data Munging with Perl (11.1.1) converted to Regexp::Grammars:

use 5.010;
use warnings;
use Regexp::Grammars;

my $parser = qr{
    <Sentence>

    <rule: Sentence>        <subject> <verb> <object>
    <rule: subject>         <noun_phrase>
    <rule: object>          <noun_phrase>
    <rule: noun_phrase>     <pronoun> | <proper_noun> | <article> <noun>

    <token: verb>           wrote | likes | ate
    <token: article>        a | the | this
    <token: pronoun>        it | he
    <token: proper_noun>    Perl | Dave | Larry
    <token: noun>           book | cat
}xms;

while (<DATA>) {
    chomp;
    print "'$_' is ";
    print 'NOT ' unless $_ =~ $parser;
    say 'a valid sentence';
}

__DATA__
Larry wrote Perl
Larry wrote a book
Dave likes Perl
Dave likes the book
Dave wrote this book
the cat ate the book
Dave got very angry

NB: For those who don't have the book, only "Dave got very angry" is an invalid sentence :)
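The example above only tests whether a line matches. As a rough sketch of the next step, Regexp::Grammars also leaves the parse result in the %/ hash after a successful match, so the pieces of a sentence can be pulled out. The grammar is repeated unchanged from above; the sample sentence and variable names are only illustrative.

```perl
use 5.010;
use strict;
use warnings;
use Regexp::Grammars;
use Data::Dumper;

# Same grammar as in the example above.
my $parser = qr{
    <Sentence>

    <rule: Sentence>        <subject> <verb> <object>
    <rule: subject>         <noun_phrase>
    <rule: object>          <noun_phrase>
    <rule: noun_phrase>     <pronoun> | <proper_noun> | <article> <noun>

    <token: verb>           wrote | likes | ate
    <token: article>        a | the | this
    <token: pronoun>        it | he
    <token: proper_noun>    Perl | Dave | Larry
    <token: noun>           book | cat
}xms;

my $text = 'Larry wrote a book';

if ($text =~ $parser) {
    # %/ holds the result of the last successful grammar match,
    # keyed by the name of each subrule that was called.
    my $sentence = $/{Sentence};
    say "verb: $sentence->{verb}";

    # Dump the whole nested structure to see how the rules nest.
    local $Data::Dumper::Sortkeys = 1;
    print Dumper($sentence);
}
```

Dumping the structure once is usually the quickest way to learn where each subrule's result ends up before writing code that depends on it.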

/I3az/
