Parsing a large HTML file (local) - using Perl or PHP

Posted on 2024-10-05 07:50:09

I have a large document. I need to parse it and extract only this part: schule.php?schulnr=80287&lschb=

How do I parse this?

<td>
    <A HREF="schule.php?schulnr=80287&lschb=" target="_blank">
        <center><img border=0 height=16 width=15 src="sh_info.gif"></center>
    </A>
</td>

I'd love to hear from you.

烦人精 2024-10-12 07:50:10

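You ought to use a DOM parser such as PHP Simple HTML DOM Parser:

// Load the library first (adjust the path to wherever simple_html_dom.php lives)
include 'simple_html_dom.php';

// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');

// Find all links and print each href
foreach ($html->find('a') as $element)
    echo $element->href . '<br>';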

那一片橙海， 2024-10-12 07:50:10

In Perl, the quickest and best way I know to scan HTML is HTML::PullParser. It is built on a robust HTML parser rather than on a simple finite-state automaton such as a (non-recursive) Perl regex.

It works more like a SAX filter than a DOM.

use 5.010;
use constant NOT_FOUND => -1;
use strict;
use warnings;

use English qw<$OS_ERROR>;
use HTML::PullParser ();

my $pp 
    = HTML::PullParser->new(
      # your file or even a handle
      file        => 'my.html'
      # each token is a tuple of tag name and attribute hash
    , start       => 'tag, attr' 
      # only report tags with tag name 'a'
    , report_tags => [ 'a' ],
    ) 
    or die "Cannot open my.html: $OS_ERROR"
    ;

my $anchor_url;
while ( defined( my $t = $pp->get_token )) { 
    # skip anything that is not an <a> start tag (this shouldn't happen, really)
    next unless ref $t and $t->[0] eq 'a';
    # anchors without an href (e.g. <a name="...">) have nothing to match
    my $href = $t->[1]{href} // next;
    if ( index( $href, 'schule.php?' ) > NOT_FOUND ) { 
        $anchor_url = $href;
        last;
    }
}

# print the first matching URL, if one was found
say $anchor_url if defined $anchor_url;

内心荒芜 2024-10-12 07:50:10

What Rfvgyhn said, but in Perl flavor, since Perl was one of the question's tags: use HTML::TreeBuilder.
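
As a rough illustration of that suggestion, here is a minimal sketch using HTML::TreeBuilder's look_down method. It parses the snippet from the question out of a string so that it is self-contained; the variable names are made up for the example, and for a real document you would presumably build the tree from your file instead (for instance with new_from_file and a placeholder name like 'my.html').

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample markup copied from the question (stands in for the large file).
my $html = <<'HTML';
<td>
    <A HREF="schule.php?schulnr=80287&lschb=" target="_blank">
        <center><img border=0 height=16 width=15 src="sh_info.gif"></center>
    </A>
</td>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# look_down returns every element that meets all criteria:
# tag name 'a' and an href attribute matching the pattern.
my @urls = map { $_->attr('href') }
           $tree->look_down( _tag => 'a', href => qr/schule\.php\?/ );

print "$_\n" for @urls;

# HTML::TreeBuilder trees should be deleted explicitly when done.
$tree->delete;
```

Unlike the pull parser, this builds the whole tree in memory, so for a very large file the streaming approach may be the better fit.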

Plus, for the reasons why regex is almost never a good idea for parsing XML/HTML (sometimes it's Good Enough With Major Caveats), read the obligatory and infamous StackOverflow post:

RegEx match open tags except XHTML self-contained tags

Mind you, if the full extent of your task is literally "parse out HREF links", AND you don't have "<link>" tags, AND the HREF="something" substrings are guaranteed not to appear in any other context (e.g. in comments, as plain text, or with "HREF=" being part of a link itself), it just might fall into the "Good Enough" category above for regex usage:

my @lines = <>; # Replace with proper method of reading in your file
my @hrefs = map { $_ =~ /href="([^"]+)"/gi; } @lines;
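
To make the caveat concrete, here is a small sketch that runs the same regex over two made-up sample lines (the second line and the name old.php are invented for illustration). It shows the regex also picking up an href that sits inside an HTML comment, which a real parser would not report as a link.

```perl
use strict;
use warnings;

# Hypothetical input: one real link and one link that is commented out.
my @lines = (
    qq{<a href="schule.php?schulnr=80287&lschb=">info</a>\n},
    qq{<!-- <a href="old.php">disabled link</a> -->\n},
);

# In list context, a /g match returns every captured group,
# so map collects all href values across all lines.
my @hrefs = map { /href="([^"]+)"/gi } @lines;

# Prints both values, including the one from the comment.
print "$_\n" for @hrefs;
```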

倾城泪 2024-10-12 07:50:10

You could also do it this way (it's not Perl, but it is more "visual"):

1. Load the document into your browser, if possible.
2. Install the Firebug extension/add-on.
3. Install the FirePath extension.
4. Copy and paste this XPath expression into the text field labeled "XPath:"
   //a[contains(@href, "schule")]/@href
5. Click the "Eval" button.

There are also tools to do this on the command line, e.g. "xmllint" (for unix)

xmllint --html --xpath '//a[contains(@href, "schule")]/@href' myfile.php.or.html

You could do further processing from there.
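
Since the question asks about Perl, the same XPath expression can probably be evaluated from a script as well. The sketch below assumes the CPAN module HTML::TreeBuilder::XPath is installed and uses its findvalues method; the sample markup is the snippet from the question, and the variable names are illustrative only.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Sample markup copied from the question (stands in for the large file).
my $html = <<'HTML';
<td>
    <A HREF="schule.php?schulnr=80287&lschb=" target="_blank">
        <center><img border=0 height=16 width=15 src="sh_info.gif"></center>
    </A>
</td>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;    # signal end of input so the tree is finalized

# Same XPath expression as in the FirePath and xmllint examples above.
my @hrefs = $tree->findvalues('//a[contains(@href, "schule")]/@href');

print "$_\n" for @hrefs;

$tree->delete;
```

For a file on disk, parse_file with your own file name would replace the parse/eof pair.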
