How can I make pQuery handle slightly malformed HTML?

Posted on 2024-09-26 22:25:51

pQuery is a pragmatic port of the jQuery JavaScript framework to Perl which can be used for screen scraping.

pQuery is quite sensitive to malformed HTML. Consider the following example:

use pQuery;

my $html_malformed = "<html><head><title>foo</title></head><body>bar</body></html>>";
my $page = pQuery($html_malformed);
my $title = $page->find("title");
print "The title is: ", $title->html, "\n";

pQuery won't find the title tag in the example above due to the double ">>" in the malformed HTML.

To make my pQuery-based applications more tolerant of malformed HTML, I need to pre-process the HTML by cleaning it up before passing it to pQuery.

Starting with the code fragment given above, what is the most robust pure-Perl way to clean up the HTML so that pQuery can parse it?


Comments (3)

牵你手 2024-10-03 22:25:51

I'd report this as a bug in pQuery. Here's a workaround:

use HTML::TreeBuilder;
use pQuery;

my $html_malformed = "<html><head><title>foo</title></head><body>bar</body></html>>";
my $html_cleaned = HTML::TreeBuilder->new_from_content($html_malformed);
my $page = pQuery($html_cleaned->as_HTML);
$html_cleaned->delete;
my $title = $page->find("title");
print "The title is: ", $title->html, "\n";

This doesn't make a lot of sense, since pQuery already uses HTML::TreeBuilder as its underlying parsing mechanism, but it does work.

把时间冻结 2024-10-03 22:25:51

Try HTML::Tidy, which fixes invalid HTML.
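
As a rough illustration of this suggestion, the sketch below runs the question's string through HTML::Tidy. It assumes the HTML::Tidy CPAN module and the tidy library it wraps are installed; because the module is a binding to a C library, it is not a pure-Perl option. The variable name `$html_cleaned` and the `tidy_mark` setting are illustrative choices, not part of the original answer.

```perl
use strict;
use warnings;
use HTML::Tidy;

my $html_malformed = "<html><head><title>foo</title></head><body>bar</body></html>>";

# tidy_mark => 0 asks Tidy not to add its own generator <meta> tag.
my $tidy = HTML::Tidy->new({ tidy_mark => 0 });

# clean() returns the repaired document as a single string.
my $html_cleaned = $tidy->clean($html_malformed);

print $html_cleaned;
```

The cleaned string can then be passed to pQuery() in place of the raw input, as in the question's code.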

北城挽邺 2024-10-03 22:25:51

Is that what you want?

$html_malformed =~ s|<+(<.*?>)>+|$1|g;
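
One caveat about the pattern above: `<+` and `>+` each require at least one extra bracket, so it only matches tags wrapped on both sides (such as `<<b>>`) and would leave the question's `</html>>` untouched. A minimal sketch of a more lenient variant follows, using `*` instead of `+` so that stray brackets on either side are optional. The helper name `strip_stray_brackets` is hypothetical, and this is a heuristic rather than a real HTML parser.

```perl
use strict;
use warnings;

# Hypothetical helper: drop runs of extra "<" before a tag and extra ">"
# after it, so "</html>>" becomes "</html>" and "<<b>>" becomes "<b>".
sub strip_stray_brackets {
    my ($html) = @_;
    # [^<>]* keeps the match inside a single tag.
    $html =~ s{<*(<[^<>]*>)>*}{$1}g;
    return $html;
}

my $html_malformed = "<html><head><title>foo</title></head><body>bar</body></html>>";
my $html_cleaned   = strip_stray_brackets($html_malformed);

print "$html_cleaned\n";
```

For the question's input this removes only the trailing `>`. Note that it would also trim a literal `>` that appears in text immediately after a tag, so it suits known, simple defects rather than arbitrary broken markup.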