Ignoring text in HTML::TreeBuilder output in Perl

Posted on 2024-11-15 06:26:55

I need to ignore or remove all text between HTML elements so that I can generate a blank template from a given web page.

I am parsing with the Perl modules HTML::TreeBuilder and HTML::Element.

I have tried the ignore_text method described in the documentation, but it doesn't give the correct results.

I have also tried doing the same thing with DOMXPath in PHP, and the results seemed too cumbersome to manage. Regexes might work, but they are a last resort for me.

This is part of my current code, and it is very basic. The bottom part just writes the output to a file. The rest of the code works; I just need the formatting to work so that I can generate template files.

my $url= "http://www.example.com";

my $page = get($url) or die $!;
my $tree = HTML::TreeBuilder->new_from_content($page);

$tree->parse_file($page);

$tree->ignore_text;
$tree->elementify;

open OUTPUT, "+>".$body;
my $output = $tree->as_HTML;
print OUTPUT $output;
close OUTPUT;

Thanks in advance for the help!

EDIT: I found the problem: ignore_text only worked for me when parsing from a physical file. I had to save the page as a temp file, parse it, output it the way I wanted with no text, and then call unlink($tmp) at the bottom to delete the file. My script has since grown much more complicated, reading from and writing to a database, and having to create this temp file each time is kind of annoying...

Thanks for the reply below!

亂 2024-11-22 06:26:55

You are very close.

It looks like you need to call ignore_text with a true value, $tree->ignore_text(1), and make sure it is set before calling parse_file.

Sorry this is a bit long, but I hope it helps.

Here is a quick pass at the new code (hard to test without an example page):

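my $tree = HTML::TreeBuilder->new;

$tree->ignore_text(1);         # must be set before parsing
$tree->parse_file( $page );    # $page must be a file name here, not fetched content
$tree->elementify;             # call last: it re-blesses the tree as a plain HTML::Element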

Here is my quick test script using a local file:

use strict;
use warnings;

use HTML::TreeBuilder;

my $page = 'test.html';
my $tree = HTML::TreeBuilder->new();

$tree->ignore_text(1);
$tree->parse_file($page);
$tree->elementify;

print $tree->as_HTML;

Input test.html:

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <title>title text</title>
</head>
<body>
  <h1>Heading 1</h1>
  <p>paragraph text</p>
</body>
</html>

And output:

<html xmlns="http://www.w3.org/1999/xhtml"><head><title></title></head><body><h1></h1><p></body></html>
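
If the page is already in memory (for example, fetched over HTTP as in the question), the same ordering should work without a temp file: create an empty tree, set ignore_text(1), then feed the string through parse and eof. This is a minimal sketch, assuming HTML::TreeBuilder is installed; the strip_text helper name and the sample markup are made up for illustration.

```perl
use strict;
use warnings;

use HTML::TreeBuilder;

# Hypothetical helper: returns the markup of $html with all text nodes dropped.
sub strip_text {
    my ($html) = @_;

    my $tree = HTML::TreeBuilder->new;   # empty tree, nothing parsed yet
    $tree->ignore_text(1);               # set before any parsing happens
    $tree->parse($html);                 # parse from a string, not a file
    $tree->eof;                          # signal end of input

    my $out = $tree->as_HTML;
    $tree->delete;                       # release the tree
    return $out;
}

# Illustrative input; in the question this would be the fetched page content.
my $page = '<html><head><title>title text</title></head>'
         . '<body><h1>Heading 1</h1><p>paragraph text</p></body></html>';

print strip_text($page), "\n";
```

new_from_content parses its argument immediately, so setting ignore_text afterwards comes too late. That is likely why a temp file appeared to be necessary.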

Good luck

红墙和绿瓦 2024-11-22 06:26:55

Maybe you should use HTML::Parser for this task. It may take a little more code, but it shouldn't be too complicated.
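
One way the HTML::Parser approach could look is sketched below. It registers handlers only for start and end tags, so text events are never reported. It assumes HTML::Parser is installed; the tags_only helper name and the sample markup are hypothetical.

```perl
use strict;
use warnings;

use HTML::Parser;

# Hypothetical helper: keeps start and end tags verbatim and drops everything else.
sub tags_only {
    my ($html) = @_;
    my $out = '';

    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'text' is the argspec for the token's original source text
        start_h => [ sub { $out .= $_[0] }, 'text' ],
        end_h   => [ sub { $out .= $_[0] }, 'text' ],
        # no text_h, comment_h, etc.: those events are not reported at all
    );

    $parser->parse($html);
    $parser->eof;
    return $out;
}

# Illustrative input only.
my $page = '<body><h1>Heading 1</h1><p>paragraph text</p></body>';

print tags_only($page), "\n";
```

Unlike HTML::TreeBuilder, this does not build a tree or add implied tags; it passes the original tags through as written. It also drops comments, the doctype declaration, and the contents of script and style elements, so extra handlers would be needed if those should be kept.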
