Ignoring text in HTML::TreeBuilder output (Perl)
I need to ignore or remove all text between HTML elements so that I can generate a blank template from a given web page.
I am parsing with the Perl modules HTML::TreeBuilder and HTML::Element.
I have tried the ignore_text method described in the documentation, but it doesn't produce the correct results.
I have also tried doing the same thing with DOMXPath in PHP, but the results were too cumbersome to manage. Regexes might work, but they are a last resort for me.
This is part of my current code; it is very basic, and the bottom just writes the output to a file. All of the code runs; I just need the formatting to work so I can generate template files.
my $url= "http://www.example.com";
my $page = get($url) or die $!;
my $tree = HTML::TreeBuilder->new_from_content($page);
$tree->parse_file($page);
$tree->ignore_text;
$tree->elementify;
open OUTPUT, "+>".$body;
my $output = $tree->as_HTML;
print OUTPUT $output;
close OUTPUT;
Thanks in advance for the help!
EDIT: I found the problem: ignore_text only works when you parse from a physical file. I had to save the page to a temp file, parse it, output it the way I wanted with no text, and then call unlink($tmp) at the bottom to delete the file. My script has since grown much more complicated, with reading from and writing to a database, and having to create this temp file each time is kind of annoying...
Thanks for the reply below!
Comments (2)
You are very close.
It looks like you need to call ignore_text with a true value, $tree->ignore_text(1), and make sure it is set before calling parse_file.
Sorry this is a bit long, but I hope it helps.
Here is a quick pass at the new code (hard to test without an example page):
Here is my quick test script using a local file:
Input test.html:
And output:
Good luck
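The point made above (pass a true value to ignore_text, and do it before any parsing happens) can be sketched roughly as follows. This is a minimal sketch and not the answerer's original code: strip_text_tree is a hypothetical helper name, the sample markup is made up, and it assumes HTML::TreeBuilder is installed. It uses new() rather than new_from_content(), because new_from_content() parses the content immediately, before ignore_text can be switched on. It also parses from an in-memory string, which should avoid the need for a temp file.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: return the markup of $html with all text nodes dropped.
sub strip_text_tree {
    my ($html) = @_;

    # Build an empty tree first so options can be set before parsing.
    my $tree = HTML::TreeBuilder->new;
    $tree->ignore_text(1);    # must be a true value, set before parsing

    # Parse from a string instead of a file, then signal end of input.
    $tree->parse($html);
    $tree->eof;

    # undef: default entity encoding; '  ': indent string;
    # {}: treat no end tags as optional, so every element is closed.
    my $out = $tree->as_HTML(undef, '  ', {});

    $tree->delete;            # release the tree explicitly
    return $out;
}

# Made-up sample input: the tags should survive, the text should not.
print strip_text_tree(
    '<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>'
);
```

In the question's code, the string returned by get($url) would be passed to the helper in place of the sample markup.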
Maybe you should use HTML::Parser for this task. It may take a little more code, but it should not be too complicated.
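As a rough illustration of that suggestion, here is a minimal sketch using the event-driven HTML::Parser version 3 API. It is only one way to do it under stated assumptions: strip_text_stream is a hypothetical helper name and the sample input is made up. Handlers are registered for declarations, start tags, and end tags, each copying the token's original source text; no text handler is registered, so the text between tags is simply never emitted.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: copy tags through unchanged and drop all text.
sub strip_text_stream {
    my ($html) = @_;
    my $out = '';

    # Append the token's original source text to the output buffer.
    my $copy = sub { $out .= $_[0] };

    my $parser = HTML::Parser->new(
        api_version   => 3,
        declaration_h => [ $copy, 'text' ],    # e.g. <!DOCTYPE html>
        start_h       => [ $copy, 'text' ],    # opening tags with attributes
        end_h         => [ $copy, 'text' ],    # closing tags
        # No text_h is registered, so text content is discarded.
    );

    $parser->parse($html);
    $parser->eof;
    return $out;
}

# Made-up sample input.
print strip_text_stream('<p class="x">Hello <b>world</b></p>'), "\n";
```

Unlike the tree-based approach, this does not add implied html, head, or body elements or re-indent the markup; it echoes the tags as they appeared in the source, which may or may not be what a template needs.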