Perl HTML::Strip whitelist
Is there a way to give the module a whitelist so that it preserves certain tags?
Given the markup below:
<div><b>test</b></div>
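stripping it with this code:
use HTML::Strip;
my $hs = HTML::Strip->new();
# Read the whole file, not just its first line
open my $fh, '<', 'test.markup' or die "Cannot open test.markup: $!";
my $raw_html = do { local $/; <$fh> };
close $fh;
my $clean_text = $hs->parse( $raw_html );
$hs->eof;
produces the output below: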
test
However, I would like to whitelist the <b> tag and get the output below:
<b>test</b>
EDIT, ONE SOLUTION
Using HTML::StripScripts::Parser
use HTML::StripScripts::Parser;
my $hss = HTML::StripScripts::Parser->new(
    {
        Context        => 'Inline',
        EscapeFiltered => 0,
        BanAllBut      => [qw(i b u)],   # whitelist: only these tags survive
    },
    strict_comment => 0,
    strict_names   => 0,
);
$hss->filter_html("<div><b>test</b></div>");
my $cooked = $hss->filtered_document;
# Rejected tags are replaced with a placeholder comment; remove it
$cooked =~ s/<!--filtered-->//g;
print $cooked;   # <b>test</b>
Comments (1)
Having read both the Perl wrapper and the underlying XS code, I found no whitelist capability.
It is possible to add one, though not 100% trivial: the code already checks tag names for "strip" tags like <script>, and it is only 200 LOC.
As another approach, a regex book from O'Reilly has a regular expression recipe that can strip HTML tags (including whitelist capability).
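The regex approach could look roughly like the sketch below. This is not the recipe from the O'Reilly book and not part of HTML::Strip; strip_tags_except is a hypothetical helper name, and the pattern assumes reasonably well-formed markup with no ">" inside attribute values. It also keeps the text inside stripped elements such as <script>, so it should not be treated as a sanitizer.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::Strip): remove every tag whose
# name is not in the whitelist, keeping the text between the tags.
sub strip_tags_except {
    my ($html, @allowed) = @_;
    my %keep = map { lc($_) => 1 } @allowed;

    # Drop comments first so a '>' inside them cannot confuse the tag pattern.
    $html =~ s/<!--.*?-->//gs;

    # Match an opening or closing tag and capture its name.
    # Whitelisted tags are re-emitted bare (attributes are discarded);
    # everything else is replaced with the empty string.
    $html =~ s{<(/?)([A-Za-z][A-Za-z0-9:-]*)[^>]*>}
              { $keep{lc $2} ? "<$1" . lc($2) . ">" : '' }ge;

    return $html;
}

print strip_tags_except('<div><b>test</b></div>', qw(i b u)), "\n";   # <b>test</b>
```

Discarding attributes on whitelisted tags is a deliberate choice in this sketch: it avoids passing through things like onclick handlers, at the cost of losing harmless attributes too.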
If you'd rather not mess with regular expressions, try HTML::StripScripts::Parser; it seems to use whitelists.