How do I strip HTML from a string using Perl?

Posted 2024-07-26 01:24:04, 392 words, 9 views, 0 comments

Is there an easier way than this to strip HTML from a string using Perl?

$Error_Msg =~ s|<b>||ig;
$Error_Msg =~ s|</b>||ig;
$Error_Msg =~ s|<h1>||ig;
$Error_Msg =~ s|</h1>||ig;
$Error_Msg =~ s|<br>||ig;

I would appreciate a slimmed-down regular expression, e.g. something like this:

$Error_Msg =~ s|</?[b|h1|br]>||ig;

Is there an existing Perl function that strips any/all HTML from a string, even though I only need bolds, h1 headers and br stripped?

城歌 2024-08-02 01:24:04

Assuming the code is valid HTML (no stray < or > operators)

$htmlCode =~ s|<.+?>||g;

If you need to remove only bolds, h1's and br's

$htmlCode =~ s#</?(?:b|h1|br)\b.*?>##gi;

And you might want to consider the HTML::Strip module
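
To spell out why the bracketed form in the question does not do what was intended: [b|h1|br] is a character class, so it matches exactly one character out of b, |, h, 1 and r, which means <h1> and <br> are never matched. A minimal sketch of the grouped-alternation version follows. The sub name strip_selected_tags and the sample string are made up for illustration, and [^>]* is used in place of .*? so that attributes spanning a line break are skipped as well.

```perl
use strict;
use warnings;

# Hypothetical helper: remove only <b>, <h1> and <br> tags (opening,
# closing, or self-closing) and leave every other tag untouched.
sub strip_selected_tags {
    my ($text) = @_;
    # (?:b|h1|br) is an alternation of whole tag names.
    # \b keeps "b" from matching the start of "<body>".
    # [^>]* skips any attributes up to the closing ">".
    $text =~ s{</?(?:b|h1|br)\b[^>]*>}{}gi;
    return $text;
}

my $error_msg = '<H1>Error</H1><b>Disk</b> full<br />on <body>';
print strip_selected_tags($error_msg), "\n";   # prints: ErrorDisk fullon <body>
```

Like any regex approach, this assumes no literal > appears inside an attribute value of the tags being removed.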

哎呦我呸! 2024-08-02 01:24:04

You should definitely have a look at HTML::Restrict, which lets you strip or restrict the HTML tags that are allowed. A minimal example that strips all HTML tags:

use HTML::Restrict;

my $hr = HTML::Restrict->new();
my $processed = $hr->process('<b>i am bold</b>'); # returns 'i am bold'

I would recommend staying away from HTML::Strip because it breaks UTF-8 encoding.

千紇 2024-08-02 01:24:04

From perlfaq9: How do I remove HTML from a string?

The most correct way (albeit not the fastest) is to use HTML::Parser from CPAN. Another mostly correct way is to use HTML::FormatText which not only removes HTML but also attempts to do a little simple formatting of the resulting plain text.

Many folks attempt a simple-minded regular expression approach, like s/<.*?>//g, but that fails in many cases because the tags may continue over line breaks, they may contain quoted angle-brackets, or HTML comments may be present. Plus, folks forget to convert entities, like &lt; for example.

Here's one "simple-minded" approach, that works for most files:

#!/usr/bin/perl -p0777
s/<(?:[^>'"]*|(['"]).*?\1)*>//gs

If you want a more complete solution, see the 3-stage striphtml program in http://www.cpan.org/authors/id/T/TO/TOMC/scripts/striphtml.gz .

Here are some tricky cases that you should think about when picking a solution:

<IMG SRC = "foo.gif" ALT = "A > B">

<IMG SRC = "foo.gif"
 ALT = "A > B">

<!-- <A comment> -->

<script>if (a<b && a>c)</script>

<# Just data #>

<![INCLUDE CDATA [ >>>>>>>>>>>> ]]>

If HTML comments include other tags, those solutions would also break on text like this:

<!-- This section commented out.
    <B>You can't see me!</B>
-->
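
To make the tricky cases concrete, here is a small sketch that wraps the one-liner quoted above in a sub and runs it against a few of them. The sub name strip_html_naive is made up for illustration; only the substitution itself comes from the FAQ text.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the "simple-minded" perlfaq9 substitution.
sub strip_html_naive {
    my ($html) = @_;
    $html =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
    return $html;
}

my @cases = (
    '<IMG SRC = "foo.gif" ALT = "A > B">',        # quoted ">" is handled: []
    "<IMG SRC = \"foo.gif\"\n ALT = \"A > B\">",  # tag spanning lines:   []
    '<!-- <A comment> -->',                       # comment breaks:       [ -->]
    '<script>if (a<b && a>c)</script>',           # script is mangled:    [if (ac)]
    '<b>x</b> &lt; y',                            # entity left as-is:    [x &lt; y]
);

for my $case (@cases) {
    print '[', strip_html_naive($case), "]\n";
}
```

The first two cases come out clean, while the comment and script cases show the kind of damage the FAQ warns about, and the last one shows that entities are not decoded. That is the argument for reaching for a real parser such as HTML::Parser when the input is not under your control.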
