How do I strip HTML tags from a ColdFusion string?
I am looking for a quick way to strip HTML tags out of a ColdFusion string. We are pulling in an RSS feed that could contain anything. We then manipulate the information and send it on to another place. Currently we do this with a regular expression. Is there a better way?
<cfloop from="1" to="#ArrayLen(myFeed.item)#" index="i">
<cfset myFeed.item[i].description.value =
REReplaceNoCase(myFeed.item[i].description.value, '<(.|\n)*?>', '', 'ALL')>
</cfloop>
We are using ColdFusion 8.
Comments (6)
Disclaimer: I am a fierce advocate of using a proper parser (instead of regex) to parse HTML. However, this question isn't about parsing HTML, but about destroying it. For all tasks that go beyond that, use a parser.
I think your regex is good. As long as you do nothing more than remove all HTML tags from the input, using a regex like yours is safe.
Anything else would probably be more hassle than it's worth, but you could write a small function that loops through the string char-by-char once and removes everything within tag brackets, e.g. by switching a flag on at each "<" character and off again at the next ">". For a high-demand part of your app, this may be faster than the regex. But the regex is clean and probably fast enough.
Maybe a modified regex has some advantages for you: [^>]* is better than (.|\n)*?, and REReplaceNoCase() is unnecessary when there are no actual letters in the pattern. Case-insensitive regex matching is slower than case-sensitive matching.
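The two options described in this answer might look roughly like the sketch below. The function names `stripTagsRegex` and `stripTagsScan` are hypothetical, and the code is written in old-style cfscript on the assumption that it has to run on ColdFusion 8. The scan variant uses a `java.lang.StringBuffer` to avoid repeated string concatenation.

```cfml
<cfscript>
// Regex variant: case-sensitive REReplace() with [^>]* instead of (.|\n)*?
function stripTagsRegex(html) {
    return REReplace(html, "<[^>]*>", "", "ALL");
}

// Scan variant: copy characters only while we are outside "<" ... ">"
function stripTagsScan(html) {
    var out = CreateObject("java", "java.lang.StringBuffer").init();
    var inTag = false;
    var ch = "";
    var i = 0;
    for (i = 1; i LTE Len(html); i = i + 1) {
        ch = Mid(html, i, 1);
        if (ch EQ "<") {
            inTag = true;              // entering a tag: stop copying
        } else if (ch EQ ">" AND inTag) {
            inTag = false;             // leaving a tag: resume copying
        } else if (NOT inTag) {
            out.append(ch);            // ordinary text, or a stray ">"
        }
    }
    return out.toString();
}
</cfscript>

<cfset viaRegex = stripTagsRegex("<p>Hello <b>world</b></p>")>
<cfset viaScan = stripTagsScan("<p>Hello <b>world</b></p>")>
```

Both calls should return "Hello world" for this input. Note that the scan variant treats every "<" as the start of a tag, so an unmatched "<" in plain text swallows the rest of the string.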
HTML is not a regular language, so using regular expressions on (uncontrolled) HTML is something that should be done with great care (if at all).
Consider, for example, the following valid segment of HTML:
You'll note how the syntax highlighter is choking on that - as will the existing regex that has been offered.
Unless you can be certain that the string you are processing will not contain HTML similar to the above, you should avoid the assumptions and compromises that a pure regex route forces on you.
(Note: The same problem applies to the suggested char-by-char method too.)
To solve your problem, you should use a DOM parser to parse your string into an HTML object, then loop through each element and convert it to text.
If you have valid XHTML, you can use CF's XmlParse() to produce an object which you can then loop through. If it might be non-XML HTML, there is no built-in option in CF8, so you'll have to investigate options in Java/etc.
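As a rough illustration of the XmlParse() route, the sketch below assumes the input is well-formed XHTML. The function name `xhtmlToText`, the `<root>` wrapper element, and the sample string are all hypothetical. It also relies on XmlSearch() accepting an XPath expression that returns a string, which is my understanding of the ColdFusion 8 behaviour; treat that as an assumption to verify.

```cfml
<cffunction name="xhtmlToText" returntype="string" output="false">
    <cfargument name="xhtml" type="string" required="true">
    <!--- Wrap in a hypothetical <root> element so fragments with several top-level nodes still parse --->
    <cfset var doc = XmlParse("<root>" & arguments.xhtml & "</root>")>
    <!--- string(/*) is the concatenated text of every descendant text node --->
    <cfreturn XmlSearch(doc, "string(/*)")>
</cffunction>

<!--- Hypothetical sample: a ">" inside an attribute value is legal markup --->
<cfset sample = '<p title="a > b">Hello <b>world</b></p>'>
<cfset viaRegex = REReplace(sample, "<[^>]*>", "", "ALL")>
<cfset viaParser = xhtmlToText(sample)>
```

On this sample the regex stops at the first ">" inside the attribute and leaves part of the tag behind, while the parser only ever sees text nodes. XmlParse() throws on anything that is not well-formed XML (unclosed tags, undeclared entities such as `&nbsp;`), so real feed content would need a try/catch with a fallback.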
I use this:
It works fine in 99% of cases.
The best way is usually to coerce < to &lt; and > to &gt;. This way you aren't making assumptions about the nature of the message. Somebody may be talking about <tags>, trying to be <<expressive>>, describing a keystroke such as <Ctrl>+C, or using maths like 1 < x > 3. Even smilies could trigger the regex: <8P X>
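A minimal sketch of that coercion, using only Replace(). The function name `escapeAngles` and the sample text are hypothetical. It deliberately touches nothing but the two angle brackets, as the answer describes.

```cfml
<cffunction name="escapeAngles" returntype="string" output="false">
    <cfargument name="text" type="string" required="true">
    <!--- Replace every "<" first, then every ">" --->
    <cfreturn Replace(Replace(arguments.text, "<", "&lt;", "ALL"), ">", "&gt;", "ALL")>
</cffunction>

<cfset safe = escapeAngles("Press <Ctrl>+C when 1 < x")>
```

Here `safe` holds `Press &lt;Ctrl&gt;+C when 1 &lt; x`. The built-in HTMLEditFormat() does a similar job and also escapes `&` and double quotes, which may or may not be wanted if the feed text already contains entities.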
cflib is your friend: stripHTML
Output: b = "(PCB) <1 ppm"
The regex "<[^><]*>" removes all tags and the characters inside them, but it does not remove a lone < or >, which may be used as a less-than or greater-than sign in the string.
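For illustration, an input that would produce that output might look like the following. The value of `a` is an assumption, since the original input is not shown here.

```cfml
<!--- Hypothetical input: real tags around "(PCB)" plus a literal less-than sign --->
<cfset a = "<b>(PCB)</b> <1 ppm">
<!--- [^><] refuses to cross another "<", so the lone "<" before "1 ppm" never starts a match --->
<cfset b = REReplace(a, "<[^><]*>", "", "ALL")>
```

With this input, `b` is `(PCB) <1 ppm`: both tags are removed and the less-than sign survives because no ">" follows it.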