How do I strip HTML tags from a ColdFusion string?

Posted on 2024-07-22 22:31:06

I am looking for a quick way to strip HTML tags out of a ColdFusion string. We are pulling in an RSS feed that could potentially have anything in it. We then manipulate the information and send it on to another place. Currently we do this with a regular expression. Is there a better way?

<cfloop from="1" to="#ArrayLen(myFeed.item)#" index="i">
  <cfset myFeed.item[i].description.value = 
   REReplaceNoCase(myFeed.item[i].description.value, '<(.|\n)*?>', '', 'ALL')>
</cfloop>

We are using ColdFusion 8.

躲猫猫 2024-07-29 22:31:06

Disclaimer: I am a fierce advocate of using a proper parser (instead of regex) to parse HTML. However, this question isn't about parsing HTML, but about destroying it. For all tasks that go beyond that, use a parser.

I think your regex is good. As long as there is nothing more than removing all HTML tags from the input, using a regex like yours is safe.

Anything else would probably be more hassle than it's worth, but you could write a small function that loops through the string char-by-char once and removes everything that's within tag brackets — e.g.:

switch on an "inTag" flag as soon as you encounter a "<" character,
switch it off as soon as you encounter ">"
copy characters to the output string as long as the flag is off
for performance, use a StringBuilder Java object instead of string concatenation
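
The steps above can be sketched roughly as follows. This is only an illustration, not code from the original answer: the function name `stripTagsByChar` is made up, and the sketch assumes a Java runtime that provides `java.lang.StringBuilder` (Java 5 or later). It uses the classic `LTE`/`EQ` operators so that it does not depend on newer cfscript syntax.

```cfml
<cfscript>
// Hypothetical helper (name is an assumption): drops everything between "<" and ">".
function stripTagsByChar(input) {
    // A Java StringBuilder avoids repeated CFML string concatenation.
    var sb = createObject("java", "java.lang.StringBuilder").init();
    var inTag = false;
    var ch = "";
    var i = 0;

    for (i = 1; i LTE len(input); i = i + 1) {
        ch = mid(input, i, 1);
        if (ch EQ "<") {
            // Opening bracket: start skipping characters.
            inTag = true;
        } else if (ch EQ ">" AND inTag) {
            // Closing bracket: resume copying after this character.
            inTag = false;
        } else if (NOT inTag) {
            // javaCast pins the overload to append(String).
            sb.append(javaCast("string", ch));
        }
    }
    return sb.toString();
}
</cfscript>
```

Like the regex, this treats every "<" as the start of a tag, so a stray less-than sign in plain text swallows everything up to the next ">".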

For a high-demand part of your app, this may be faster than the regex. But the regex is clean and probably fast enough.

Maybe this modified regex has some advantages for you:

<[^>]*(?:>|$)

catches unclosed tags at the end of the string
[^>]* is better than (.|\n)*?

The use of REReplaceNoCase() is unnecessary when there are no actual letters in the pattern. Case-insensitive regex matching is slower than doing it case-sensitively.
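
Applied to the loop from the question, the two suggestions (the modified pattern and the case-sensitive `REReplace()`) might look like the sketch below. The `myFeed` structure is taken from the question; nothing else is assumed.

```cfml
<cfloop from="1" to="#ArrayLen(myFeed.item)#" index="i">
  <!--- Case-sensitive REReplace is enough: the pattern contains no letters. --->
  <cfset myFeed.item[i].description.value =
    REReplace(myFeed.item[i].description.value, "<[^>]*(?:>|$)", "", "ALL")>
</cfloop>
```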

放肆 2024-07-29 22:31:06

HTML is not a Regular language, so using Regular expressions on (uncontrolled) HTML is something that should be done with great care (if at all).

Consider, for example, the following valid segment of HTML:

<img src="boat.jpg" alt="a boat" title="My boat is > everything! I <3 my boat!">

You'll note how the syntax highlighter is choking on that - as will the existing regex that has been offered.

Unless you can be certain that the string you are processing will never contain HTML like the above, you should avoid the assumptions and compromises that a pure regex approach forces on you.

(Note: The same problem applies to the suggested char-by-char method too.)

To solve your problem, you should use a DOM parser to parse your string into an HTML object, then loop through each element and convert it to text.

If you have valid XHTML then you can use CF's XmlParse() to produce an object which you can then loop through.
If it might be non-XML HTML then there's no built-in option with CF8, so you'll have to investigate options in Java/etc.
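
For the XHTML case, a minimal sketch might look like this. It is an assumption-laden illustration rather than a tested recipe: the function name `xhtmlToText` and the wrapper element `<root>` are made up, and it relies on `XmlSearch()` returning a plain string for an XPath `string()` expression, which to my knowledge it does from ColdFusion 8 onwards.

```cfml
<cfscript>
// Hypothetical helper (name is an assumption): text content of a well-formed XHTML fragment.
function xhtmlToText(fragment) {
    var doc = "";
    // Wrap the fragment in a made-up root element so several top-level nodes still parse.
    doc = xmlParse("<root>" & fragment & "</root>");
    // string(/*) is the concatenation of all descendant text nodes of the root element.
    return xmlSearch(doc, "string(/*)");
}
</cfscript>
```

`XmlParse()` throws an error on anything that is not well-formed XML, such as unclosed tags, a raw "<" inside an attribute value, or HTML-only entities like `&nbsp;`, so the call would need a `cftry` or `try`/`catch` around it before being used on an arbitrary feed.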

你是暖光i 2024-07-29 22:31:06

I use this:

REReplaceNoCase(text, "<[^[:space:]][^>]*>", "", "ALL");

It works fine in 99% of cases.

撩发小公举 2024-07-29 22:31:06

The best way is usually to coerce < to &lt; and > to &gt;. This way you aren't making assumptions about the nature of the message. Somebody may be talking about <tags>, trying to be <<expressive>>, describing a keystroke such as <Ctrl>+C, or using maths such as 1 < x > 3. Even smilies could trigger the regex: <8P X>

<cfloop from="1" to="#ArrayLen(myFeed.item)#" index="i">
    <cfset myFeed.item[i].description.value = ReplaceList(myFeed.item[i].description.value, '<,>', '&lt;,&gt;')>
</cfloop>
</cfloop>
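
As a variation on the same escaping idea, ColdFusion's built-in `HTMLEditFormat()` could be used instead of `ReplaceList()`. This is a sketch, not part of the original answer; the difference is that `HTMLEditFormat()` also escapes ampersands and double quotes, which may or may not be what the receiving side expects.

```cfml
<cfloop from="1" to="#ArrayLen(myFeed.item)#" index="i">
    <!--- Escapes the HTML special characters instead of deleting anything. --->
    <cfset myFeed.item[i].description.value =
        HTMLEditFormat(myFeed.item[i].description.value)>
</cfloop>
```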

看春风乍起 2024-07-29 22:31:06

cflib is your friend: stripHTML

迷路的信 2024-07-29 22:31:06

<cfset a = "<b><font color = 'red'>(PCB) <1 ppm </font></b>">

<cfset b = REReplaceNoCase(a, "<[^><]*>", '', 'ALL')>

<cfdump var="#b#">

Output: b = "(PCB) <1 ppm "

The regex "<[^><]*>" removes all tags together with the characters inside them, but leaves a lone < or > in place, so those can still be used as less-than or greater-than symbols in the string.
