How to optimize this regex?

Posted on 2024-09-14 02:26:28

My tool takes plain text and gradually generates "tags" by replacing terms in the text with tags. Because some terms are compound, the only way (I think) is to use replaceAll with a regex.

Thanks to the folks at Stack Overflow, my last question got me an excellent regex for my app, but after some testing a new need emerged:

"A regex to replace every word that is OUTSIDE a tag AND outside another word"

The original code:

String str = "world worldwide <a href=\"world\">my world</world>underworld world";
str = str.replaceAll("\\bworld\\b(?![^<>]*+>)", "repl");
System.out.println(str);

I now need to replace only "world" (outside a tag, of course) and NOT "underworld" or "worldwide".

Expected result:

repl worldwide <a href="world">my world</world>underworld repl


混浊又暗下来 2024-09-21 02:26:28

I don't think regex is the best tool for the job, but if you just want to tweak and optimize what you have right now, you can use the word boundary \b, throw away the unnecessary capturing group and optional repetition specifier, and use possessive repetition:

\bworld\b(?![^<>]*+>)

The \bworld\b ensures that "world" is surrounded by zero-width word boundary anchors. This prevents it from matching the "world" in "underworld" and "worldwide". Do note that the word boundary definition may not be exactly what you want: e.g. \bworld\b will not match the "world" in "a_world_domination", because the underscore counts as a word character.
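To see the word boundaries in isolation, here is a minimal sketch. The class, field, and method names are illustrative and not from the question. The lookaround variant is only one possible alternative, assuming you want an underscore to act as a separator.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordBoundaryDemo {
    // Whole-word match using word boundaries; '_' counts as a word character.
    static final Pattern WITH_BOUNDARIES = Pattern.compile("\\bworld\\b");

    // Hypothetical alternative: only ASCII letters and digits glue words together,
    // so '_' is treated as a separator.
    static final Pattern WITH_LOOKAROUNDS =
            Pattern.compile("(?<![A-Za-z0-9])world(?![A-Za-z0-9])");

    // Counts how many times the pattern is found in the text.
    static int count(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String[] samples = {"hello world!", "underworld", "worldwide", "a_world_domination"};
        for (String s : samples) {
            System.out.println(s + " -> \\b: " + count(WITH_BOUNDARIES, s)
                    + ", lookarounds: " + count(WITH_LOOKAROUNDS, s));
        }
    }
}
```

Only the last sample should differ between the two patterns: with \b the underscores keep "world" glued to its neighbours, while the lookaround variant treats them as separators.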

The original pattern also contains a subpattern that looks like (x+)?. This is probably better formulated as simply x*. That is, instead of "zero-or-one" ? of "one-or-more" +, simply "zero-or-more" *.
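The equivalence can be checked with a small sketch. The letters a and b stand in for the unspecified x and whatever follows it; they are placeholders, not the original subpattern, and the class name is illustrative.

```java
import java.util.regex.Pattern;

public class OptionalRepetitionDemo {
    // "Zero-or-one of one-or-more", the (x+)? shape.
    static final Pattern NESTED = Pattern.compile("(a+)?b");
    // The simpler "zero-or-more" form.
    static final Pattern SIMPLE = Pattern.compile("a*b");

    // True when both patterns give the same full-match verdict for the input.
    static boolean agree(String input) {
        return NESTED.matcher(input).matches() == SIMPLE.matcher(input).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"b", "ab", "aaab", "a", "ba", ""}) {
            System.out.println("\"" + s + "\" agree: " + agree(s));
        }
    }
}
```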

The capturing group (…) is functionally not needed, and it doesn't seem like you need the capture for any substitution in the replacement, so getting rid of it can improve performance. (When you need the grouping aspect but not the capturing aspect, you can use a non-capturing group (?:…) instead.)

Note also that instead of [^<], we now forbid both brackets with [^<>]. Now the repetition can be specified as possessive since no backtracking is required in this case.

(The […] is a character class. Something like [aeiou] matches any one of the lowercase vowels. [^…] is a negated character class. [^aeiou] matches any one character except the lowercase vowels.)
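A short sketch of what "possessive" means and why it is safe here. The class and helper names are illustrative, and lookingAt is used to mimic how a lookahead tries its subpattern at one fixed position.

```java
import java.util.regex.Pattern;

public class PossessiveDemo {
    // True when the regex matches starting at the very beginning of the text,
    // similar to how a lookahead is tried at a single position.
    static boolean matchesAtStart(String regex, String text) {
        return Pattern.compile(regex).matcher(text).lookingAt();
    }

    public static void main(String[] args) {
        // Greedy: a* first takes "aaa", then gives one 'a' back so the final 'a' can match.
        System.out.println(matchesAtStart("a*a", "aaa"));
        // Possessive: a*+ takes "aaa" and never gives anything back, so the final 'a' fails.
        System.out.println(matchesAtStart("a*+a", "aaa"));

        // [^<>] can never consume '>', so backtracking could not help anyway:
        // the greedy and possessive forms give the same verdict on every sample.
        for (String s : new String[] {"\">my world", "</world>", " worldwide <a", "plain text"}) {
            System.out.println(matchesAtStart("[^<>]*>", s) == matchesAtStart("[^<>]*+>", s));
        }
    }
}
```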

Of course (?!…) is negative lookahead; it asserts that a given pattern can NOT be matched. So the overall pattern reads like this:

\bworld\b(?![^<>]*+>)
\_______/\__________/ NOT the case that
 "world"                      the first bracket to its right is a closing one
 surrounded by
 word boundary anchors
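To watch the two halves of the pattern work separately, this sketch lists where each version matches in the sample string from the question. The class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaheadFilterDemo {
    // Returns the start offset of every match of the regex in the text.
    static List<Integer> matchStarts(String regex, String text) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        String text = "world worldwide <a href=\"world\">my world</world>underworld world";
        // Whole-word occurrences, regardless of tags.
        System.out.println(matchStarts("\\bworld\\b", text));
        // Whole-word occurrences whose next bracket is not a closing '>'.
        System.out.println(matchStarts("\\bworld\\b(?![^<>]*+>)", text));
    }
}
```

The second list is the first one minus the two occurrences that sit inside a tag: the one in href="world" and the one in </world>.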

References

regular-expressions.info: Word Boundaries, Brackets for Grouping, Repetition, Possessive, Lookarounds

Note that to get a backslash in a Java string literal, you need to double it, so the whole pattern as a Java string literal is "\\bworld\\b(?![^<>]*+>)".
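Putting it together, here is a minimal runnable sketch using the string from the question. The class and method names are illustrative, and compiling the pattern once into a constant is a choice made for this example rather than something the question requires.

```java
import java.util.regex.Pattern;

public class ReplaceOutsideTags {
    // Each regex backslash is doubled inside the Java string literal.
    static final Pattern WORLD_OUTSIDE_TAGS = Pattern.compile("\\bworld\\b(?![^<>]*+>)");

    // Replaces every whole-word "world" that is not inside a <...> tag.
    static String replaceWorld(String text, String replacement) {
        return WORLD_OUTSIDE_TAGS.matcher(text).replaceAll(replacement);
    }

    public static void main(String[] args) {
        String str = "world worldwide <a href=\"world\">my world</world>underworld world";
        System.out.println(replaceWorld(str, "repl"));
        // prints: repl worldwide <a href="world">my repl</world>underworld repl
    }
}
```

Note that the "world" in "my world" is replaced too: it sits between two tags rather than inside one, so the first bracket to its right is an opening one. That differs from the expected result given in the question. Leaving element content untouched would take more than this lookahead, which is one reason a regex may not be the best tool for the job.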
