使用非贪婪限定符还是前瞻更好?
我可能有一大块文本来搜索 [[...]]
的实例,其中 ...
可以是任何内容,包括其他括号(尽管它们不能嵌套;[[
结束匹配后 ]]
的第一个实例)。
我可以想到两种方法来匹配此文本:
- 使用非贪婪限定符:
/\[\[.+?\]\]/
- 使用前瞻:
/\[\[ (?:(?!\]\]).)+\]\]/
从性能的角度来看,一个选择本质上比另一个选择更好(我想说第一个可能更具可读性)?我记得读到最好不要使用非贪婪限定符,但我现在找不到来源。
I have a possibly large block of text to search for instances of [[...]]
, where the ...
can be anything, including other brackets (though they cannot be nested; the first instance of ]]
after [[
ends the match).
I can think of two ways to match this text:
- Using a non-greedy qualifier:
/\[\[.+?\]\]/
- Using a lookahead:
/\[\[(?:(?!\]\]).)+\]\]/
Is one choice inherently better than the other, from a performance standpoint (I'd say the first is probably more readable)? I recall reading that it's better not to use non-greedy qualifiers, but I cannot find a source for that now.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
在这种情况下最好使用非贪婪量词。
以字符串为例
"[[a]b]]"
非贪婪量词
\[
匹配\[
匹配.+?
匹配"a"
\]
匹配\]
失败,返回 #3,但保持字符串位置.+?
与"]"
匹配\]
失败,返回到 #3,但保留字符串位置.+?
匹配"b"
\]
匹配\]
匹配前瞻:
\[
匹配\[
匹配(?!\] \])
在"a"
处立即成功(即不匹配),继续.
与"a"
匹配code>,在 #3(?!\]\])
处重复,在"]"
(?!\] 处实现部分匹配\])
在"b"
处成功(即不匹配),继续。
与"]"
匹配>,在 #3(?!\]\])
处重复,立即在"b"
处成功(即不匹配),继续"b"
,在 #3(?!\]\])
在" 处实现部分匹配]"
(?!\]\])
在"]"
处实现完全匹配,因此:#4 失败,退出 #3\]
匹配\]
匹配因此,看起来非贪婪量词要做的工作较少。
免责声明:这是一个人为的示例,实际性能可能会有所不同,具体取决于输入、实际表达式和正则表达式引擎的实现。我只有 98% 确信我在这里概述的内容是实际发生的情况,因此我愿意接受更正。另外,与所有性能技巧一样,不要只看表面价值,如果您想确定的话,请进行自己的基准比较。
It is better to use a non-greedy quantifier in this case.
Take this example string
"[[a]b]]"
Non-greedy quantifier
\[
matches\[
matches.+?
matches the"a"
\]
matches\]
fails, back to #3 but keep string position.+?
matches the"]"
\]
fails, back to #3 but keep string position.+?
matches the"b"
\]
matches\]
matchesLook-ahead:
\[
matches\[
matches(?!\]\])
succeeds (i.e. non-match) immediately at"a"
, go on.
matches the"a"
, repeat at #3(?!\]\])
achieves partial match at"]"
(?!\]\])
succeeds (i.e. non-match) at"b"
, go on.
matches the"]"
, repeat at #3(?!\]\])
succeeds (i.e. non-match) immediately at"b"
, go on.
matches the"b"
, repeat at #3(?!\]\])
achieves partial match at"]"
(?!\]\])
achieves full match at"]"
, ergo: #4 fails, exit #3\]
matches\]
matchesSo it looks like the non-greedy quantifier has less work to do.
Disclaimer: This is an artificial example and real-life performance may differ, depending on the input, the actual expression and the implementation of the regex engine. I'm only 98% sure that what I outlined here is what is actually happening, so I'm open for corrections. Also, as with all performance tips, don't take this at face value, do your own benchmark comparisons if you want to know for sure.
另一种变体:
/\[\[((?:\]?[^]])+)]]/
它既不使用非贪婪量词,也不使用前瞻。它允许在任何非
]
之前出现单个]
。如果顺序有两个]
,则内部重复将停止,并且比赛将结束。此模式最适合与 FSA 编译正则表达式引擎一起使用。在回溯引擎上,它可能会比非贪婪变体慢。
Another variant:
/\[\[((?:\]?[^]])+)]]/
It uses neither non-greedy quantifiers nor look-aheads. It allows a single
]
before any non-]
. If there would be two]
in sequence, the inner repetition would stop, and and the match would end.This pattern would be best to use with FSA-compiling regex engines. On back-tracking engines, it could get slower than the non-greedy variant.
您使用哪种正则表达式?如果它支持所有格量词,那么还有一种更好的选择:
[^\]]++
吞噬除]
之外的任何字符,并且不费心保存状态使回溯成为可能的信息。如果它确实看到]
,它会执行向前查找以查看是否还有另一个。将整个事物包装在另一个所有格量词中意味着它只会在看到]
时进行前瞻,并且仅回溯一次:当它找到结束的]]
时。Java、JGSoft、PCRE (PHP)、Oniguruma (Ruby 1.9) 和 Perl 5.12 风格支持所有格量词。所有这些风格还支持原子组,可用于实现相同的效果:
.NET 风格支持原子组,但不支持所有格量词。
Which regex flavor are you using? If it's one that supports possessive quantifiers, there's a much better alternative:
[^\]]++
gobbles up any characters other than]
and doesn't bother saving the state information that would make backtracking possible. If it does see a]
, it performs a lookahead to see if there's another. Wrapping the whole thing in another possessive quantifier means it only does a lookahead whenever it sees a]
, and it only backtracks once: when it finds the closing]]
.Possessive quantifiers are supported by the Java, JGSoft, PCRE (PHP), Oniguruma (Ruby 1.9), and Perl 5.12 flavors. All those flavors also support atomic groups, which can be used to achieve the same effect:
The .NET flavor supports atomic groups but not possessive quantifiers.
我认为最好使用非贪婪限定符。您确定您读到的文章不是在说“小心贪婪匹配”吗?
I would think it is better to use the non-greedy qualifier. Are you sure that the article you read wasn't saying "be careful with greedy matching?"