使用非贪婪限定符还是前瞻更好?

发布于 2024-09-05 01:40:29 字数 349 浏览 6 评论 0原文

我可能有一大块文本来搜索 [[...]] 的实例,其中 ... 可以是任何内容,包括其他括号(尽管它们不能嵌套;[[ 结束匹配后 ]] 的第一个实例)。

我可以想到两种方法来匹配此文本:

  • 使用非贪婪限定符: /\[\[.+?\]\]/
  • 使用前瞻: /\[\[ (?:(?!\]\]).)+\]\]/

从性能的角度来看,一个选择本质上比另一个选择更好(我想说第一个可能更具可读性)?我记得读到最好不要使用非贪婪限定符,但我现在找不到来源。

I have a possibly large block of text to search for instances of [[...]], where the ... can be anything, including other brackets (though they cannot be nested; the first instance of ]] after [[ ends the match).

I can think of two ways to match this text:

  • Using a non-greedy qualifier: /\[\[.+?\]\]/
  • Using a lookahead: /\[\[(?:(?!\]\]).)+\]\]/

Is one choice inherently better than the other, from a performance standpoint (I'd say the first is probably more readable)? I recall reading that it's better not to use non-greedy qualifiers, but I cannot find a source for that now.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

最美不过初阳 2024-09-12 01:40:29

在这种情况下最好使用非贪婪量词。

以字符串为例 "[[a]b]]"

非贪婪量词

       \[\[.+?\]\]
Atom # 1 2 3  4 5
  1. Atom #1 \[ 匹配
  2. Atom #2 \[ 匹配
  3. Atom #3 .+? 匹配 "a"
  4. Atom #4 \] 匹配
  5. Atom #5 \]失败,返回 #3,但保持字符串位置
  6. Atom #3 .+?"]" 匹配
  7. Atom #4 \] 失败,返回到 #3,但保留字符串位置
  8. Atom #3 .+? 匹配 "b"
  9. Atom #4 \] 匹配
  10. Atom #5 \] 匹配
  11. 成功

前瞻:

       \[\[(?:(?!\]\]).)+\]\]
Atom # 1 2 3  4       5  6 7
  1. Atom #1 \[ 匹配
  2. Atom #2 \[ 匹配
  3. Atom #4 (?!\] \])"a" 处立即成功(即不匹配),继续
  4. Atom #5 ."a" 匹配code>,在 #3
  5. Atom #4 (?!\]\]) 处重复,在 "]"
  6. Atom #4 (?!\] 处实现部分匹配\])"b" 处成功(即不匹配),继续
  7. Atom #5 "]" 匹配>,在 #3
  8. Atom #4 (?!\]\]) 处重复,立即在 "b" 处成功(即不匹配),继续
  9. Atom #5 < code>. 匹配 "b",在 #3
  10. Atom 处重复 #4 (?!\]\])" 处实现部分匹配]"
  11. Atom #4 (?!\]\])"]" 处实现完全匹配,因此:#4 失败,退出 #3
  12. Atom # 6 \] 匹配
  13. Atom #7 \] 匹配
  14. success

因此,看起来非贪婪量词要做的工作较少。

免责声明:这是一个人为的示例,实际性能可能会有所不同,具体取决于输入、实际表达式和正则表达式引擎的实现。我只有 98% 确信我在这里概述的内容是实际发生的情况,因此我愿意接受更正。另外,与所有性能技巧一样,不要只看表面价值,如果您想确定的话,请进行自己的基准比较。

It is better to use a non-greedy quantifier in this case.

Take this example string "[[a]b]]"

Non-greedy quantifier

       \[\[.+?\]\]
Atom # 1 2 3  4 5
  1. Atom #1 \[ matches
  2. Atom #2 \[ matches
  3. Atom #3 .+? matches the "a"
  4. Atom #4 \] matches
  5. Atom #5 \] fails, back to #3 but keep string position
  6. Atom #3 .+? matches the "]"
  7. Atom #4 \] fails, back to #3 but keep string position
  8. Atom #3 .+? matches the "b"
  9. Atom #4 \] matches
  10. Atom #5 \] matches
  11. success

Look-ahead:

       \[\[(?:(?!\]\]).)+\]\]
Atom # 1 2 3  4       5  6 7
  1. Atom #1 \[ matches
  2. Atom #2 \[ matches
  3. Atom #4 (?!\]\]) succeeds (i.e. non-match) immediately at "a", go on
  4. Atom #5 . matches the "a", repeat at #3
  5. Atom #4 (?!\]\]) achieves partial match at "]"
  6. Atom #4 (?!\]\]) succeeds (i.e. non-match) at "b", go on
  7. Atom #5 . matches the "]", repeat at #3
  8. Atom #4 (?!\]\]) succeeds (i.e. non-match) immediately at "b", go on
  9. Atom #5 . matches the "b", repeat at #3
  10. Atom #4 (?!\]\]) achieves partial match at "]"
  11. Atom #4 (?!\]\]) achieves full match at "]", ergo: #4 fails, exit #3
  12. Atom #6 \] matches
  13. Atom #7 \] matches
  14. success

So it looks like the non-greedy quantifier has less work to do.

Disclaimer: This is an artificial example and real-life performance may differ, depending on the input, the actual expression and the implementation of the regex engine. I'm only 98% sure that what I outlined here is what is actually happening, so I'm open for corrections. Also, as with all performance tips, don't take this at face value, do your own benchmark comparisons if you want to know for sure.

未蓝澄海的烟 2024-09-12 01:40:29

另一种变体: /\[\[((?:\]?[^]])+)]]/

它既不使用非贪婪量词,也不使用前瞻。它允许在任何非 ] 之前出现单个 ]。如果顺序有两个],则内部重复将停止,并且比赛将结束。

此模式最适合与 FSA 编译正则表达式引擎一起使用。在回溯引擎上,它可能会比非贪婪变体慢。

Another variant: /\[\[((?:\]?[^]])+)]]/

It uses neither non-greedy quantifiers nor look-aheads. It allows a single ] before any non-]. If there would be two ] in sequence, the inner repetition would stop, and and the match would end.

This pattern would be best to use with FSA-compiling regex engines. On back-tracking engines, it could get slower than the non-greedy variant.

打小就很酷 2024-09-12 01:40:29

您使用哪种正则表达式?如果它支持所有格量​​词,那么还有一种更好的选择:

\[\[(?:[^\]]++|\](?!\]))*+\]\]

[^\]]++ 吞噬除 ] 之外的任何字符,并且不费心保存状态使回溯成为可能的信息。如果它确实看到 ],它会执行向前查找以查看是否还有另一个。将整个事物包装在另一个所有格量词中意味着它只会在看到 ] 时进行前瞻,并且仅回溯一次:当它找到结束的 ]] 时。

Java、JGSoft、PCRE (PHP)、Oniguruma (Ruby 1.9) 和 Perl 5.12 风格支持所有格量​​词。所有这些风格还支持原子组,可用于实现相同的效果:

\[\[(?>(?:(?>[^\]]+)|\](?!\]))*)\]\]

.NET 风格支持原子组,但不支持所有格量​​词。

Which regex flavor are you using? If it's one that supports possessive quantifiers, there's a much better alternative:

\[\[(?:[^\]]++|\](?!\]))*+\]\]

[^\]]++ gobbles up any characters other than ] and doesn't bother saving the state information that would make backtracking possible. If it does see a ], it performs a lookahead to see if there's another. Wrapping the whole thing in another possessive quantifier means it only does a lookahead whenever it sees a ], and it only backtracks once: when it finds the closing ]].

Possessive quantifiers are supported by the Java, JGSoft, PCRE (PHP), Oniguruma (Ruby 1.9), and Perl 5.12 flavors. All those flavors also support atomic groups, which can be used to achieve the same effect:

\[\[(?>(?:(?>[^\]]+)|\](?!\]))*)\]\]

The .NET flavor supports atomic groups but not possessive quantifiers.

剪不断理还乱 2024-09-12 01:40:29

我认为最好使用非贪婪限定符。您确定您读到的文章不是在说“小心贪婪匹配”吗?

I would think it is better to use the non-greedy qualifier. Are you sure that the article you read wasn't saying "be careful with greedy matching?"

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文