编写更好的正则表达式以不使用惰性重复量词

发布于 2024-09-30 02:47:19 字数 184 浏览 2 评论 0原文

我有一个正则表达式：

(<select([^>]*>))(.*?)(</select\s*>)

由于它使用惰性重复量词，因此对于较长的字符串（选项超过 500），它会回溯超过 100,000 次并失败。请帮助我找到一个更好的不使用惰性重复量词的正则表达式

原文

I have a regular expression:

(<select([^>]*>))(.*?)(</select\s*>)

Since it uses lazy repeat quantifier, for longer strings(having options more than 500) it backtracks for more than 100,000 times and fails.
Please help me to find a better regular expression which doesn't use lazy repeat quantifier

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

依靠 2024-10-07 02:47:19

<select[^>]*>[^<]*(?:<(?!/select>)[^<]*)*</select>

...或以人类可读的形式：

<select[^>]*>    # start tag
[^<]*            # anything except opening bracket
(?:              # if you find an open bracket
  <(?!/select>)  #   match it if it's not part of end tag
  [^<]*          #   consume any more non-brackets
)*               # repeat as needed
</select>        # end tag

这是 Friedl 在他的书中开发的“展开循环”技术的示例，掌握正则表达式。我在 RegexBuddy 中使用基于不情愿量词的模式进行了快速测试：

(?s)<select[^>]*>.*?</select>

...大约需要 6,000 个步骤才能找到匹配项。 展开循环模式只需要 500 步。当我从结束标记 () 中删除右括号时，导致无法匹配，只需要 800 步即可报告失败。

如果您的正则表达式风格支持所有格量词，也可以继续使用它们：

<select[^>]*+>[^<]*+(?:<(?!/select>)[^<]*+)*+</select>

实现匹配需要大约相同数量的步骤，但在此过程中可以使用更少的内存。如果不可能匹配，它会更快失败；在我的测试中，大约需要 500 步，与找到匹配项所需的步数相同。

<select[^>]*>[^<]*(?:<(?!/select>)[^<]*)*</select>

...or in human-readable form:

<select[^>]*>    # start tag
[^<]*            # anything except opening bracket
(?:              # if you find an open bracket
  <(?!/select>)  #   match it if it's not part of end tag
  [^<]*          #   consume any more non-brackets
)*               # repeat as needed
</select>        # end tag

This is an example of the "unrolled loop" technique Friedl develops in his book, Mastering Regular Expressions. I did a quick test in RegexBuddy using a pattern based on reluctant quantifiers:

(?s)<select[^>]*>.*?</select>

...and it took about 6,000 steps to find a match. The unrolled-loop pattern took only 500 steps. And when I removed the closing bracket from the end tag (</select), making a match impossible, it required only 800 steps to report failure.

If your regex flavor supports possessive quantifiers, go ahead and use them, too:

<select[^>]*+>[^<]*+(?:<(?!/select>)[^<]*+)*+</select>

It takes about the same number of steps to achieve a match, but it can use a lot less memory in the process. And if no match is possible, it fails even more quickly; in my tests it took about 500 steps, the same number it took to find a match.

回复收藏 0 原文

梦毁影碎の 2024-10-07 02:47:19

不幸的是，这不起作用，请参阅 Alan Moore 的答案以获得正确的示例！

(<select([^>]*>))(.*+)(</select\s*>)

来自 perl regexp 联机帮助页：

默认情况下，当量化子模式不允许其余子模式时
为了匹配整体模式，Perl 将回溯。然而，这种行为
有时是不可取的。因此 Perl 提供了“所有格”
量词形式也是如此。

       *+     Match 0 or more times and give nothing back
       ++     Match 1 or more times and give nothing back
       ?+     Match 0 or 1 time and give nothing back
       {n}+   Match exactly n times and give nothing back (redundant)
       {n,}+  Match at least n times and give nothing back
       {n,m}+ Match at least n but not more than m times and give nothing back

例如，

      'aaaa' =~ /a++a/

永远不会匹配，因为“a++”会吞噬掉所有“a”
字符串，并且不会为模式的其余部分留下任何内容。这
这个特性对于给 Perl 提示它在哪里是非常有用的
不应该走回头路。例如，典型的“匹配双引号
当写成以下形式时，可以最有效地执行“字符串”问题：

      /"(?:[^"\\]++|\\.)*+"/

Unfortunately this wont work, see the answer by Alan Moore for a correct example!

(<select([^>]*>))(.*+)(</select\s*>)

From the perl regexp manpage:

By default, when a quantified subpattern does not allow the rest of the
overall pattern to match, Perl will backtrack. However, this behaviour
is sometimes undesirable. Thus Perl provides the "possessive"
quantifier form as well.

       *+     Match 0 or more times and give nothing back
       ++     Match 1 or more times and give nothing back
       ?+     Match 0 or 1 time and give nothing back
       {n}+   Match exactly n times and give nothing back (redundant)
       {n,}+  Match at least n times and give nothing back
       {n,m}+ Match at least n but not more than m times and give nothing back

For instance,

      'aaaa' =~ /a++a/

will never match, as the "a++" will gobble up all the "a"'s in the
string and won't leave any for the remaining part of the pattern. This
feature can be extremely useful to give perl hints about where it
shouldn't backtrack. For instance, the typical "match a double-quoted
string" problem can be most efficiently performed when written as: