检测正则表达式是否为指数

发布于 2024-09-12 10:13:14 字数 227 浏览 6 评论 0原文

这篇文章表明,有一些正则表达式在回溯时是 O(2^n) 。 示例为(x+x+)+y。 当尝试匹配像 xxxx...p 这样的字符串时,它会回溯一段时间,然后才发现它无法匹配。

有没有办法检测这样的正则表达式?

谢谢

This article show that there is some regexp that is O(2^n) when backtracking.
The example is (x+x+)+y.
When attempt to match a string like xxxx...p it going to backtrack for a while before figure it out that it couldn't match.

Is there a way to detect such regexp?

thanks

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

春风十里 2024-09-19 10:13:14

如果您的正则表达式引擎公开了 (x+x+)+y 的运行时指数行为,那么它就会被破坏,因为 DFA 或 NFA 可以在线性时间内识别此模式:

echo "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" | egrep "(x+x+)+y"
echo "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxy" | egrep "(x+x+)+y"

两者都会立即回答。

事实上,只有少数情况(例如反向引用)确实需要回溯(主要是因为带有反向引用的正则表达式不再是语言理论意义上的正则表达式)。只有在给出这些极端情况时,一个有能力的实现才应该切换到回溯。

公平地说,DFA 也有阴暗面,因为某些正则表达式有指数大小要求,但大小约束比时间约束更容易执行,而且巨大的 DFA 在输入上线性运行,因此它比小回溯器阻塞更划算在几个 X 上。

您应该真正阅读 Russ Cox 关于 regexp 的实现(以及回溯的病态行为)的优秀文章系列:http://swtch.com/~rsc/regexp/

要回答您有关可判定性的问题:您不能。因为 regexpr 没有唯一回溯。每个实现都有自己的策略来处理某些情况下算法的指数增长,但不涵盖其他情况。一条规则可能在这里适合,但在那里却是灾难性的。

更新:

例如,一种实现可能包含一个优化器,该优化器可以在执行正则表达式之前使用代数转换来简化正则表达式:(x+x+)+yxxx*y 相同>,这对于任何回溯者来说都不应该是问题。但同一个优化器无法识别下一个表达式,问题又出现了。这里有人描述了如何制作一个欺骗 Perl 优化器的正则表达式:

http://perlgeek.de/blog-en/perl-tips/in-search-of-an-exponetial-regexp.html

If your regexp engine exposes runtime exponential behavior for (x+x+)+y ,then it is broken because a DFA or NFA can recognize this pattern in linear time:

echo "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" | egrep "(x+x+)+y"
echo "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxy" | egrep "(x+x+)+y"

both answer immediately.

In fact, there are only a few cases (like backreferences) where backtracking is really needed (mainly, because a regexp with a backreference is not a regular expression in the language theoretic sense anymore). A capable implementation should switch to backtracking only when these corner cases are given.

In fairness, DFA's have a dark side too, because some regexp's have exponential size requirements, but a size contraints is easier to enforce than a time constraint and the huge DFA runs linear on the input, so it's a better bargain than a small backtracker choking on a couple of X's.

You should really read Russ Cox excellent article series about the implementation of regexp (and the pathological behavior of backtracking): http://swtch.com/~rsc/regexp/

To answer your question about decidability: You can't. Because there is not the one backtracking for regexpr. Every implementation has its own strategies to deal with exponential growth in their algorithm for certain cases and does not cover others. One rule might be fit for here and catastrophic for there.

UPDATE:

For example, one implementation could contain an optimizer which could use algebraic transformations to simplify regexps before executing them: (x+x+)+y is the same a xxx*y, which shouldn't be a problem for any backtracker. But the same optimizer wouldn't recognize the next expression and the problem is there again. Here someone described how to craft a regexpr which fools Perl's optimizer:

http://perlgeek.de/blog-en/perl-tips/in-search-of-an-exponetial-regexp.html

谜泪 2024-09-19 10:13:14

不,我不这么认为,但您可以使用以下准则:

  • 如果它包含两个在高端开放的量词并且它们是嵌套的,那么它可能 O(2^n) 。
  • 如果它不包含两个这样的量词,那么我认为它不可能是 O(2^n)。

可能导致这种情况的量词有:*+{k,}

另请注意,评估正则表达式的最坏情况复杂性可能与典型字符串的复杂性有很大不同,并且复杂性取决于特定的正则表达式引擎。

No I don't think so, but you can use these guidelines:

  • If it contains two quantifiers that are open-ended at the high end and they are nested then it might be O(2^n).
  • If it does not contain two such quantifiers then I think it cannot be O(2^n).

Quantifiers that can cause this are: *, + and {k,}.

Also note that the worst case complexity of evaluating a regular expression might be very different from the complexity on typical strings and that the complexity depends on the specific regular expression engine.

岁月静好 2024-09-19 10:13:14

任何没有反向引用的正则表达式都可以在线性时间内匹配,尽管现实世界中的许多正则表达式引擎不会这样做(至少插入编程语言运行时环境的许多正则表达式引擎支持反向引用,并且不切换)当不存在反向引用时更有效的执行模型)。

没有简单的方法可以找出带有反向引用的正则表达式将消耗多少时间。

Any regex without backreferences can be matched in linear time, though many regex engines out there in the real world don't do it that way (at least many regex engines that are plugged into programming language runtime environments support backreferences, and don't switch to a more efficient execution model when no backreferences are present).

There's no easy way to find out how much time a regex with backreferences is going to consume.

夏日浅笑〃 2024-09-19 10:13:14

您可以使用正则表达式解析器检测并拒绝嵌套重复,该解析器对应于 1 的星号高度。我刚刚编写了使用正则表达式解析器计算和拒绝>1的起始高度的模块来自 npm。

$ node safe.js '(x+x+)+y'
false
$ node safe.js '(beep|boop)*'
true
$ node safe.js '(a+){10}'
false
$ node safe.js '\blocation\s*:[^:\n]+\b(Oakland|San Francisco)\b'
true

You could detect and reject nested repetitions using a regex parser, which corresponds to a star height of 1. I've just written a module to compute and reject start heights of >1 using a regex parser from npm.

$ node safe.js '(x+x+)+y'
false
$ node safe.js '(beep|boop)*'
true
$ node safe.js '(a+){10}'
false
$ node safe.js '\blocation\s*:[^:\n]+\b(Oakland|San Francisco)\b'
true
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文