当前位置：文江博客话题详情

检测正则表达式是否为指数

发布于 2024-09-12 10:13:14 字数 227 浏览 6 评论 0原文

这篇文章表明，有一些正则表达式在回溯时是 O(2^n) 。示例为(x+x+)+y。当尝试匹配像 xxxx...p 这样的字符串时，它会回溯一段时间，然后才发现它无法匹配。

有没有办法检测这样的正则表达式？

谢谢

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

春风十里 2024-09-19 10:13:14

如果您的正则表达式引擎公开了 (x+x+)+y 的运行时指数行为，那么它就会被破坏，因为 DFA 或 NFA 可以在线性时间内识别此模式：

echo "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" | egrep "(x+x+)+y"
echo "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxy" | egrep "(x+x+)+y"

两者都会立即回答。

事实上，只有少数情况（例如反向引用）确实需要回溯（主要是因为带有反向引用的正则表达式不再是语言理论意义上的正则表达式）。只有在给出这些极端情况时，一个有能力的实现才应该切换到回溯。

公平地说，DFA 也有阴暗面，因为某些正则表达式有指数大小要求，但大小约束比时间约束更容易执行，而且巨大的 DFA 在输入上线性运行，因此它比小回溯器阻塞更划算在几个 X 上。

您应该真正阅读 Russ Cox 关于 regexp 的实现（以及回溯的病态行为）的优秀文章系列：http://swtch.com/~rsc/regexp/

要回答您有关可判定性的问题：您不能。因为 regexpr 没有唯一回溯。每个实现都有自己的策略来处理某些情况下算法的指数增长，但不涵盖其他情况。一条规则可能在这里适合，但在那里却是灾难性的。

更新：

例如，一种实现可能包含一个优化器，该优化器可以在执行正则表达式之前使用代数转换来简化正则表达式：(x+x+)+y 与 xxx*y 相同>，这对于任何回溯者来说都不应该是问题。但同一个优化器无法识别下一个表达式，问题又出现了。这里有人描述了如何制作一个欺骗 Perl 优化器的正则表达式：

http://perlgeek.de/blog-en/perl-tips/in-search-of-an-exponetial-regexp.html

If your regexp engine exposes runtime exponential behavior for (x+x+)+y ,then it is broken because a DFA or NFA can recognize this pattern in linear time:

echo "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" | egrep "(x+x+)+y"
echo "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxy" | egrep "(x+x+)+y"

both answer immediately.

In fact, there are only a few cases (like backreferences) where backtracking is really needed (mainly, because a regexp with a backreference is not a regular expression in the language theoretic sense anymore). A capable implementation should switch to backtracking only when these corner cases are given.

In fairness, DFA's have a dark side too, because some regexp's have exponential size requirements, but a size contraints is easier to enforce than a time constraint and the huge DFA runs linear on the input, so it's a better bargain than a small backtracker choking on a couple of X's.

You should really read Russ Cox excellent article series about the implementation of regexp (and the pathological behavior of backtracking): http://swtch.com/~rsc/regexp/

To answer your question about decidability: You can't. Because there is not the one backtracking for regexpr. Every implementation has its own strategies to deal with exponential growth in their algorithm for certain cases and does not cover others. One rule might be fit for here and catastrophic for there.

UPDATE:

For example, one implementation could contain an optimizer which could use algebraic transformations to simplify regexps before executing them: (x+x+)+y is the same a xxx*y, which shouldn't be a problem for any backtracker. But the same optimizer wouldn't recognize the next expression and the problem is there again. Here someone described how to craft a regexpr which fools Perl's optimizer:

http://perlgeek.de/blog-en/perl-tips/in-search-of-an-exponetial-regexp.html

回复收藏 0 原文

谜泪 2024-09-19 10:13:14

不，我不这么认为，但您可以使用以下准则：

如果它包含两个在高端开放的量词并且它们是嵌套的，那么它可能 O(2^n) 。
如果它不包含两个这样的量词，那么我认为它不可能是 O(2^n)。

可能导致这种情况的量词有：*、+ 和 {k,}。

另请注意，评估正则表达式的最坏情况复杂性可能与典型字符串的复杂性有很大不同，并且复杂性取决于特定的正则表达式引擎。

回复收藏 0 原文

岁月静好 2024-09-19 10:13:14

任何没有反向引用的正则表达式都可以在线性时间内匹配，尽管现实世界中的许多正则表达式引擎不会这样做（至少插入编程语言运行时环境的许多正则表达式引擎支持反向引用，并且不切换）当不存在反向引用时更有效的执行模型）。

没有简单的方法可以找出带有反向引用的正则表达式将消耗多少时间。

回复收藏 0 原文

夏日浅笑〃 2024-09-19 10:13:14

您可以使用正则表达式解析器检测并拒绝嵌套重复，该解析器对应于 1 的星号高度。我刚刚编写了使用正则表达式解析器计算和拒绝>1的起始高度的模块来自 npm。

$ node safe.js '(x+x+)+y'
false
$ node safe.js '(beep|boop)*'
true
$ node safe.js '(a+){10}'
false
$ node safe.js '\blocation\s*:[^:\n]+\b(Oakland|San Francisco)\b'
true

You could detect and reject nested repetitions using a regex parser, which corresponds to a star height of 1. I've just written a module to compute and reject start heights of >1 using a regex parser from npm.

$ node safe.js '(x+x+)+y'
false
$ node safe.js '(beep|boop)*'
true
$ node safe.js '(a+){10}'
false
$ node safe.js '\blocation\s*:[^:\n]+\b(Oakland|San Francisco)\b'
true

回复收藏 0 原文

~没有更多了~