当前位置：文江博客话题详情

具有 O(N) 和反向引用支持的正则表达式

发布于 2024-09-30 10:10:39 字数 1902 浏览 8 评论 0原文

您可能知道，有两种不同类型的正则表达式实现：一种使用回溯 (pcre)，另一种使用有限自动机 (re2)。

这两种算法都有其局限性：在特定情况下，PCRE 可能需要指数时间才能找到匹配项，并且有限自动机不支持反向引用。

Pcre 实现支持反向引用，在将 /a?a?a?a?aaaa/ 与 aaaa 等表达式匹配时效率很低，越是 a'表达式和输入有 - 需要的时间越长，如果有 30 多个，则需要很多时间。

具有有限自动机的版本可以很好地处理所有这些实现并且具有 O(N) 输入的复杂性，但不支持反向引用：

针对复杂表达式的 pcre 时间 - https://i.sstatic.net /D4gkC.png NFA 处理这些问题，但不支持反向引用 - https://i.sstatic.net/t2EwI.png

有关反向引用支持的一些信息：

RE2 - http://code.google.com/p/re2/

一个重要的例外是RE2 放弃了对反向引用和广义零宽度断言的支持，因为它们无法有效实现。

汤普森 NFA - http://swtch.com/~rsc/regexp/regexp1.html

如前所述，没有人知道如何有效地使用反向引用实现正则表达式，尽管也没有人能证明这是不可能的。（具体来说，这个问题是 NP 完全问题，这意味着如果有人确实找到了一个有效的实现，这对计算机科学家来说将是一个重大新闻，并将赢得一百万美元的奖金。）

所以我创建了自己的版本，它都支持反向引用，复杂度为 O(N)。它是用 haskell 编写的，大约 600 行（其中约 200 行是空白，约 200 行是类型声明，可以跳过）长。据我所知，它在大约 10 秒内咀嚼了 /a?a?aa/ 和 aa （包含 100 个 a）版本

/a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?(a?a?a?a?a?a?a?a?a?a?aaaaaaaaaa)aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\1/

是唯一可以在正常（大约 10 秒）时间内进行匹配的

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

。它当然支持我在互联网上找到的基本正则表达式规范中列出的所有其他功能。

问题是：这真的是“计算机科学家的重大新闻”吗？如果是的话我该怎么办？

PS：我将在大约一周内显示源代码 - 我仍然想使用探查器运行一些测试并替换几个内部数据结构。

原文

As you may know there are two different kinds of regular expressions implementations: one uses backtracking (pcre) and the other one uses finite automata (re2).

Both of those algorithms have their limitations: in specific cases pcre can take exponential time to find a match and finite automata does not support backreferences.

Pcre implementation supports backreferences, very inefficient in matching expressions like /a?a?a?a?aaaa/ against aaaa, the more a's expression and input have - the longer it will take and with 30+ of them it will take a lot if time.

Version with finite automata handles all those implementations nicely and have O(N)
complexity from input, but does not supports backreferences:

pcre time against complex expressions - https://i.sstatic.net/D4gkC.png
NFA handles those, but does not supports backreferences - https://i.sstatic.net/t2EwI.png

Some information on backreferences support:

RE2 - http://code.google.com/p/re2/

The one significant exception is that RE2 drops support for backreferences and generalized zero–width assertions, because they cannot be implemented efficiently.

Thompson NFA - http://swtch.com/~rsc/regexp/regexp1.html

As mentioned earlier, no one knows how to implement regular expressions with backreferences efficiently, though no one can prove that it's impossible either. (Specifically, the problem is NP–complete, meaning that if someone did find an efficient implementation, that would be major news to computer scientists and would win a million dollar prize.)

So I created my own version which both supports backreferences and has O(N) complexity. It written in haskell and about 600 (~200 of them are blank and ~200 - type declarations, which can be skipped) lines long. It chews through /a?a?aa/ against aa (with 100 of a) in about 10 seconds and as far as I know it is the only version which can match

/a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?(a?a?a?a?a?a?a?a?a?a?aaaaaaaaaa)aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\1/

against

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

in sane (about 10 seconds) time. It of course supports all other features listed in basic regex specification which I found somewhere on the Internets.

The question is: is it really a "major news to computer scientists" and what should I do if it is?

PS: I will show sources in about a week - I still want to run some tests with profiler and replace several internal data structures.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

不知所踪 2024-10-07 10:10:39

我相信你很困惑。所有正则表达式都可以用离散有限自动机 (DFA) 表示，并且（因此）可以在 O(n) 时间内求解。 Perl正则表达式（PREG）（以及许多语言提供的正则表达式库）匹配比正则表达式更大的语言，即：正则表达式存在于PREG中。

如果您想了解更多内容，请搜索常规语言。每个正则语言都可以用正则表达式表示（因此具有相似的名称），并且每个正则表达式代表一种正则语言。 PREG 可以表示非常规语言的事物。

此外，没有人喜欢有人说“我可以做到这一点，这太棒了，但我不会解释如何做到”。仅此一点就足以有理由不相信您（忽略您误解了正则表达式是什么）。

回复收藏 0 原文

呆萌少年 2024-10-07 10:10:39

您建议的测试用例不匹配。您没有包含足够的 a 来匹配反向引用（仅足以匹配反向引用之前）。如果我再添加 10 个 a 以便匹配，glibc 中的正则表达式匹配器会立即报告成功

$ echo aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa | sed -re \
    '/a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?(a?a?a?a?a?a?a?a?a?a?aaaaaaaaaa)aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\1/!d'
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

Your proposed testcase doesn't match. You are not including enough as to match the backreference (only enough to match up to just before the backreference). If I add 10 more as so that it matches, the regex matcher in glibc reports success instantly

$ echo aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa | sed -re \
    '/a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?(a?a?a?a?a?a?a?a?a?a?aaaaaaaaaa)aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\1/!d'
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

回复收藏 0 原文