具有 O(N) 和反向引用支持的正则表达式

发布于 2024-09-30 10:10:39 字数 1902 浏览 2 评论 0原文

您可能知道,有两种不同类型的正则表达式实现:一种使用回溯 (pcre),另一种使用有限自动机 (re2)。

这两种算法都有其局限性:在特定情况下,PCRE 可能需要指数时间才能找到匹配项,并且有限自动机不支持反向引用。

Pcre 实现支持反向引用,在将 /a?a?a?a?aaaa/aaaa 等表达式匹配时效率很低,越是 a'表达式和输入有 - 需要的时间越长,如果有 30 多个,则需要很多时间。

具有有限自动机的版本可以很好地处理所有这些实现并且具有 O(N) 输入的复杂性,但不支持反向引用:

针对复杂表达式的 pcre 时间 - https://i.sstatic.net /D4gkC.png NFA 处理这些问题,但不支持反向引用 - https://i.sstatic.net/t2EwI.png

有关反向引用支持的一些信息:

RE2 - http://code.google.com/p/re2/

一个重要的例外是RE2 放弃了对反向引用和广义零宽度断言的支持,因为它们无法有效实现

汤普森 NFA - http://swtch.com/~rsc/regexp/regexp1.html

如前所述,没有人知道如何有效地使用反向引用实现正则表达式,尽管也没有人能证明这是不可能的。 (具体来说,这个问题是 NP 完全问题,这意味着如果有人确实找到了一个有效的实现,这对计算机科学家来说将是一个重大新闻,并将赢得一百万美元的奖金。)

所以我创建了自己的版本,它都支持反向引用,复杂度为 O(N)。它是用 haskell 编写的,大约 600 行(其中约 200 行是空白,约 200 行是类型声明,可以跳过)长。据我所知,它在大约 10 秒内咀嚼了 /a?a?aa/aa (包含 100 个 a) 版本

/a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?(a?a?a?a?a?a?a?a?a?a?aaaaaaaaaa)aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\1/

是唯一可以在正常(大约 10 秒)时间内进行匹配的

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

。它当然支持我在互联网上找到的基本正则表达式规范中列出的所有其他功能。

问题是:这真的是“计算机科学家的重大新闻”吗?如果是的话我该怎么办?

PS:我将在大约一周内显示源代码 - 我仍然想使用探查器运行一些测试并替换几个内部数据结构。

As you may know there are two different kinds of regular expressions implementations: one uses backtracking (pcre) and the other one uses finite automata (re2).

Both of those algorithms have their limitations: in specific cases pcre can take exponential time to find a match and finite automata does not support backreferences.

Pcre implementation supports backreferences, very inefficient in matching expressions like /a?a?a?a?aaaa/ against aaaa, the more a's expression and input have - the longer it will take and with 30+ of them it will take a lot if time.

Version with finite automata handles all those implementations nicely and have O(N)
complexity from input, but does not supports backreferences:

pcre time against complex expressions - https://i.sstatic.net/D4gkC.png
NFA handles those, but does not supports backreferences - https://i.sstatic.net/t2EwI.png

Some information on backreferences support:

RE2 - http://code.google.com/p/re2/

The one significant exception is that RE2 drops support for backreferences and generalized zero–width assertions, because they cannot be implemented efficiently.

Thompson NFA - http://swtch.com/~rsc/regexp/regexp1.html

As mentioned earlier, no one knows how to implement regular expressions with backreferences efficiently, though no one can prove that it's impossible either. (Specifically, the problem is NP–complete, meaning that if someone did find an efficient implementation, that would be major news to computer scientists and would win a million dollar prize.)

So I created my own version which both supports backreferences and has O(N) complexity. It written in haskell and about 600 (~200 of them are blank and ~200 - type declarations, which can be skipped) lines long. It chews through /a?a?aa/ against aa (with 100 of a) in about 10 seconds and as far as I know it is the only version which can match

/a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?(a?a?a?a?a?a?a?a?a?a?aaaaaaaaaa)aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\1/

against

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

in sane (about 10 seconds) time. It of course supports all other features listed in basic regex specification which I found somewhere on the Internets.

The question is: is it really a "major news to computer scientists" and what should I do if it is?

PS: I will show sources in about a week - I still want to run some tests with profiler and replace several internal data structures.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

不知所踪 2024-10-07 10:10:39

我相信你很困惑。所有正则表达式都可以用离散有限自动机 (DFA) 表示,并且(因此)可以在 O(n) 时间内求解。 Perl正则表达式(PREG)(以及许多语言提供的正则表达式库)匹配比正则表达式更大的语言,即:正则表达式存在于PREG中。

如果您想了解更多内容,请搜索常规语言。每个正则语言都可以用正则表达式表示(因此具有相似的名称),并且每个正则表达式代表一种正则语言。 PREG 可以表示非常规语言的事物。

此外,没有人喜欢有人说“我可以做到这一点,这太棒了,但我不会解释如何做到”。仅此一点就足以有理由不相信您(忽略您误解了正则表达式是什么)。

I believe you are confused. All regular expressions can be represented by a discrete finite automata (DFA) and (because of such) be solved in O(n) time. Perl Regular Expressions (PREG) (and the regex libraries provided by many languages) match a language that is larger than regular expressions, ie: regular expressions exist in PREG.

If you want to look more of this up search for regular languages. Every regular language can be represented by a regular expression (hence the similar names), and every regular expression represents a regular language. PREG can represent things that are not a regular language.

Further, no one likes someone who says "I can do this and it is amazing, but I won't explain how". That alone is reason enough not to believe you (ignoring that you misunderstand what a regular expression is).

呆萌少年 2024-10-07 10:10:39

您建议的测试用例不匹配。您没有包含足够的 a 来匹配反向引用(仅足以匹配反向引用之前)。如果我再添加 10 个 a 以便匹配,glibc 中的正则表达式匹配器会立即报告成功

$ echo aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa | sed -re \
    '/a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?(a?a?a?a?a?a?a?a?a?a?aaaaaaaaaa)aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\1/!d'
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

Your proposed testcase doesn't match. You are not including enough as to match the backreference (only enough to match up to just before the backreference). If I add 10 more as so that it matches, the regex matcher in glibc reports success instantly

$ echo aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa | sed -re \
    '/a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?(a?a?a?a?a?a?a?a?a?a?aaaaaaaaaa)aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\1/!d'
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
王权女流氓 2024-10-07 10:10:39

函数式语言似乎有一种有效实现正则表达式的技巧。我已经看过一篇用 Common-Lisp 编写的非常酷的文章: CL_PPCRE

如果你能证明 O (n) 这可能是一个有趣的结果,但你必须确保你确实具有线性时间复杂度,而不仅仅是非常有效。

Functional languages seem to have a knack for efficiently implementing regexps. I've already seen a very cool one written with Common-Lisp: CL_PPCRE

If you can prove the O(n) this might be a interesting result, but you have to be sure that you really have linear time complexity and not being just very effective.

貪欢 2024-10-07 10:10:39

您可以查看此链接上的文章,似乎其他人已经完成了:

正则表达式匹配可以简单快捷
(但在 Java、Perl、PHP、Python、Ruby 等中速度很慢)

You might check out the article at this link, seems this has already be done by someone else:

Regular Expression Matching Can Be Simple And Fast
(but is slow in Java, Perl, PHP, Python, Ruby, ...)

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文