平均正则表达式算法的时间复杂度是多少?
我对使用正则表达式并不陌生,并且我了解它们所基于的基本理论——有限状态机。
不过,我不太擅长算法分析,也不明白正则表达式与基本线性搜索相比如何。我这么问是因为从表面上看这似乎是一个线性数组搜索。 (如果正则表达式很简单。)
我可以在哪里了解有关实现正则表达式引擎的更多信息?
I'm not new to using regular expressions, and I understand the basic theory they're based on--finite state machines.
I'm not so good at algorithmic analysis though and don't understand how a regex compares to say, a basic linear search. I'm asking because on the surface it seems like a linear array search. (If the regex is simple.)
Where could I go to learn more about implementing a regex engine?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
这是最流行的大纲之一:正则表达式匹配可以简单而快速
。针对字符串运行 DFA 编译的正则表达式确实是 O(n),但最多可能需要 O(2^m) 构建时间/空间(其中 m = 正则表达式大小)。
This is one of the most popular outlines: Regular Expression Matching Can Be Simple And Fast
. Running a DFA-compiled regular expression against a string is indeed O(n), but can require up to O(2^m) construction time/space (where m = regular expression size).
您熟悉术语确定性/非确定性有限自动机吗?
真实正则表达式(当我说真实时,我指的是那些识别正则语言的正则表达式,而不是几乎所有编程语言都支持的正则表达式包含反向引用等)可以转换为 DFA/NFA,并且两者都可以用编程语言以机械方式实现(NFA 可以转换为 DFA)
你要做的是:
这样,给定一个正则表达式,您可以将其转换为 DFA 并运行它以查看它是否匹配指定的文本。
这可以在
O(n)
中实现,因为 DFA 不会向后移动(就像图灵机),因此它与字符串匹配或不匹配。那是假设你不会接受计数重叠的匹配,否则你将不得不返回并重新开始匹配......Are you familiar with the term Deterministic/Non-Deterministic Finite Automata?
Real regular expressions (when I say real I'm refering to those regex that recognize Regular Languages, and not the regex that almost every programming language include with backreferences, etc) can be converted into a DFA/NFA and both can be implemented in a mechanical way in a programming language (a NFA can be converted into a DFA)
What you have to do is:
That way, given a regex you can convert it to a DFA and run it to see if it matches or not a specified text.
This can be implemented in
O(n)
, because DFA don't go backward (like a Turing Machine), so it matches the string or not. That is supposing you won't take in count overlapped matches, otherwise you will have to go back and start matching again...经典的正则表达式可以通过一种在实践中速度很快但具有非常糟糕的最坏情况行为(标准 DFA)的方式来实现,或者以一种保证合理的最坏情况行为的方式(将其保留为 NFA)来实现。标准 DFA 可以扩展以支持大量额外的匹配字符和标志,这利用了它基本上是回溯搜索的事实。
标准方法的例子随处可见(例如内置于 Perl 中)。 http://code.google.com/p/re2 上有一个声称具有良好的最坏情况行为的示例/ - 事实上,在最坏的情况下,它甚至比我预期的还要好,所以他们可能发现了一两个额外的技巧。
如果您对此感兴趣,或者关心编写可以锁定给定病理输入的程序,请阅读 http://swtch.com/~rsc/regexp/regexp1.html。
The classic regular expression can be implemented in a way which is fast in practice but has really bad worst case behaviour (the standard DFA) or in a way which has guaranteed reasonable worst case behaviour (keeping it as an NFA). The standard DFA can be extended to support lots of extra matching characters and flags, which make use of the fact that it is basically back-tracking search.
Examples of the standard approach are everywhere (e.g. built into Perl). There is an example that claims good worst case behaviour at http://code.google.com/p/re2/ - in fact it is even better than I expected in the worst case, so they may have found an extra trick or two.
If you are at all interested in this, or care about writing programs that can be made to lock up solid given pathological inputs, read http://swtch.com/~rsc/regexp/regexp1.html.