正则表达式引擎如何解释不规则性?
我的C有点不稳定,但是我看过python的源代码,看起来python的大部分re
模块都是由状态机实现的。这并不奇怪,因为正则表达式可以简化为确定性有限状态机。
我想其他正则表达式的实现也是类似的。但是,根据教科书的定义,现代正则表达式实现是常规的(如果有的话)很少。那么他们如何解释不规则性,例如反向引用?
(.*)\1 // this is not regular
My C is a bit shaky, but I've looked at python's source code and it looks like most of python's re
module is implemented by state machines. This comes as no surprise since regular expressions can be reduced to deterministic finite state machines.
I imagine other regex implementations are similar. But few, if any, modern regex implementations are regular according to the textbook definition. Then how do they account for irregularity, like backreferences?
(.*)\1 // this is not regular
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
他们使用修改后的(超越有限状态)自动机类来解决这个问题,并且使用比 普通汤姆森算法。如果您找到任何特定“RE”引擎支持的自动机类的正式特征,那么您非常幸运。
根据我从 Python
re
源代码中得到的内容,它将组存储在缓冲区中(无论如何它都必须这样做,因为它必须将其作为匹配对象的一部分返回)并执行一个简单的字符串匹配算法中的匹配,消耗与组匹配缓冲区中的字符一样多的字符。[可选的咆哮:不幸的是,RE 引擎实际上是 NFA 之上的黑客攻击的集合,破坏了它们的数学属性。许多实现者忽略了正则语言的优雅代数及其对正则关系的强大扩展,这可以有效地由 FST 捕获。]
They use an amended (beyond finite state) automaton class to account for this, and more complicated matching algorithms than the vanilla Thomson algorithm. You're very lucking if you find a formal characterization of the automaton class that any particular "RE" engine supports.
From what I can make up from the Python
re
source code, it stores the group in a buffer (it has to anyway, since it must return this as part of the match object) and does a straightforward string match in the matching algorithm, consuming as many characters as there are in the group match buffer.[Optional rant: unfortunately, RE engines in practice are collections of hacks on top of NFAs that destroy their mathematical properties. Many implementers ignore the elegant algebra of regular languages and their powerful extension to regular relations, which can be efficiently captured by FSTs.]