Mutually exclusive regular expressions
If I have a list of regular expressions, is there an easy way to determine that no two of them will both return a match for the same string?
That is, the list is valid if and only if, for every string, at most one item in the list matches the entire string.
It seems like this will be very hard (maybe impossible?) to prove definitively, but I can't seem to find any work on the subject.
The reason I ask is that I am working on a tokenizer that accepts regexes, and I would like to ensure only one token at a time can match the head of the input.
Comments (3)
If you're working with pure regular expressions (no backreferences or other features that cause them to recognize context-free or more complicated languages), what you ask is possible.
What you can do is convert each regex to a DFA, then (since regular languages are closed under intersection) combine the two into a DFA that recognizes the intersection of the two languages. If that DFA has a path from the start state to an accepting state, the string spelled out along that path is accepted by both input regexes, so the two overlap. If no such path exists, the two regexes are mutually exclusive.
The problem with this is that the usual regex->DFA algorithm first converts the regex to an NFA and then converts the NFA to a DFA. That last step can cause an exponential blowup in the number of DFA states in the worst case, so this may only be feasible for fairly simple regular expressions.
If you are working with extended regex syntax, all bets are off: context-free languages are not closed under intersection, so this method won't work.
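The intersection check described in this answer can be sketched in Python as below. This is only a minimal illustration under stated assumptions: the DFAs are written by hand rather than derived from regexes, the tuple layout `(start, accepting, transitions)` and the name `find_common_string` are hypothetical, and the alphabet is deliberately tiny. The product DFA is never built explicitly; the search simply walks pairs of states breadth-first and stops at the first pair in which both sides accept.

```python
from collections import deque

# Assumed representation: a DFA is (start, accepting, transitions), where
# transitions maps (state, symbol) -> state. A missing entry means "reject".

def find_common_string(dfa_a, dfa_b, alphabet):
    """Return a shortest string accepted by both DFAs, or None if there is none."""
    start_a, accept_a, delta_a = dfa_a
    start_b, accept_b, delta_b = dfa_b
    start = (start_a, start_b)
    queue = deque([(start, "")])
    seen = {start}
    while queue:
        (a, b), word = queue.popleft()
        # A product state is accepting only when both components accept.
        if a in accept_a and b in accept_b:
            return word
        for ch in alphabet:
            next_a = delta_a.get((a, ch))
            next_b = delta_b.get((b, ch))
            if next_a is None or next_b is None:
                continue  # at least one DFA rejects this symbol here
            pair = (next_a, next_b)
            if pair not in seen:
                seen.add(pair)
                queue.append((pair, word + ch))
    return None

DIGITS = "0123456789"
LETTERS = "ab"  # small stand-in for a letter class, to keep the example short
ALPHABET = DIGITS + LETTERS

# Hand-built DFAs for [0-9], [0-9]+ and [ab]+ (state 0 = start, state 1 = accept).
one_digit = (0, {1}, {(0, d): 1 for d in DIGITS})
digits = (0, {1}, {(s, d): 1 for s in (0, 1) for d in DIGITS})
letters = (0, {1}, {(s, c): 1 for s in (0, 1) for c in LETTERS})

print(find_common_string(one_digit, digits, ALPHABET))  # a witness string: the two overlap
print(find_common_string(digits, letters, ALPHABET))    # None: no common string
```

For a list of regexes, the same check would be run on every pair. A real implementation would also need a regex-to-DFA step, which is where the state blowup mentioned above can occur.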
The Wikipedia article on regular expressions does state:
It is possible to write an algorithm which for two given regular expressions decides whether the described languages are essentially equal, reduces each expression to a minimal deterministic finite state machine, and determines whether they are isomorphic (equivalent).
but gives no further hints.
Of course the easy way you are after is to run a lot of tests -- but we all know the shortcomings of testing as a method of proof.
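As a rough illustration of the testing approach, the sketch below enumerates every string up to a given length over a small alphabet and reports the first one that more than one pattern matches in full. The function name `find_overlap`, the alphabet, and the length bound are assumptions chosen for the example. As noted above, a clean run proves nothing: an overlap may exist only for longer strings or for characters outside the chosen alphabet.

```python
import re
from itertools import product

def find_overlap(patterns, alphabet, max_len):
    """Return (string, matching_patterns) for the first overlap found, else None."""
    compiled = [re.compile(p) for p in patterns]
    for length in range(max_len + 1):
        for chars in product(alphabet, repeat=length):
            candidate = "".join(chars)
            # Collect every pattern that matches the whole candidate string.
            hits = [p.pattern for p in compiled if p.fullmatch(candidate)]
            if len(hits) > 1:
                return candidate, hits
    return None

print(find_overlap([r"[0-9]+", r"[a-z]+"], "01ab", 4))  # no overlap within the bound
print(find_overlap([r"[0-9]", r"[0-9]+"], "01ab", 4))   # reports an overlapping string
```

The number of candidates grows exponentially with `max_len`, so this is only practical as a quick sanity check.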
You can't do that by only looking at the regular expression.
Consider the case where you have [0-9] and [0-9]+. They are obviously different expressions, but when applied to the string "1", they both produce the same result. When applied to the string "11", they produce different results. The point is that a regular expression isn't enough information. The result depends both on the regex and the target string.
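The two cases in this answer can be reproduced with Python's `re` module. This is a small sketch: the helper name `full_matches` is made up for the example, and it uses whole-string matching, which is the validity criterion stated in the question.

```python
import re

def full_matches(patterns, text):
    """Return the patterns that match the entire text."""
    return [p for p in patterns if re.fullmatch(p, text)]

patterns = [r"[0-9]", r"[0-9]+"]
print(full_matches(patterns, "1"))   # both patterns match
print(full_matches(patterns, "11"))  # only [0-9]+ matches
```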