包含有序交替的正则表达式可以重写为仅使用无序交替吗?
假设我有一种正则表达式语言,支持文字、正负字符类、有序交替、贪婪量词 ?
、*
和 +
,并且非贪婪量词 ??
、*?
和 +?
。 (这本质上是 PCRE 的一个子集,没有反向引用、环视断言或其他一些更奇特的位。)用无序交替替换有序交替是否会降低这种形式主义的表达能力?
(无序交替——有时也称为“无序选择”——满足 L(S|T) = L(S) + L(T),而有序交替满足 L(S|T) = L (S) + (L(T) - { a in L(T) : a extends some b in L(S) }) 具体来说,模式 a|aa
将匹配字符串 。 >a
和如果交替是无序的,则aa
,但如果交替是有序的,则只有a
。)
换句话说,给定一个包含有序交替的模式S,该模式可以重写为一个不包含有序交替的等效模式 T(但可能是无序交替)?
如果这个问题已在文献中考虑过,我将不胜感激任何人都可以提供的参考资料。我几乎没有发现任何关于扩展正则表达式形式主义的表达能力的理论著作(除了关于反向引用如何将你从常规语言转移到上下文无关语法的常见问题之外)。
Suppose I have a regex language supporting literals, positive and negative character classes, ordered alternation, the greedy quantifiers ?
, *
, and +
, and the nongreedy quantifiers ??
, *?
, and +?
. (This is essentially a subset of PCRE without backreferences, look-around assertions, or some of the other fancier bits.) Does replacing ordered alternation with unordered alternation decrease the expressive power of this formalism?
(Unordered alternation---also sometimes called "unordered choice"---is such that L(S|T) = L(S) + L(T), while ordered alternation is such that L(S|T) = L(S) + (L(T) - { a in L(T) : a extends some b in L(S) }). Concretely, the pattern a|aa
would match the strings a
and aa
if the alternation is unordered, but only a
if the alternation is ordered.)
Put another way, given a pattern S containing an ordered alternation, can that pattern be rewritten to an equivalent pattern T which contains no ordered alternations (but possibly unordered alternations instead)?
If this question has been considered in the literature, I'd appreciate any references which anyone can provide. I was able to turn up almost no theoretical work on the expressive power of extended regex formalisms (beyond the usual things about how backreferences move you from regular languages to context-free grammars).
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
data:image/s3,"s3://crabby-images/d5906/d59060df4059a6cc364216c4d63ceec29ef7fe66" alt="扫码二维码加入Web技术交流群"
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
在 http://swtch.com/~rsc/regexp/regexp3.html [ “正则表达式是否匹配字符串的子字符串?如果是,在哪里?”]有必要在“DFA”中引入优先级的概念(我怀疑,您需要阅读整个系列才能理解,但是“DFA” ”所讨论的问题是从 NFA 图“动态”扩展而来)以处理有序交替。虽然这只是对权威的呼吁,而不是证明,但我认为可以公平地说,如果 russ cox 不能做到这一点(将有序交替表示为纯粹的 DFA),那么没有人知道如何做到。
in http://swtch.com/~rsc/regexp/regexp3.html [section "Does the regexp match a substring of the string? If so, where?"] it's necessary to introduce the idea of priorities within the "DFA" (you need to read the entire series to understand, i suspect, but the "DFA" in question is expanded from the NFA graph "on the fly") to handle ordered alternations. while this is only an appeal to authority, and not a proof, i think it's fair to say that if russ cox can't do it (express ordered alternations as a pure DFA), then no-one knows how to.
我没有检查任何文献,但我认为你可以为有序交替构建一个 DFA,从而证明它不会通过以下方式添加任何表达能力:
直观地说,它的作用是在输出 DFA 中创建两个区域。其中一个对应于交替的第一个参数,另一个对应于第二个参数。只要交替的第一个参数可能匹配,我们就留在第一部分。当遇到一个确定第一个参数不匹配的符号时,如果可能的话,我们此时会切换到第二部分。如果此方法有误,请评论。
I haven't checked any literature but I think you can construct a DFA for the ordered alternation and thus prove that it doesn't add any expressive power in the following way:
Intuitively what this does is it creates two regions in the output DFA. One of them corresponds to the first argument of the alternation and the other to the second. As long as it's possible that the first argument of the alternation will match we stay in the first part. When a symbol is encountered which makes it certain that the first argument won't match we switch to the second part if possible at this point. Please comment if this approach is wrong.