包含有序交替的正则表达式可以重写为仅使用无序交替吗?

发布于 2024-11-25 13:10:49 字数 633 浏览 3 评论 0原文

假设我有一种正则表达式语言,支持文字、正负字符类、有序交替、贪婪量词 ?*+,并且非贪婪量词 ??*?+?。 (这本质上是 PCRE 的一个子集,没有反向引用、环视断言或其他一些更奇特的位。)用无序交替替换有序交替是否会降低这种形式主义的表达能力?

(无序交替——有时也称为“无序选择”——满足 L(S|T) = L(S) + L(T),而有序交替满足 L(S|T) = L (S) + (L(T) - { a in L(T) : a extends some b in L(S) }) 具体来说,模式 a|aa 将匹配字符串 。 >a 和如果交替是无序的,则aa,但如果交替是有序的,则只有a。)

换句话说,给定一个包含有序交替的模式S,该模式可以重写为一个不包含有序交替的等效模式 T(但可能是无序交替)?

如果这个问题已在文献中考虑过,我将不胜感激任何人都可以提供的参考资料。我几乎没有发现任何关于扩展正则表达式形式主义的表达能力的理论著作(除了关于反向引用如何将你从常规语言转移到上下文无关语法的常见问题之外)。

Suppose I have a regex language supporting literals, positive and negative character classes, ordered alternation, the greedy quantifiers ?, *, and +, and the nongreedy quantifiers ??, *?, and +?. (This is essentially a subset of PCRE without backreferences, look-around assertions, or some of the other fancier bits.) Does replacing ordered alternation with unordered alternation decrease the expressive power of this formalism?

(Unordered alternation---also sometimes called "unordered choice"---is such that L(S|T) = L(S) + L(T), while ordered alternation is such that L(S|T) = L(S) + (L(T) - { a in L(T) : a extends some b in L(S) }). Concretely, the pattern a|aa would match the strings a and aa if the alternation is unordered, but only a if the alternation is ordered.)

Put another way, given a pattern S containing an ordered alternation, can that pattern be rewritten to an equivalent pattern T which contains no ordered alternations (but possibly unordered alternations instead)?

If this question has been considered in the literature, I'd appreciate any references which anyone can provide. I was able to turn up almost no theoretical work on the expressive power of extended regex formalisms (beyond the usual things about how backreferences move you from regular languages to context-free grammars).

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

腹黑女流氓 2024-12-02 13:10:49

http://swtch.com/~rsc/regexp/regexp3.html [ “正则表达式是否匹配字符串的子字符串?如果是,在哪里?”]有必要在“DFA”中引入优先级的概念(我怀疑,您需要阅读整个系列才能理解,但是“DFA” ”所讨论的问题是从 NFA 图“动态”扩展而来)以处理有序交替。虽然这只是对权威的呼吁,而不是证明,但我认为可以公平地说,如果 russ cox 不能做到这一点(将有序交替表示为纯粹的 DFA),那么没有人知道如何做到。

in http://swtch.com/~rsc/regexp/regexp3.html [section "Does the regexp match a substring of the string? If so, where?"] it's necessary to introduce the idea of priorities within the "DFA" (you need to read the entire series to understand, i suspect, but the "DFA" in question is expanded from the NFA graph "on the fly") to handle ordered alternations. while this is only an appeal to authority, and not a proof, i think it's fair to say that if russ cox can't do it (express ordered alternations as a pure DFA), then no-one knows how to.

绾颜 2024-12-02 13:10:49

我没有检查任何文献,但我认为你可以为有序交替构建一个 DFA,从而证明它不会通过以下方式添加任何表达能力:

  1. 假设我们有正则表达式 x||y< /strong> 其中 xy 是正则表达式,|| 表示无序交替。如果是这样,我们可以构建接受 xy 的 DFA。我们将标记 DFA_xDFA_y
  2. 我们将通过连接 DFA_x 分阶段构建 x||y 的 DFA和 DFA_y
  3. 对于 DFA_x 中对应于某个字符串 a 的每个路径(路径是指图形意义上的路径,无需遍历和边缘两次,因此aDFA_"a*" 中的路径,但 aa 不是)...
    • 对于字母表 s 中的每个符号
      • 如果 DFA_y 消耗 as(即如果在 as 上运行)DFA_y 不会提前停止,但可能会不一定接受),并且 DFA_x 不接受,并且 DFA_x 不接受 as 的任何前缀,创建从状态 DFA_x< 的转换/strong> 消耗后结束a 到消耗asDFA_y结束的状态


  4. 最终 DFA 的接受状态是两个输入 DFA 的所有接受状态。起始状态是DFA_x的起始状态。

直观地说,它的作用是在输出 DFA 中创建两个区域。其中一个对应于交替的第一个参数,另一个对应于第二个参数。只要交替的第一个参数可能匹配,我们就留在第一部分。当遇到一个确定第一个参数不匹配的符号时,如果可能的话,我们此时会切换到第二部分。如果此方法有误,请评论。

I haven't checked any literature but I think you can construct a DFA for the ordered alternation and thus prove that it doesn't add any expressive power in the following way:

  1. Let's say we have the regex x||y where x and y are regexen and || means the unordered alternation. If so we can construct DFA's accepting x and y. We will mark those DFA_x and DFA_y
  2. We will construct the DFA for x||y in phases by connecting DFA_x and DFA_y
  3. For every path in DFA_x corresponding to some string a (by path I mean a path in the graph sense without traversing and edge twice so a is a path in DFA_"a*" but aa is not)...
    • For every symbol in the alphabet s
      • If DFA_y consumes as (that is if run on as DFA_y will not stop early but it may not necessarily accept) and DFA_x does not and DFA_x doesn't accept any prefix of as create a transition from the state DFA_x ends in after consuming a to the state DFA_y ends in after consuming as
  4. The accepting states of the final DFA are all the accepting states of both the input DFA's. The starting state is the starting state of DFA_x.

Intuitively what this does is it creates two regions in the output DFA. One of them corresponds to the first argument of the alternation and the other to the second. As long as it's possible that the first argument of the alternation will match we stay in the first part. When a symbol is encountered which makes it certain that the first argument won't match we switch to the second part if possible at this point. Please comment if this approach is wrong.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文