非贪婪模式表达

发布于 2024-12-11 18:37:00 字数 1663 浏览 0 评论 0原文

我一直在阅读 Friedl 的《掌握正则表达式》,并尝试为由单词分隔的字符串设计一个常见的非贪婪模式表达式。 从基础开始,其中分隔单词只是单个字符“a”,表达式:

sed -r 's/([^a]*)(a)/\                                                                  
(1)\1(2)\2(ALL)&(END)/g' <<<"xaxxaxxxaxxx...aa..."

(1)x(2)a(ALL)xa(END)
(1)xx(2)a(ALL)xxa(END)
(1)xxx(2)a(ALL)xxxa(END)
(1)xxx...(2)a(ALL)xxx...a(END)
(1)(2)a(ALL)a(END)...

从中,模式(参考 Friedl)可能是:

  • [ 正常* 闭合 ]

转向真正的多字符“ab”分隔符:

sed -r 's/([^a]*)((a[^b]*)*)(ab)/\                          
(1)\1(2)\2(3)\3(4)\4(ALL)&(END)/g' <<<"xabxxabxxxabxxx...abxxx...aabxxx...axxx...aaabxaabaxabaxaxabxaxaabxxaaabaaxxab..."

(1)x(2)(3)(4)ab(ALL)xab(END)
(1)xx(2)(3)(4)ab(ALL)xxab(END)
(1)xxx(2)(3)(4)ab(ALL)xxxab(END)
(1)xxx...(2)(3)(4)ab(ALL)xxx...ab(END)
(1)xxx...(2)a(3)a(4)ab(ALL)xxx...aab(END)
(1)xxx...(2)axxx...aa(3)axxx...aa(4)ab(ALL)xxx...axxx...aaab(END)
(1)x(2)a(3)a(4)ab(ALL)xaab(END)
(1)(2)ax(3)ax(4)ab(ALL)axab(END)
(1)(2)axax(3)axax(4)ab(ALL)axaxab(END)
(1)x(2)axa(3)axa(4)ab(ALL)xaxaab(END)
(1)xx(2)aa(3)aa(4)ab(ALL)xxaaab(END)
(1)(2)aaxx(3)aaxx(4)ab(ALL)aaxxab(END)...

其中的模式可能是:

  • [ 正常* (特殊*)* 结束< /em> ]

对于后续'abc' 分隔符特殊表达式可以扩展为:

(a[^b]*)*(ab[^c]*)*
  1. 这是正确的吗?
  2. 可以证明吗?
  3. 特殊表达式可以简化吗?
  4. 有没有更好/更有效的表达方式?注意我没有使用 Perl 的非贪婪 '*?'操作员并避免交替。
  5. 我在哪里可以找到此类问题的参考资料(弗里德尔提到过,但没有发布解决方案)。

I've been reading "Mastering Regular Expressions" by Friedl and trying to devise a common non-greedy pattern expression for a string that is delimited by a word.
Starting from basics where the delimited word is just a single character 'a' the expression:

sed -r 's/([^a]*)(a)/\                                                                  
(1)\1(2)\2(ALL)&(END)/g' <<<"xaxxaxxxaxxx...aa..."

(1)x(2)a(ALL)xa(END)
(1)xx(2)a(ALL)xxa(END)
(1)xxx(2)a(ALL)xxxa(END)
(1)xxx...(2)a(ALL)xxx...a(END)
(1)(2)a(ALL)a(END)...

from which the pattern (with reference to Friedl) might be:

  • [ normal* closing ]

Moving on to a real multi-character 'ab' delimiter:

sed -r 's/([^a]*)((a[^b]*)*)(ab)/\                          
(1)\1(2)\2(3)\3(4)\4(ALL)&(END)/g' <<<"xabxxabxxxabxxx...abxxx...aabxxx...axxx...aaabxaabaxabaxaxabxaxaabxxaaabaaxxab..."

(1)x(2)(3)(4)ab(ALL)xab(END)
(1)xx(2)(3)(4)ab(ALL)xxab(END)
(1)xxx(2)(3)(4)ab(ALL)xxxab(END)
(1)xxx...(2)(3)(4)ab(ALL)xxx...ab(END)
(1)xxx...(2)a(3)a(4)ab(ALL)xxx...aab(END)
(1)xxx...(2)axxx...aa(3)axxx...aa(4)ab(ALL)xxx...axxx...aaab(END)
(1)x(2)a(3)a(4)ab(ALL)xaab(END)
(1)(2)ax(3)ax(4)ab(ALL)axab(END)
(1)(2)axax(3)axax(4)ab(ALL)axaxab(END)
(1)x(2)axa(3)axa(4)ab(ALL)xaxaab(END)
(1)xx(2)aa(3)aa(4)ab(ALL)xxaaab(END)
(1)(2)aaxx(3)aaxx(4)ab(ALL)aaxxab(END)...

from which the pattern might be:

  • [ normal* (special*)* closing ]

For the subsequent 'abc' delimiter the special expression can be extended to:

(a[^b]*)*(ab[^c]*)*
  1. Is this correct?
  2. Can it be proved?
  3. Can the special expression be simplified?
  4. Are there better/more efficient expressions for this? n.b. I'm not using perl's non-greedy '*?' operator and avoiding alternation.
  5. Where might I find reference material to this type of problem (Friedl alluded but stopped short of a published solution).

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

命比纸薄 2024-12-18 18:37:00
  1. 是的,看起来是正确的。
  2. 您想了解有限自动机——非确定性 (NFA) 和确定性 (DFA)。简单的正则表达式系统本质上是有限自动机的一种方便的表示法。任何一本关于编译器的好书都会有一章介绍 NFA 和 DFA。
  3. 可能没有,或者说不多。你的话越长,你必须允许的回溯就越多。
  1. Yes, it looks correct.
  2. You want to read about finite automata — nondeterministic (NFA) and deterministic (DFA). Simple regexp systems are essentially a handy notation for finite automata. Any good book on compilers will have a chapter covering NFA and DFA.
  3. Probably not, or not much. The longer your word, the more backtracks you have to allow for.
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文