Ruby 1.9 正则表达式与上下文无关语法同样强大吗？

发布于 2024-12-28 13:06:56 字数 1123 浏览 5 评论 0原文

我有这个正则表达式：

regex = %r{\A(?<foo> a\g<foo>a | b\g<foo>b | c)\Z}x

当我针对多个字符串测试它时，它似乎与上下文无关语法一样强大，因为它可以正确处理递归。

regex.match("aaacaaa")
# => #<MatchData "aaacaaa" foo:"aaacaaa">
regex.match("aacaa")
# => #<MatchData "aacaa" foo:"aacaa">
regex.match("aabcbaa")
# => #<MatchData "aabcbaa" foo:"aabcbaa">
regex.match("aaacaa")
# => nil

“Ruby 1.9 正则表达式的乐趣”有一个例子，他实际上排列了正则表达式的所有部分，使其看起来像上下文无关语法，如下所示：

sentence = %r{ 
    (?<subject>   cat   | dog   | gerbil    ){0} 
    (?<verb>      eats  | drinks| generates ){0} 
    (?<object>    water | bones | PDFs      ){0} 
    (?<adjective> big   | small | smelly    ){0} 

    (?<opt_adj>   (\g<adjective>\s)?     ){0} 

    The\s\g<opt_adj>\g<subject>\s\g<verb>\s\g<opt_adj>\g<object> 
}x

在他重新排列正则表达式部分的技术和我的示例之间递归命名捕获组，这是否意味着 Ruby 1.9 正则表达式具有相当于上下文无关语法的功能？

原文

I have this regular expression:

regex = %r{\A(?<foo> a\g<foo>a | b\g<foo>b | c)\Z}x

When I test it against several strings, it appears to be as powerful as a context free grammar because it handles the recursion properly.

regex.match("aaacaaa")
# => #<MatchData "aaacaaa" foo:"aaacaaa">
regex.match("aacaa")
# => #<MatchData "aacaa" foo:"aacaa">
regex.match("aabcbaa")
# => #<MatchData "aabcbaa" foo:"aabcbaa">
regex.match("aaacaa")
# => nil

"Fun with Ruby 1.9 Regular Expressions" has an example where he actually arranges all the parts of a regex so that it looks like a context-free grammar as follows:

sentence = %r{ 
    (?<subject>   cat   | dog   | gerbil    ){0} 
    (?<verb>      eats  | drinks| generates ){0} 
    (?<object>    water | bones | PDFs      ){0} 
    (?<adjective> big   | small | smelly    ){0} 

    (?<opt_adj>   (\g<adjective>\s)?     ){0} 

    The\s\g<opt_adj>\g<subject>\s\g<verb>\s\g<opt_adj>\g<object> 
}x

Between his technique for rearranging the parts of the regex, and my example of recursive named capturing groups, does this mean Ruby 1.9 regular expressions have the power equivalent to a context-free grammar?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

︶￣淡然 2025-01-04 13:06:56

这是 Ruby 1.9 中使用的 Oniguruma 正则表达式引擎的最棒的事情之一——它具有解析器的功能，并且不仅限于识别常规语言。它具有正向和负向前视/后视功能，甚至可以用来识别一些非上下文无关的语言！举个例子：

regexp = /\A(?<AB>a\g<AB>b|){0}(?=\g<AB>c)a*(?<BC>b\g<BC>c|){1}\Z/

这个正则表达式可以识别“abc”、“aabbcc”、“aaabbbccc”等字符串，其中“a”、“b”、“c”的个数必须相等，否则会出现错误。不匹配。

（一个限制：你不能在前向和后向中使用命名组。）

虽然我没有深入了解，Oniguruma 似乎通过简单的递归下降来处理命名组，当某些东西不匹配时进行备份。我发现它无法处理左递归。例如：

irb(main):013:0> regexp = /(?<A>\g<A>a|)/
SyntaxError: (irb):13: never ending recursion: /(?<A>\g<A>a|)/
    from C:/Ruby192/bin/irb:12:in `<main>'

我不太清楚我的解析理论，但我认为像这样的非确定性自顶向下解析器应该能够解析任何上下文无关语言。（“语言”，而不是“语法”；如果您的语法有左递归，则必须将其转换为右递归。）如果不正确，请编辑这篇文章。

This is one of the awesome things about the Oniguruma regexp engine used in Ruby 1.9 – it has the power of a parser, and is not restricted to recognizing regular languages. It has positive and negative lookahead/lookbehind, which even can be used to recognize some languages which are not context-free! Take the following as an example:

regexp = /\A(?<AB>a\g<AB>b|){0}(?=\g<AB>c)a*(?<BC>b\g<BC>c|){1}\Z/

This regexp recognizes strings like “abc”, “aabbcc”, “aaabbbccc”, and so on – the number of “a”, “b”, and “c” must be equal, or it will not match.

(One limitation: you can’t use named groups in the lookahead and lookbehind.)

Although I haven’t peeked under the hood, Oniguruma seems to deal with named groups by simple recursive descent, backing up when something doesn’t match. I’ve observed that it can’t deal with left recursion. For example:

irb(main):013:0> regexp = /(?<A>\g<A>a|)/
SyntaxError: (irb):13: never ending recursion: /(?<A>\g<A>a|)/
    from C:/Ruby192/bin/irb:12:in `<main>'

I don’t remember my parsing theory very clearly, but I think that a non-deterministic top-down parser like this should be able to parse any context-free language. (“language”, not “grammar”; if your grammar has left recursion, you will have to convert it to right recursion.) If that is incorrect, please edit this post.

回复收藏 0 原文

~没有更多了~