可以使用正则表达式来匹配嵌套模式吗?
是否可以编写一个正则表达式来匹配出现次数未知的嵌套模式? 例如,当外大括号内嵌套未知数量的左大括号时,正则表达式是否可以匹配左大括号和右大括号?
例如:
public MyMethod()
{
if (test)
{
// More { }
}
// More { }
} // End
应匹配:
{
if (test)
{
// More { }
}
// More { }
}
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
data:image/s3,"s3://crabby-images/d5906/d59060df4059a6cc364216c4d63ceec29ef7fe66" alt="扫码二维码加入Web技术交流群"
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(11)
不,就这么简单。 有限自动机(正则表达式底层的数据结构)除了其所处的状态之外没有内存,如果您有任意深度的嵌套,则需要一个任意大的自动机,这与 的概念相冲突有限自动机。
您可以将嵌套/配对元素匹配到固定深度,其中深度仅受您的内存限制,因为自动机变得非常大。 然而,在实践中,您应该使用下推自动机,即上下文无关语法的解析器,例如 LL(自上而下)或 LR(自下而上)。 您必须考虑到更糟糕的运行时行为:O(n^3) 与 O(n),其中 n = length(input)。
有许多可用的解析器生成器,例如用于 Java 的 ANTLR。 找到 Java(或 C)的现有语法也不难。
有关更多背景信息:维基百科的自动机理论
No. It's that easy. A finite automaton (which is the data structure underlying a regular expression) does not have memory apart from the state it's in, and if you have arbitrarily deep nesting, you need an arbitrarily large automaton, which collides with the notion of a finite automaton.
You can match nested/paired elements up to a fixed depth, where the depth is only limited by your memory, because the automaton gets very large. In practice, however, you should use a push-down automaton, i.e a parser for a context-free grammar, for instance LL (top-down) or LR (bottom-up). You have to take the worse runtime behavior into account: O(n^3) vs. O(n), with n = length(input).
There are many parser generators avialable, for instance ANTLR for Java. Finding an existing grammar for Java (or C) is also not difficult.
For more background: Automata Theory at Wikipedia
使用正则表达式检查嵌套模式非常简单。
Using regular expressions to check for nested patterns is very easy.
如果字符串在一行上,则可能是有效的 Perl 解决方案:
HTH
编辑: 检查:
还有一件事 Torsten Marek(他正确地指出,它不再是正则表达式):
Probably working Perl solution, if the string is on one line:
HTH
EDIT: check:
And one more thing by Torsten Marek (who had pointed out correctly, that it's not a regex anymore):
常规语言的泵引理是你不能这样做的原因。
生成的自动机将具有有限数量的状态,例如 k,因此一串 k+1 个左大括号必然会在某处重复某个状态(当自动机处理字符时)。 同一状态之间的字符串部分可以无限次重复,并且自动机不会知道其中的差异。
特别是,如果它接受 k+1 个左大括号,后跟 k+1 个右大括号(它应该),它也将接受泵送的左大括号数量,后跟未更改的 k+1 个右大括号(它不应该)。
The Pumping lemma for regular languages is the reason why you can't do that.
The generated automaton will have a finite number of states, say k, so a string of k+1 opening braces is bound to have a state repeated somewhere (as the automaton processes the characters). The part of the string between the same state can be duplicated infinitely many times and the automaton will not know the difference.
In particular, if it accepts k+1 opening braces followed by k+1 closing braces (which it should) it will also accept the pumped number of opening braces followed by unchanged k+1 closing brases (which it shouldn't).
是的,如果它是 .NET RegEx 引擎。 .Net 引擎支持由外部堆栈提供的有限状态机。 请参阅详细信息
Yes, if it is .NET RegEx-engine. .Net engine supports finite state machine supplied with an external stack. see details
正确的正则表达式将无法做到这一点,因为您将离开正则语言领域而进入上下文无关语言领域。
尽管如此,许多语言提供的“正则表达式”包实际上更强大。
例如, Lua 正则表达式具有将匹配的“
%b()
”识别器平衡括号。 在您的情况下,您将使用“%b{}
”另一个类似于 sed 的复杂工具是 gema,您可以轻松地将平衡花括号与
{#}
匹配。因此,根据您可以使用的工具,您的“正则表达式”(广义上)可能能够匹配嵌套括号。
Proper Regular expressions would not be able to do it as you would leave the realm of Regular Languages to land in the Context Free Languages territories.
Nevertheless the "regular expression" packages that many languages offer are strictly more powerful.
For example, Lua regular expressions have the "
%b()
" recognizer that will match balanced parenthesis. In your case you would use "%b{}
"Another sophisticated tool similar to sed is gema, where you will match balanced curly braces very easily with
{#}
.So, depending on the tools you have at your disposal your "regular expression" (in a broader sense) may be able to match nested parenthesis.
是的
......假设有一些最大的嵌套数量,您会很乐意停下来。
让我解释。
@torsten-marek 是正确的,正则表达式无法检查这样的嵌套模式,但是 可以定义一个嵌套的正则表达式模式,它允许您捕获这样的嵌套结构达到某个最大深度。 我创建了一个来捕获 EBNF 风格 评论 (在这里尝试一下),例如:
正则表达式(用于单深度注释)如下:
这可以很容易地进行调整为了您的目的,将
\(+\*+
和\*+\)+
替换为{
和}
> 并用简单的[^{}]
替换中间的所有内容:(这里是链接 进行尝试。)
要嵌套,只需在块本身中允许此模式:
要查找三重嵌套块,请使用:
出现了清晰的模式。 要查找嵌套深度为 N 的注释,只需使用正则表达式:
可以编写脚本来递归生成这些正则表达式,但这超出了我需要的范围。 (这留给读者作为练习。
YES
...assuming that there is some maximum number of nestings you'd be happy to stop at.
Let me explain.
@torsten-marek is right that a regular expression cannot check for nested patterns like this, BUT it is possible to define a nested regex pattern which will allow you to capture nested structures like this up to some maximum depth. I created one to capture EBNF-style comments (try it out here), like:
The regex (for single-depth comments) is the following:
This could easily be adapted for your purposes by replacing the
\(+\*+
and\*+\)+
with{
and}
and replacing everything in between with a simple[^{}]
:(Here's the link to try that out.)
To nest, just allow this pattern within the block itself:
To find triple-nested blocks, use:
A clear pattern has emerged. To find comments nested to a depth of
N
, simply use the regex:A script could be written to recursively generate these regexes, but that's beyond the scope of what I need this for. (This is left as an exercise for the reader. ????)
正如 zsolt 提到的,一些正则表达式引擎支持递归——当然,这些引擎通常使用回溯算法,因此它不会特别有效。 例如:
/(?>[^{}]*){(?>[^{}]*)(?R)*(?>[^{}]*)}/sm
as zsolt mentioned, some regex engines support recursion -- of course, these are typically the ones that use a backtracking algorithm so it won't be particularly efficient. example:
/(?>[^{}]*){(?>[^{}]*)(?R)*(?>[^{}]*)}/sm
不,此时您正在进入上下文无关语法的领域。
No, you are getting into the realm of Context Free Grammars at that point.
这似乎有效:
/(\{(?:\{.*\}|[^\{])*\})/m
This seems to work:
/(\{(?:\{.*\}|[^\{])*\})/m
在 PHP 正则表达式引擎中使用递归匹配比括号的过程匹配要快得多。 尤其是对于较长的琴弦。
http://php.net/manual/en/regexp.reference.recursive.php
例如
与
Using the recursive matching in the PHP regex engine is massively faster than procedural matching of brackets. especially with longer strings.
http://php.net/manual/en/regexp.reference.recursive.php
e.g.
vs.