这是一系列教育正则表达式文章的第二部分。它展示了如何使用前瞻和嵌套引用来匹配非常规语言 anbn。嵌套引用首先在:此正则表达式如何查找三角形数字?
一种典型的非常规语言是:
L = { a
nb
n: n >; 0 }
这是所有非空字符串的语言,由一定数量的 a
和后跟相同数量的 b
组成。这种语言中的字符串示例有 ab
、aabb
、aaabbb
。
这种语言可以通过抽引理显示为非正则语言。它实际上是一种原型上下文无关语言,可以由上下文无关语法 S → aSb | ab
.
尽管如此,现代正则表达式实现显然不仅仅识别常规语言。也就是说,根据正式语言理论的定义,它们不是“规则的”。 PCRE 和 Perl 支持递归正则表达式,.NET 支持平衡组定义。即使不太“花哨”的功能,例如反向引用匹配,也意味着正则表达式不是常规的。
但这个“基本”功能到底有多强大呢?例如,我们可以使用 Java 正则表达式识别 L
吗?我们是否可以将环视和嵌套引用结合起来,并拥有一种适用于例如 String.matches
匹配 ab
、aabb
、<代码>aaabbb等?
参考文献
链接问题
This is the second part of a series of educational regex articles. It shows how lookaheads and nested references can be used to match the non-regular languge anbn. Nested references are first introduced in: How does this regex find triangular numbers?
One of the archetypal non-regular languages is:
L = { a
nb
n: n > 0 }
This is the language of all non-empty strings consisting of some number of a
's followed by an equal number of b
's. Examples of strings in this language are ab
, aabb
, aaabbb
.
This language can be show to be non-regular by the pumping lemma. It is in fact an archetypal context-free language, which can be generated by the context-free grammar S → aSb | ab
.
Nonetheless, modern day regex implementations clearly recognize more than just regular languages. That is, they are not "regular" by formal language theory definition. PCRE and Perl supports recursive regex, and .NET supports balancing groups definition. Even less "fancy" features, e.g. backreference matching, means that regex is not regular.
But just how powerful is this "basic" features? Can we recognize L
with Java regex, for example? Can we perhaps combine lookarounds and nested references and have a pattern that works with e.g. String.matches
to match strings like ab
, aabb
, aaabbb
, etc?
References
Linked questions
发布评论
评论(3)
鉴于没有提及 PCRE 支持递归模式,我只想指出描述相关语言的最简单、最有效的 PCRE 示例:
Given that no mention has been made of PCRE supporting recursive patterns, I'd just like to point out the simplest and most efficient example of PCRE that describes the language in question:
正如问题中提到的 - 使用 .NET 平衡组,类型的模式 anbncndn…zn 可以轻松匹配,例如
: http://www.ideone.com/usuOE
编辑:
对于具有递归模式的通用语言,还有一个 PCRE 模式,但需要先行查找。我认为这不是上述内容的直接翻译。
例如:http://www.ideone.com/9gUwF
As mentioned in the question — with .NET balancing group, the patterns of the type anbncndn…zn can be matched easily as
For example: http://www.ideone.com/usuOE
Edit:
There is also a PCRE pattern for the generalized language with recursive pattern, but a lookahead is needed. I don't think this is a direct translation of the above.
For example: http://www.ideone.com/9gUwF
不用说,答案是,是!您肯定可以编写 Java 正则表达式模式来匹配 anbn< /em>。它使用正向前瞻进行断言,并使用一个嵌套引用进行“计数”。
这个答案不会立即给出模式,而是引导读者完成推导模式的过程。随着解决方案的慢慢构建,会给出各种提示。在这方面,希望这个答案将包含的不仅仅是另一个简洁的正则表达式模式。希望读者也能学习如何“用正则表达式思考”,以及如何将各种结构和谐地组合在一起,这样他们将来就可以自己推导出更多模式。
由于其简洁性,用于开发该解决方案的语言将是 PHP。一旦模式最终确定,最终测试将在 Java 中完成。
第 1 步:断言的前瞻
让我们从一个更简单的问题开始:我们想要匹配字符串开头的
a+
,但前提是它后面紧跟着b+
。我们可以使用^
来锚定我们的匹配,因为我们只想匹配a+
而不匹配b+
,我们可以使用 前瞻断言(?=…)
。这是带有简单测试工具的模式:
输出为(如 ideone.com 上所示):
这正是我们想要的输出:我们匹配
a+
,仅当它位于字符串的开头,并且仅当它紧随其后的是b+
时。教训:您可以在环视中使用模式来做出断言。
第 2 步:在前瞻(和自由间距模式)中捕获
现在假设即使我们不希望
b+
成为匹配的一部分,但我们确实希望 将其捕获到第 1 组中。此外,由于我们预计会有更复杂的模式,因此我们使用x
free-spacing 修饰符,这样我们就可以使我们的正则表达式更具可读性。在之前的 PHP 代码片段的基础上,我们现在具有以下模式:
现在的输出(如 ideone.com 上所示 ):
请注意,例如
aaa|b
是join
的结果 - 每个组使用'|'
捕获的内容。在本例中,组 0(即模式匹配的内容)捕获了aaa
,组 1 捕获了b
。教训:您可以在环顾四周进行捕捉。您可以使用自由间距来增强可读性。
步骤 3:将前瞻重构为“循环”
在引入计数机制之前,我们需要对模式进行一项修改。目前,前瞻位于
+
重复“循环”之外。到目前为止这还不错,因为我们只是想断言a+
后面有一个b+
,但我们最终真正想要做的是断言对于我们在“循环”内匹配的每个a
,都有一个相应的b
与之配合。我们暂时不用担心计数机制,只需进行如下重构:
a+
重构为(?: a )+
(注意(?: ...)
是一个非捕获组)a*
,然后才能“看到”b+
,因此请相应地修改模式所以我们现在有以下内容:
输出是相同的和以前一样(如 ideone.com 上所示),因此在这方面没有任何变化。重要的是,现在我们在
+
“循环”的每次迭代处进行断言。对于我们当前的模式,这不是必需的,但接下来我们将使用自引用使组 1 为我们“计数”。教训:您可以在非捕获组内进行捕获。环视可以重复。
第 4 步:这是我们开始计数的步骤
这是我们要做的:我们将重写组 1,以便:
+
的第一次迭代结束时,当第一个a
匹配,它应该捕获b
a
匹配时,它应该捕获bb
bbb
b
来捕获到组 1,那么断言就会失败,所以组 1,现在是
(b+)
,必须重写为(\1 b)
之类的内容。也就是说,我们尝试将b
“添加”到组 1 在上一次迭代中捕获的内容中。这里有一个小问题,即该模式缺少“基本情况”,即无需自引用即可匹配的情况。需要基本情况,因为组 1 开始时“未初始化”;它还没有捕获任何内容(甚至没有空字符串),因此自引用尝试总是会失败。
有很多方法可以解决这个问题,但现在让我们将自引用匹配 可选 ,即
\1?
。这可能工作得很好,也可能不完美,但让我们看看它能做什么,如果有任何问题,那么当我们遇到问题时,我们就会跨过那座桥。此外,我们还将添加更多测试用例。现在的输出是(如 ideone.com 上所示):
啊哈!看来我们现在已经非常接近解决方案了!我们成功地让第 1 组使用自我引用来“计数”!但是等等...第二个和最后一个测试用例出了问题!没有足够的
b
,并且不知何故计数错误!我们将在下一步中检查为什么会发生这种情况。教训:“初始化”自引用组的一种方法是使自引用匹配可选。
步骤 4½:了解出了什么
问题 问题是,由于我们将自引用匹配设为可选,因此当没有足够的
b
时,“计数器”可以“重置”回 0。让我们仔细检查一下以aaaaabbb
作为输入的模式的每次迭代中发生的情况。啊哈!在第四次迭代中,我们仍然可以匹配
\1
,但无法匹配\1b
!由于我们允许自引用匹配对于\1?
是可选的,因此引擎回溯并采取“不,谢谢”选项,然后允许我们仅匹配和捕获b!
但请注意,除了第一次迭代之外,您始终可以仅匹配自引用
\1
。当然,这是显而易见的,因为这是我们在上一次迭代中刚刚捕获的内容,并且在我们的设置中我们总是可以再次匹配它(例如,如果我们上次捕获了bbb
,我们保证有仍然是bbb
,但这次可能有也可能没有bbbb
)。教训:谨防回溯。正则表达式引擎将进行尽可能多的回溯,直到给定的模式匹配为止。这可能会影响性能(即灾难性回溯)和/或正确性。
第五步:自我克制来拯救!
现在“修复”应该是显而易见的:将可选重复与 所有格 量词结合起来。也就是说,不要简单地使用
?
,而是使用?+
(请记住,被量化为所有格的重复不会回溯,即使这种“合作”可能会导致整体模式的匹配)。用非常非正式的术语来说,
?+
、?
和??
的意思是:在我们的设置中,
\1
第一次不会出现,但此后它始终会出现,而且我们始终那就想匹配一下。因此,\1?+
将完全实现我们想要的效果。现在输出是(如 ideone.com 上所示):
Voilà!!!问题解决了!!!我们现在正在正确计数,完全按照我们想要的方式进行!
教训:了解贪婪、不情愿和占有欲重复之间的区别。可选-所有格可以是一个强大的组合。
第 6 步:收尾工作
现在我们拥有的是一个重复匹配
a
的模式,并且对于每个匹配的a
,都有一个对应的b< /code> 在组 1 中捕获。当不再有
a
时,或者由于没有相应的b
导致断言失败时,+
终止code> 为a
。要完成这项工作,我们只需附加到我们的模式
\1 $
中。现在,这是对组 1 匹配的内容的反向引用,后跟行锚点的末尾。锚点确保字符串中没有任何额外的b
;换句话说,实际上我们有anbn。这是最终确定的模式,带有其他测试用例,其中一个长度为 10,000 个字符:
它找到 4 个匹配项:
ab
、aabb
、aaabbb
和a5000b5000。 在 ideone.com 上运行只需 0.06 秒。第 7 步:Java 测试
因此,该模式可以在 PHP 中运行,但最终目标是编写一个可以在 Java 中运行的模式。
该模式按预期工作(如 ideone.com 上所示)。
现在我们得出结论...
需要指出的是,前瞻中的
a*
以及“main+
循环”实际上都允许回溯。鼓励读者确认为什么这在正确性方面不是问题,以及为什么同时使两个所有格也能起作用(尽管以相同模式混合强制和非强制所有格量词可能会导致误解)。还应该说,虽然有一个正则表达式模式可以匹配anbn,但这并不总是“实践中最好的解决方案。更好的解决方案是简单地匹配
^(a+)(b+)$
,然后比较托管编程语言中组 1 和组 2 捕获的字符串的长度。在 PHP 中,它可能看起来像这样(如 ideone.com 中所示):
本文的目的是 < em>不是让读者相信正则表达式几乎可以做任何事情;它显然不能,即使对于它可以做的事情,如果它能带来更简单的解决方案,至少应该考虑部分委托给托管语言。
正如顶部所提到的,虽然本文必须为 stackoverflow 标记为
[regex]
,但它的含义可能不止于此。虽然学习断言、嵌套引用、所有格量词等当然是有价值的,但也许这里更大的一课是尝试解决问题的创造性过程,当你遇到困难时,它通常需要决心和努力工作。各种约束、由各个部分组成的系统性构建工作解决方案等。奖励材料! PCRE递归模式!
既然我们确实提出了 PHP,就需要说 PCRE 支持递归模式和子例程。因此,以下模式适用于
preg_match
(如 ideone.com 上所示):正则表达式不支持递归模式。
更多奖励材料!匹配anbncn!!
我们已经了解了如何匹配 anbn,它是非常规的,但仍然与上下文无关,但我们也可以匹配anbncn,这甚至不是上下文无关的?
答案当然是,是!鼓励读者尝试自己解决这个问题,但解决方案在下面提供(ideone.com 上的 Java 实现)。
The answer is, needless to say, YES! You can most certainly write a Java regex pattern to match anbn. It uses a positive lookahead for assertion, and one nested reference for "counting".
Rather than immediately giving out the pattern, this answer will guide readers through the process of deriving it. Various hints are given as the solution is slowly constructed. In this aspect, hopefully this answer will contain much more than just another neat regex pattern. Hopefully readers will also learn how to "think in regex", and how to put various constructs harmoniously together, so they can derive more patterns on their own in the future.
The language used to develop the solution will be PHP for its conciseness. The final test once the pattern is finalized will be done in Java.
Step 1: Lookahead for assertion
Let's start with a simpler problem: we want to match
a+
at the beginning of a string, but only if it's followed immediately byb+
. We can use^
to anchor our match, and since we only want to match thea+
without theb+
, we can use lookahead assertion(?=…)
.Here is our pattern with a simple test harness:
The output is (as seen on ideone.com):
This is exactly the output we want: we match
a+
, only if it's at the beginning of the string, and only if it's immediately followed byb+
.Lesson: You can use patterns in lookarounds to make assertions.
Step 2: Capturing in a lookahead (and f r e e - s p a c i n g mode)
Now let's say that even though we don't want the
b+
to be part of the match, we do want to capture it anyway into group 1. Also, as we anticipate having a more complicated pattern, let's usex
modifier for free-spacing so we can make our regex more readable.Building on our previous PHP snippet, we now have the following pattern:
The output is now (as seen on ideone.com):
Note that e.g.
aaa|b
is the result ofjoin
-ing what each group captured with'|'
. In this case, group 0 (i.e. what the pattern matched) capturedaaa
, and group 1 capturedb
.Lesson: You can capture inside a lookaround. You can use free-spacing to enhance readability.
Step 3: Refactoring the lookahead into the "loop"
Before we can introduce our counting mechanism, we need to do one modification to our pattern. Currently, the lookahead is outside of the
+
repetition "loop". This is fine so far because we just wanted to assert that there's ab+
following oura+
, but what we really want to do eventually is assert that for eacha
that we match inside the "loop", there's a correspondingb
to go with it.Let's not worry about the counting mechanism for now and just do the refactoring as follows:
a+
to(?: a )+
(note that(?:…)
is a non-capturing group)a*
before we can "see" theb+
, so modify the pattern accordinglySo we now have the following:
The output is the same as before (as seen on ideone.com), so there's no change in that regard. The important thing is that now we are making the assertion at every iteration of the
+
"loop". With our current pattern, this is not necessary, but next we'll make group 1 "count" for us using self-reference.Lesson: You can capture inside a non-capturing group. Lookarounds can be repeated.
Step 4: This is the step where we start counting
Here's what we're going to do: we'll rewrite group 1 such that:
+
, when the firsta
is matched, it should captureb
a
is matched, it should capturebb
bbb
b
to capture into group 1 then the assertion simply failsSo group 1, which is now
(b+)
, will have to be rewritten to something like(\1 b)
. That is, we try to "add" ab
to what group 1 captured in the previous iteration.There's a slight problem here in that this pattern is missing the "base case", i.e. the case where it can match without the self-reference. A base case is required because group 1 starts "uninitialized"; it hasn't captured anything yet (not even an empty string), so a self-reference attempt will always fail.
There are many ways around this, but for now let's just make the self-reference matching optional, i.e.
\1?
. This may or may not work perfectly, but let's just see what that does, and if there's any problem then we'll cross that bridge when we come to it. Also, we'll add some more test cases while we're at it.The output is now (as seen on ideone.com):
A-ha! It looks like we're really close to the solution now! We managed to get group 1 to "count" using self-reference! But wait... something is wrong with the second and the last test cases!! There aren't enough
b
s, and somehow it counted wrong! We'll examine why this happened in the next step.Lesson: One way to "initialize" a self-referencing group is to make the self-reference matching optional.
Step 4½: Understanding what went wrong
The problem is that since we made the self-reference matching optional, the "counter" can "reset" back to 0 when there aren't enough
b
's. Let's closely examine what happens at every iteration of our pattern withaaaaabbb
as input.A-ha! On our 4th iteration, we could still match
\1
, but we couldn't match\1b
! Since we allow the self-reference matching to be optional with\1?
, the engine backtracks and took the "no thanks" option, which then allows us to match and capture justb
!Do note, however, that except on the very first iteration, you could always match just the self-reference
\1
. This is obvious, of course, since it's what we just captured on our previous iteration, and in our setup we can always match it again (e.g. if we capturedbbb
last time, we're guaranteed that there will still bebbb
, but there may or may not bebbbb
this time).Lesson: Beware of backtracking. The regex engine will do as much backtracking as you allow until the given pattern matches. This may impact performance (i.e. catastrophic backtracking) and/or correctness.
Step 5: Self-possession to the rescue!
The "fix" should now be obvious: combine optional repetition with possessive quantifier. That is, instead of simply
?
, use?+
instead (remember that a repetition that is quantified as possessive does not backtrack, even if such "cooperation" may result in a match of the overall pattern).In very informal terms, this is what
?+
,?
and??
says:In our setup,
\1
will not be there the very first time, but it will always be there any time after that, and we always want to match it then. Thus,\1?+
would accomplish exactly what we want.Now the output is (as seen on ideone.com):
Voilà!!! Problem solved!!! We are now counting properly, exactly the way we want it to!
Lesson: Learn the difference between greedy, reluctant, and possessive repetition. Optional-possessive can be a powerful combination.
Step 6: Finishing touches
So what we have right now is a pattern that matches
a
repeatedly, and for everya
that was matched, there is a correspondingb
captured in group 1. The+
terminates when there are no morea
, or if the assertion failed because there isn't a correspondingb
for ana
.To finish the job, we simply need to append to our pattern
\1 $
. This is now a back reference to what group 1 matched, followed by the end of the line anchor. The anchor ensures that there aren't any extrab
's in the string; in other words, that in fact we have anbn.Here's the finalized pattern, with additional test cases, including one that's 10,000 characters long:
It finds 4 matches:
ab
,aabb
,aaabbb
, and the a5000b5000. It takes only 0.06s to run on ideone.com.Step 7: The Java test
So the pattern works in PHP, but the ultimate goal is to write a pattern that works in Java.
The pattern works as expected (as seen on ideone.com).
And now we come to the conclusion...
It needs to be said that the
a*
in the lookahead, and indeed the "main+
loop", both permit backtracking. Readers are encouraged to confirm why this is not a problem in terms of correctness, and why at the same time making both possessive would also work (though perhaps mixing mandatory and non-mandatory possessive quantifier in the same pattern may lead to misperceptions).It should also be said that while it's neat that there's a regex pattern that will match anbn, this is in not always the "best" solution in practice. A much better solution is to simply match
^(a+)(b+)$
, and then compare the length of the strings captured by groups 1 and 2 in the hosting programming language.In PHP, it may look something like this (as seen in ideone.com):
The purpose of this article is NOT to convince readers that regex can do almost anything; it clearly can't, and even for the things it can do, at least partial delegation to the hosting language should be considered if it leads to a simpler solution.
As mentioned at the top, while this article is necessarily tagged
[regex]
for stackoverflow, it is perhaps about more than that. While certainly there's value in learning about assertions, nested references, possessive quantifier, etc, perhaps the bigger lesson here is the creative process by which one can try to solve problems, the determination and hard work that it often requires when you're subjected to various constraints, the systematic composition from various parts to build a working solution, etc.Bonus material! PCRE recursive pattern!
Since we did bring up PHP, it needs to be said that PCRE supports recursive pattern and subroutines. Thus, following pattern works for
preg_match
(as seen on ideone.com):Currently Java's regex does not support recursive pattern.
Even more bonus material! Matching anbncn !!
So we've seen how to match anbn which is non-regular, but still context-free, but can we also match anbncn, which isn't even context-free?
The answer is, of course, YES! Readers are encouraged to try to solve this on their own, but the solution is provided below (with implementation in Java on ideone.com).