Can I use Perl regular expressions to match balanced text?
I would like to match text enclosed in brackets (or similar paired delimiters) in Perl. How can I do that?
This is a question from the official perlfaq. We're importing the perlfaq to Stack Overflow.
This is the official FAQ answer minus any subsequent edits.
Your first try should probably be the Text::Balanced module, which has been in the Perl standard library since Perl 5.8. It has a variety of functions to deal with tricky text. The Regexp::Common module can also help by providing canned patterns you can use.
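As a minimal sketch of the Text::Balanced route, the following uses its extract_bracketed function with angle brackets as the only delimiters. The sample text and variable names are assumptions chosen for illustration. Note that extract_bracketed expects the bracketed text to start at the beginning of the string (after optional whitespace).

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

# Hypothetical sample text; it starts with the opening delimiter.
my $text = '<another group <nested once <nested twice> > > and that is it.';

# In list context, extract_bracketed returns the balanced substring
# followed by whatever text remains after it.
my ($extracted, $remainder) = extract_bracketed($text, '<>');

print "Extracted: $extracted\n";
print "Remainder: $remainder\n";
```

This handles the nesting without a hand-written regular expression, which is why it is a reasonable first thing to try.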
As of Perl 5.10, you can match balanced text with regular expressions using recursive patterns. Before Perl 5.10, you had to resort to various tricks such as using Perl code in (??{}) sequences.
Here's an example using a recursive regular expression. The goal is to capture all of the text within angle brackets, including the text in nested angle brackets. This sample text has two "major" groups: a group with one level of nesting and a group with two levels of nesting. There are five total groups in angle brackets:
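A sample string fitting that description might look like the sketch below. The exact wording of the text is an assumption; what matters is that it contains two top-level groups and five groups overall. The count at the end is only a sanity check.

```perl
use strict;
use warnings;

# Assumed sample text: two "major" groups, five groups in total.
my $string = <<"HERE";
I have some <brackets in <nested brackets> > and
<another group <nested once <nested twice> > >
and that's it.
HERE

# Count the opening angle brackets; each one starts a group.
my $open = () = $string =~ /</g;
print "Opening brackets: $open\n";
```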
The regular expression to match the balanced text uses two new (to Perl 5.10) regular expression features. These are covered in perlre and this example is a modified version of one in that documentation.
First, adding the new possessive + to any quantifier finds the longest match and does not backtrack. That's important since you want to handle any angle brackets through the recursion, not backtracking. The group [^<>]++ finds one or more non-angle brackets without backtracking.
Second, the new (?PARNO) refers to the sub-pattern in the particular capture group given by PARNO. In the following regex, the first capture group finds (and remembers) the balanced text, and you need that same pattern within the first buffer to get past the nested text. That's the recursive part. The (?1) uses the pattern in the outer capture group as an independent part of the regex.
Putting it all together, you have:
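A sketch of such a program, modeled on the recursion example in perlre, could look like this. The sample string and the output formatting are assumptions; the pattern combines the possessive [^<>]++ with the (?1) recursion described above and needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Assumed sample text with two top-level groups.
my $string = <<"HERE";
I have some <brackets in <nested brackets> > and
<another group <nested once <nested twice> > >
and that's it.
HERE

my @groups = $string =~ m/
        (                   # start of capture group 1
        <                   # match an opening angle bracket
            (?:
                [^<>]++     # one or more non angle brackets, no backtracking
                  |
                (?1)        # found an angle bracket, so recurse into group 1
            )*
        >                   # match a closing angle bracket
        )                   # end of capture group 1
        /xg;

# Print each top-level match on its own tab-indented line.
print "Found:\n", map { "\t$_\n" } @groups;
```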
The output shows that Perl found the two major groups:
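Assuming the sample text is "I have some <brackets in <nested brackets> > and <another group <nested once <nested twice> > > and that's it." and each match is printed on its own indented line under a Found: heading, the output would look something like this:

```text
Found:
	<brackets in <nested brackets> >
	<another group <nested once <nested twice> > >
```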
With a little extra work, you can get all of the groups in angle brackets even if they are in other angle brackets too. Each time you get a balanced match, remove its outer delimiters (those are the ones you just matched, so don't match them again) and add the result to a queue of strings to process. Keep doing that until you get no matches:
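The queue-based approach can be sketched as follows. The sample string, the @found accumulator, and the variable names are assumptions for illustration; the pattern is the same recursive one, compiled once with qr so it can be reused on every queued string.

```perl
use strict;
use warnings;

# Assumed sample text: two top-level groups, five groups in total.
my $string = <<"HERE";
I have some <brackets in <nested brackets> > and
<another group <nested once <nested twice> > >
and that's it.
HERE

my $regex = qr/
        (                   # start of capture group 1
        <                   # match an opening angle bracket
            (?:
                [^<>]++     # one or more non angle brackets, no backtracking
                  |
                (?1)        # found an angle bracket, so recurse into group 1
            )*
        >                   # match a closing angle bracket
        )                   # end of capture group 1
        /x;

my @queue = ($string);      # strings still to be searched
my @found;                  # every group found, in discovery order

while (@queue) {
    my $candidate = shift @queue;

    my @groups = $candidate =~ m/$regex/g;
    next unless @groups;

    push @found, @groups;
    print "Found:\n", map { "\t$_\n" } @groups;
    print "\n";

    # Strip the outer delimiters from each match and queue the inner
    # text at the front, so nested groups are searched next.
    unshift @queue, map { (my $inner = $_) =~ s/^<|>$//g; $inner } @groups;
}
```

Stripping the outer pair before re-queuing is what keeps the loop from matching the same group forever: each pass works on strictly shorter text, so the queue eventually empties.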
The output shows all of the groups. The outermost matches show up first and the nested matches show up later:
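Assuming the same sample text, with inner strings placed at the front of the queue and each batch of matches printed under its own Found: heading, the output would look something like this:

```text
Found:
	<brackets in <nested brackets> >
	<another group <nested once <nested twice> > >

Found:
	<nested brackets>

Found:
	<nested once <nested twice> >

Found:
	<nested twice>
```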