Can I use Perl regular expressions to match balanced text?

Posted 2024-10-08 00:11:31

I would like to match text enclosed in brackets, etc., in Perl. How can I do that?


This is a question from the official perlfaq. We're importing the perlfaq to Stack Overflow.

戏舞 2024-10-15 00:11:31

This is the official FAQ answer minus any subsequent edits.

Your first try should probably be the Text::Balanced module, which has been part of the Perl standard library since Perl 5.8. It has a variety of functions to deal with tricky text. The Regexp::Common module can also help by providing canned patterns you can use.
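
As a rough sketch of the Text::Balanced route, the loop below pulls the top-level angle-bracket groups out of the same sample text used later in this answer. The variable names and the '[^<]*' prefix pattern (which skips ordinary text in front of each group) are illustrative choices, not part of the FAQ answer:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

my $text = "I have some <brackets in <nested brackets> > and\n"
         . "<another group <nested once <nested twice> > >\n"
         . "and that's it.\n";

my @found;
while (1) {
    # In list context the function returns (extracted, remainder, skipped prefix).
    # '<>' names the bracket type; '[^<]*' is an assumed prefix pattern that
    # skips everything up to the next opening angle bracket.
    my ($extracted, $remainder) = extract_bracketed($text, '<>', '[^<]*');

    # Stop when nothing more could be extracted.
    last unless defined $extracted && length $extracted;

    push @found, $extracted;
    $text = $remainder;    # continue with what is left
}

print "$_\n" for @found;
```

Each pass returns one balanced group plus the text that follows it, so the loop ends as soon as the remainder holds no further group.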

As of Perl 5.10, you can match balanced text with regular expressions using recursive patterns. Before Perl 5.10, you had to resort to various tricks such as using Perl code in (??{}) sequences.
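
For comparison, here is a minimal sketch of the older (??{}) approach, modeled on the postponed-subexpression idiom described in perlre. The names $balanced and $text are made up for illustration. The pattern is stored in a package variable on the assumption that this is the safest choice for older perls, where lexical variables inside regex code blocks could behave unexpectedly:

```perl
use strict;
use warnings;

# Package variable so the code block sees the finished pattern at match time.
our $balanced;
$balanced = qr/
    <
    (?:
        (?> [^<>]+ )        # atomic group: plain text, never backtracked into
      |
        (??{ $balanced })   # postponed subexpression: re-enter the whole pattern
    )*
    >
/x;

my $text  = "a <b <c> > d <e>";
my @found = $text =~ /($balanced)/g;

print "$_\n" for @found;
```

The (??{ ... }) block runs Perl code during the match and uses its result as a pattern, which is how the regex manages to refer to itself.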

Here's an example using a recursive regular expression. The goal is to capture all of the text within angle brackets, including the text in nested angle brackets. This sample text has two "major" groups: a group with one level of nesting and a group with two levels of nesting. There are five total groups in angle brackets:

I have some <brackets in <nested brackets> > and
<another group <nested once <nested twice> > >
and that's it.

The regular expression to match the balanced text uses two regular expression features that were new in Perl 5.10. Both are covered in perlre, and this example is a modified version of one in that documentation.

First, adding the new possessive + to any quantifier makes it match as much as it can and never give any of that text back through backtracking. That's important since you want to handle any angle brackets through the recursion, not backtracking. The pattern [^<>]++ matches one or more characters that are not angle brackets, without backtracking.
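
A small sketch may make the difference concrete. The test string and variable names here are invented for illustration. It compares a greedy quantifier, a possessive quantifier, and the equivalent atomic group on the same input:

```perl
use strict;
use warnings;

my $text = 'aaa';

# Greedy: a+ first takes all three characters, then gives one back
# so that the final "a" in the pattern can match.
my $greedy = $text =~ /a+a/ ? 1 : 0;

# Possessive: a++ keeps every "a" it took, so the final "a" never matches.
my $possessive = $text =~ /a++a/ ? 1 : 0;

# An atomic group (?>...) behaves the same way as the possessive form.
my $atomic = $text =~ /(?>a+)a/ ? 1 : 0;

print "greedy=$greedy possessive=$possessive atomic=$atomic\n";
# prints "greedy=1 possessive=0 atomic=0"
```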

Second, the new (?PARNO) construct recurses into the sub-pattern of the capture group whose number is PARNO. In the following regex, the first capture group finds (and remembers) the balanced text, and you need that same pattern inside the first capture group to get past the nested text. That's the recursive part. The (?1) reuses the pattern of the outer capture group (not the text it matched) as an independent part of the regex.
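
To see that (?1) reuses the pattern rather than the matched text, it helps to compare it with the backreference \1. This sketch uses a made-up test string:

```perl
use strict;
use warnings;

my $text = '2024-10';

# (?1) runs the pattern of group 1 (\d+) again, so any run of digits will do.
my $with_recursion = $text =~ /^(\d+)-(?1)$/ ? 1 : 0;

# \1 demands the same text that group 1 captured ("2024"), so "10" fails.
my $with_backref = $text =~ /^(\d+)-\1$/ ? 1 : 0;

print "recursion=$with_recursion backref=$with_backref\n";
# prints "recursion=1 backref=0"
```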

Putting it all together, you have:

#!/usr/local/bin/perl5.10.0
use strict;
use warnings;

my $string =<<"HERE";
I have some <brackets in <nested brackets> > and
<another group <nested once <nested twice> > >
and that's it.
HERE

my @groups = $string =~ m/
        (                   # start of capture group 1
        <                   # match an opening angle bracket
            (?:
                [^<>]++     # one or more non angle brackets, non backtracking
                  |
                (?1)        # found < or >, so recurse to capture group 1
            )*
        >                   # match a closing angle bracket
        )                   # end of capture group 1
        /xg;

$" = "\n\t";
print "Found:\n\t@groups\n";

The output shows that Perl found the two major groups:

Found:
    <brackets in <nested brackets> >
    <another group <nested once <nested twice> > >

With a little extra work, you can get all of the groups in angle brackets, even those nested inside other angle brackets. Each time you get a balanced match, remove its outer delimiters (those are the ones you just matched, so don't match them again) and add the result to a queue of strings to process. Keep doing that until you get no matches:

#!/usr/local/bin/perl5.10.0
use strict;
use warnings;

my @queue =<<"HERE";
I have some <brackets in <nested brackets> > and
<another group <nested once <nested twice> > >
and that's it.
HERE

my $regex = qr/
        (                   # start of bracket 1
        <                   # match an opening angle bracket
            (?:
                [^<>]++     # one or more non angle brackets, non backtracking
                  |
                (?1)        # recurse to bracket 1
            )*
        >                   # match a closing angle bracket
        )                   # end of bracket 1
        /x;

$" = "\n\t";

while( @queue )
    {
    my $string = shift @queue;

    my @groups = $string =~ m/$regex/g;
    print "Found:\n\t@groups\n\n" if @groups;

    unshift @queue, map { s/^<//; s/>$//; $_ } @groups;
    }

The output shows all of the groups. The outermost matches show up first and the nested matches show up later:

Found:
    <brackets in <nested brackets> >
    <another group <nested once <nested twice> > >

Found:
    <nested brackets>

Found:
    <nested once <nested twice> >

Found:
    <nested twice>
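
As an alternative to the queue, and not part of the FAQ answer, the same recursive pattern can be wrapped in a lookahead so that a single /g match reports a group at every opening bracket. This is only a sketch under that assumption. The groups come back in order of their starting position rather than outermost first:

```perl
use strict;
use warnings;

my $string = <<"HERE";
I have some <brackets in <nested brackets> > and
<another group <nested once <nested twice> > >
and that's it.
HERE

my @groups = $string =~ m/
    (?=                     # look ahead, so no text is consumed
        (                   # start of capture group 1
        <                   # match an opening angle bracket
            (?:
                [^<>]++     # one or more non angle brackets, non backtracking
              |
                (?1)        # recurse to capture group 1
            )*
        >                   # match a closing angle bracket
        )                   # end of capture group 1
    )
    /xg;

print "$_\n" for @groups;
```

Because the lookahead matches zero characters, the next attempt starts one character further on, which lets a nested group be found even though it sits inside a group that was already reported.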