How can I tell which alternative matched in a Perl regular expression pattern?

Posted on 2024-12-14 16:09:43

I have a list of regular expressions (about 10 to 15) that I need to match against some text. Matching them one by one in a loop is too slow. Instead of writing my own state machine to match all the regexes at once, I am trying to join the individual regexes with | and let perl do the work. The problem is: how do I know which of the alternatives matched?

This question addresses the case where there are no capturing groups inside each individual regex ("which portion is matched by regex?"). What if there are capturing groups inside each regex?

So with the following,

/^(A(\d+))|(B(\d+))|(C(\d+))$/

and the string "A123", how can I both know that A123 matched and extract "123"?

Comments (5)

五里雾 2024-12-21 16:09:48

With your example data, it is easy to write

'A123' =~ /^([ABC])(\d+)$/;

after which $1 will contain the prefix and $2 the suffix.

I cannot tell whether this is relevant to your real data, but to use an additional module seems like overkill.

淡水深流 2024-12-21 16:09:47

A123 will be in capture group $1, and 123 will be in group $2.

So you could say:

if ( /^(A(\d+))|(B(\d+))|(C(\d+))$/ && $1 eq 'A123' && $2 eq '123' ) {
    ...
}

This is redundant, but you get the idea...

EDIT: No, you don't have to enumerate each submatch. You asked how to know whether A123 matched and how to extract 123:

  • You won't enter the if block above unless A123 matched (that is what the $1 eq 'A123' test checks).
  • You can extract 123 using the $2 capture variable.

So maybe this example would have been more clear:

if ( /^(A(\d+))|(B(\d+))|(C(\d+))$/ ) {
    # do something with $2, which will be '123' assuming $_ matches /^A123/
}
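
If the other branches matter too, one way to tell them apart is to check which capture groups are defined after the match. This is only a sketch, not the poster's code: which_alternative is a hypothetical helper name, and the pattern is wrapped in (?: ... ) on the assumption that ^ and $ are meant to apply to every branch (as written in the question, ^ binds only to the A branch and $ only to the C branch).

```perl
use strict;
use warnings;

# Hypothetical helper: returns (label, digits) for the branch that matched,
# or an empty list when nothing matched.
sub which_alternative {
    my ($text) = @_;

    # (?: ... ) makes ^ and $ apply to all three branches.
    # Groups 1-2 belong to the A branch, 3-4 to B, 5-6 to C.
    if ( $text =~ /^(?:(A(\d+))|(B(\d+))|(C(\d+)))$/ ) {
        return ( 'A', $2 ) if defined $1;
        return ( 'B', $4 ) if defined $3;
        return ( 'C', $6 ) if defined $5;
    }
    return;
}

for my $input (qw( A123 B456 X789 )) {
    my ( $label, $digits ) = which_alternative($input);
    if ( defined $label ) {
        print "$input: branch $label, digits $digits\n";
    }
    else {
        print "$input: no match\n";
    }
}
```

Groups that belong to a branch that did not take part in the match are left undefined, which is what the defined tests rely on.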

EDIT 2:

To capture matches in an AoA (which is a different question, but this should do it):

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

my @matches = map { [$1,$2] if /^(?:(A|B|C)(\d+))$/ } <DATA>;
print Dumper \@matches;

__DATA__
A123
B456
C769

Result:

$VAR1 = [
          [
            'A',
            '123'
          ],
          [
            'B',
            '456'
          ],
          [
            'C',
            '769'
          ]
        ];

Note that I modified your regex, but it looks like that's what you're going for judging by your comment...

吃→可爱长大的 2024-12-21 16:09:47

在 Perl 中可以做的另一件事是使用“(?{...})”将 Perl 代码直接嵌入到正则表达式中。因此,您可以设置一个变量来告诉您正则表达式的哪一部分匹配。警告:您的正则表达式不应包含任何变量(嵌入的 Perl 代码之外),这些变量将被插入到正则表达式中,否则您将收到错误。这是一个使用此功能的示例解析器:

my $kind;
my $REGEX  = qr/
          [A-Za-z][\w]*                        (?{$kind = 'IDENT';})
        | (?: ==? | != | <=? | >=? )           (?{$kind = 'OP';})
        | -?\d+                                (?{$kind = 'INT';})
        | \x27 ( (?:[^\x27] | \x27{2})* ) \x27 (?{$kind = 'STRING';})
        | \S                                   (?{$kind = 'OTHER';})
        /xs;

my $line = "if (x == 'that') then x = -23 and y = 'say ''hi'' for me';";
my @tokens;
while ($line =~ /( $REGEX )/xsg) {
    my($match, $str) = ($1,$2);
    if ($kind eq 'STRING') {
        $str =~ s/\x27\x27/\x27/g;
        push(@tokens, ['STRING', $str]);
        }
    else {
        push(@tokens, [$kind, $match]);
        }
    }
foreach my $lItems (@tokens) {
    print("$lItems->[0]: $lItems->[1]\n");
    }

它打印出以下内容:

IDENT: if
OTHER: (
IDENT: x
OP: ==
STRING: that
OTHER: )
IDENT: then
IDENT: x
OP: =
INT: -23
IDENT: and
IDENT: y
OP: =
STRING: say 'hi' for me
OTHER: ;

这是一种人为的做法,但您会注意到字符串周围的引号(实际上是撇号)被剥离(而且,连续的引号被折叠为单引号) ,所以一般来说,只有 $kind 变量会告诉您解析器是否看到了标识符或带引号的字符串。

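Another thing you can do in Perl is embed Perl code directly in your regex using "(?{...})", so you can set a variable that tells you which part of the regex matched. WARNING: be careful about interpolating variables into such a regex (outside of the embedded Perl code). Perl refuses code blocks that arrive through run-time string interpolation unless use re 'eval' is in effect, and older versions refuse any interpolated plain-string variable in a pattern that contains code blocks. Interpolating a precompiled qr// object, as the parser below does, is fine. Here is a sample parser that uses this feature: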

my $kind;
my $REGEX  = qr/
          [A-Za-z][\w]*                        (?{$kind = 'IDENT';})
        | (?: ==? | != | <=? | >=? )           (?{$kind = 'OP';})
        | -?\d+                                (?{$kind = 'INT';})
        | \x27 ( (?:[^\x27] | \x27{2})* ) \x27 (?{$kind = 'STRING';})
        | \S                                   (?{$kind = 'OTHER';})
        /xs;

my $line = "if (x == 'that') then x = -23 and y = 'say ''hi'' for me';";
my @tokens;
while ($line =~ /( $REGEX )/xsg) {
    my($match, $str) = ($1,$2);
    if ($kind eq 'STRING') {
        $str =~ s/\x27\x27/\x27/g;
        push(@tokens, ['STRING', $str]);
        }
    else {
        push(@tokens, [$kind, $match]);
        }
    }
foreach my $lItems (@tokens) {
    print("$lItems->[0]: $lItems->[1]\n");
    }

which prints out the following:

IDENT: if
OTHER: (
IDENT: x
OP: ==
STRING: that
OTHER: )
IDENT: then
IDENT: x
OP: =
INT: -23
IDENT: and
IDENT: y
OP: =
STRING: say 'hi' for me
OTHER: ;

It's kind of contrived, but you'll notice that the quotes (actually, apostrophes) around strings are stripped off (also, consecutive quotes are collapsed to single quotes), so in general, only the $kind variable will tell you whether the parser saw an identifier or a quoted string.
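
Applied to the A/B/C pattern from the question, the same idea can be sketched without a separate variable by reading $^R, which holds the value of the last (?{...}) block on the successful path. This assumes Perl 5.10 or later (older versions had problems with $^R); $TAGGED and tagged_match are hypothetical names, and the (?: ... ) wrapper around the alternation is my own addition.

```perl
use strict;
use warnings;

# Each branch ends in a code block; the value of the block on the
# successful path is left in $^R.
my $TAGGED = qr/^(?:A(\d+)(?{ 'A' })|B(\d+)(?{ 'B' })|C(\d+)(?{ 'C' }))$/;

# Hypothetical helper: returns (branch label, digits) or an empty list.
sub tagged_match {
    my ($text) = @_;
    return unless $text =~ $TAGGED;

    my $branch = $^R;
    # Only the group in the branch that matched is defined.
    my ($digits) = grep { defined $_ } ( $1, $2, $3 );
    return ( $branch, $digits );
}

for my $input (qw( A123 C769 D000 )) {
    my ( $branch, $digits ) = tagged_match($input);
    if ( defined $branch ) {
        print "$input: branch $branch, digits $digits\n";
    }
    else {
        print "$input: no match\n";
    }
}
```

Because $^R is restored when the regex engine backtracks out of a branch, it should only be read after a successful match, as the helper does.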

关于从前 2024-12-21 16:09:46

Why not use /^ (?<prefix> A|B|C) (?<digits> \d+) $/x? Note that the named capture groups are used for clarity and are not essential.
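
A minimal sketch of how the named groups could be read back through the %+ hash, assuming Perl 5.10 or later; split_token is a hypothetical helper name used only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (prefix, digits) or an empty list.
sub split_token {
    my ($text) = @_;
    if ( $text =~ /^ (?<prefix> A|B|C ) (?<digits> \d+ ) $/x ) {
        # %+ maps each named group to the text it captured.
        return ( $+{prefix}, $+{digits} );
    }
    return;
}

for my $input (qw( A123 B456 Z9 )) {
    my ( $prefix, $digits ) = split_token($input);
    if ( defined $prefix ) {
        print "$input: prefix $prefix, digits $digits\n";
    }
    else {
        print "$input: no match\n";
    }
}
```

The prefix then identifies which alternative matched, so it can drive a lookup table or an if/elsif chain.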

烟燃烟灭 2024-12-21 16:09:44

You don't need to code up your own state machine to combine regexes. Look into Regexp::Assemble. It has methods that track which of your initial patterns matched.

Edit:

use strict;
use warnings;

use 5.012;

use Regexp::Assemble;

my $string = 'A123';

my $re = Regexp::Assemble->new(track => 1);
for my $pattern (qw/ A(\d+) B(\d+) C(\d+) /) {
  $re->add($pattern);
}

say $re->re; ### (?-xism:(?:A(\d+)(?{0})|B(\d+)(?{2})|C(\d+)(?{1})))
say for $re->match($string); ### A(\d+)
say for $re->capture; ### 123