Perl 中的正则表达式组:如何将与字符串中未知数量/多个/变量出现次数匹配的正则表达式组中的元素捕获到数组中?

发布于 2024-09-14 02:09:51 字数 2438 浏览 9 评论 0原文

在 Perl 中,如何使用一个正则表达式分组来捕获多个与其匹配的匹配项,并将其放入多个数组元素中?

例如,对于字符串:

var1=100 var2=90 var5=hello var3="a, b, c" var7=test var3=hello

用代码处理它:

$string = "var1=100 var2=90 var5=hello var3=\"a, b, c\" var7=test var3=hello";

my @array = $string =~ <regular expression here>

for ( my $i = 0; $i < scalar( @array ); $i++ )
{
  print $i.": ".$array[$i]."\n";
}

我想看到输出:

0: var1=100
1: var2=90
2: var5=hello
3: var3="a, b, c"
4: var7=test
5: var3=hello

我将使用什么作为正则表达式?

我想要在此处匹配的事物之间的共性是分配字符串模式,因此类似于:

my @array = $string =~ m/(\w+=[\w\"\,\s]+)*/;

其中 * 表示与该组匹配的一个或多个匹配项。

(我不使用 split() ,因为某些匹配本身包含空格(即 var3...),因此不会给出所需的结果。)

使用上面的正则表达式,我只得到:

0: var1=100 var2

在正则表达式中是否可能?或者需要添加代码吗?

在搜索“perl regex multiple group”但没有足够的线索时,已经查看了现有的答案:

In Perl, how can I use one regex grouping to capture more than one occurrence that matches it, into several array elements?

For example, for a string:

var1=100 var2=90 var5=hello var3="a, b, c" var7=test var3=hello

to process this with code:

$string = "var1=100 var2=90 var5=hello var3=\"a, b, c\" var7=test var3=hello";

my @array = $string =~ <regular expression here>

for ( my $i = 0; $i < scalar( @array ); $i++ )
{
  print $i.": ".$array[$i]."\n";
}

I would like to see as output:

0: var1=100
1: var2=90
2: var5=hello
3: var3="a, b, c"
4: var7=test
5: var3=hello

What would I use as a regex?

The commonality between things I want to match here is an assignment string pattern, so something like:

my @array = $string =~ m/(\w+=[\w\"\,\s]+)*/;

Where the * indicates one or more occurrences matching the group.

(I discounted using a split() as some matches contain spaces within themselves (i.e. var3...) and would therefore not give desired results.)

With the above regex, I only get:

0: var1=100 var2

Is it possible in a regex? Or addition code required?

Looked at existing answers already, when searching for "perl regex multiple group" but not enough clues:

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(9

假扮的天使 2024-09-21 02:09:51
my $string = "var1=100 var2=90 var5=hello var3=\"a, b, c\" var7=test var3=hello";

while($string =~ /(?:^|\s+)(\S+)\s*=\s*("[^"]*"|\S*)/g) {
        print "<$1> => <$2>\n";
}

打印:

<var1> => <100>
<var2> => <90>
<var5> => <hello>
<var3> => <"a, b, c">
<var7> => <test>
<var3> => <hello>

解释:

最后一块先:末尾的 g 标志意味着您可以多次将正则表达式应用于字符串。第二次它将继续匹配字符串中最后一次匹配结束的位置。

现在对于正则表达式: (?:^|\s+) 匹配字符串的开头或一组一个或多个空格。这是必需的,因此下次应用正则表达式时,我们将跳过键/值对之间的空格。 ?: 意味着括号内容不会被捕获为组(我们不需要空格,只需要键和值)。 \S+ 与变量名称匹配。然后我们跳过任意数量的空格和中间的等号。最后, ("[^"]*"|\S*)/ 匹配两个引号及其间任意数量的字符,或任意数量的非空格字符的值。请注意引号匹配非常脆弱,无法正确处理转义引号,例如 "\"quoted\"" 会导致 "\"

编辑:

因为您确实想要这样做 。获取整个分配,而不是单个键/值,这是提取这些的单行代码:

my @list = $string =~ /(?:^|\s+)((?:\S+)\s*=\s*(?:"[^"]*"|\S*))/g;
my $string = "var1=100 var2=90 var5=hello var3=\"a, b, c\" var7=test var3=hello";

while($string =~ /(?:^|\s+)(\S+)\s*=\s*("[^"]*"|\S*)/g) {
        print "<$1> => <$2>\n";
}

Prints:

<var1> => <100>
<var2> => <90>
<var5> => <hello>
<var3> => <"a, b, c">
<var7> => <test>
<var3> => <hello>

Explanation:

Last piece first: the g flag at the end means that you can apply the regex to the string multiple times. The second time it will continue matching where the last match ended in the string.

Now for the regex: (?:^|\s+) matches either the beginning of the string or a group of one or more spaces. This is needed so when the regex is applied next time, we will skip the spaces between the key/value pairs. The ?: means that the parentheses content won't be captured as group (we don't need the spaces, only key and value). \S+ matches the variable name. Then we skip any amount of spaces and an equal sign in between. Finally, ("[^"]*"|\S*)/ matches either two quotes with any amount of characters in between, or any amount of non-space characters for the value. Note that the quote matching is pretty fragile and won't handle escpaped quotes properly, e.g. "\"quoted\"" would result in "\".

EDIT:

Since you really want to get the whole assignment, and not the single keys/values, here's a one-liner that extracts those:

my @list = $string =~ /(?:^|\s+)((?:\S+)\s*=\s*(?:"[^"]*"|\S*))/g;
冷︶言冷语的世界 2024-09-21 02:09:51

对于正则表达式,使用一种我喜欢称之为粘着和拉伸的技术:锚定您知道将在那里的功能(粘着),然后抓住之间的内容(拉伸)。

在这种情况下,您知道单个分配匹配

\b\w+=.+

,并且在 $string 中重复了许多这样的分配。请记住 \b 表示单词边界:

字边界 (\b) 是两个字符之间的一个点,其一侧有一个 \w 和一个 \W > 在其另一侧(以任一顺序),计算字符串开头和结尾的虚构字符作为匹配 \W

使用正则表达式描述赋值中的值可能有点棘手,但您也知道每个值都将以空格结尾(尽管不一定是遇到的第一个空格!),后面是另一个赋值或字符串结尾。

为了避免重复断言模式,请使用 qr// 编译一次 并在您的模式中与 look- 一起重用它提前断言 (?=...) 将匹配延伸到足够远以捕获整个值,同时防止它溢出到下一个变量名称。

使用 m//g 与列表上下文中的模式进行匹配会产生以下行为:

/g 修饰符指定全局模式匹配,即在字符串内匹配尽可能多的次数。它的行为方式取决于上下文。在列表上下文中,它返回与正则表达式中的任何捕获括号匹配的子字符串列表。如果没有括号,它将返回所有匹配字符串的列表,就好像整个模式周围有括号一样。

一旦前瞻看到另一个赋值或行尾,模式 $assignment 使用非贪婪的 .+? 来切断该值。请记住,匹配返回来自所有捕获子模式的子字符串,因此前瞻的交替使用非捕获(?:...)。相反,qr// 包含隐式捕获括号。

#! /usr/bin/perl

use warnings;
use strict;

my $string = <<'EOF';
var1=100 var2=90 var5=hello var3="a, b, c" var7=test var3=hello
EOF

my $assignment = qr/\b\w+ = .+?/x;
my @array = $string =~ /$assignment (?= \s+ (?: $ | $assignment))/gx;

for ( my $i = 0; $i < scalar( @array ); $i++ )
{
  print $i.": ".$array[$i]."\n";
}

输出:

0: var1=100
1: var2=90
2: var5=hello
3: var3="a, b, c"
4: var7=test
5: var3=hello

With regular expressions, use a technique that I like to call tack-and-stretch: anchor on features you know will be there (tack) and then grab what's between (stretch).

In this case, you know that a single assignment matches

\b\w+=.+

and you have many of these repeated in $string. Remember that \b means word boundary:

A word boundary (\b) is a spot between two characters that has a \w on one side of it and a \W on the other side of it (in either order), counting the imaginary characters off the beginning and end of the string as matching a \W.

The values in the assignments can be a little tricky to describe with a regular expression, but you also know that each value will terminate with whitespace—although not necessarily the first whitespace encountered!—followed by either another assignment or end-of-string.

To avoid repeating the assertion pattern, compile it once with qr// and reuse it in your pattern along with a look-ahead assertion (?=...) to stretch the match just far enough to capture the entire value while also preventing it from spilling into the next variable name.

Matching against your pattern in list context with m//g gives the following behavior:

The /g modifier specifies global pattern matching—that is, matching as many times as possible within the string. How it behaves depends on the context. In list context, it returns a list of the substrings matched by any capturing parentheses in the regular expression. If there are no parentheses, it returns a list of all the matched strings, as if there were parentheses around the whole pattern.

The pattern $assignment uses non-greedy .+? to cut off the value as soon as the look-ahead sees another assignment or end-of-line. Remember that the match returns the substrings from all capturing subpatterns, so the look-ahead's alternation uses non-capturing (?:...). The qr//, in contrast, contains implicit capturing parentheses.

#! /usr/bin/perl

use warnings;
use strict;

my $string = <<'EOF';
var1=100 var2=90 var5=hello var3="a, b, c" var7=test var3=hello
EOF

my $assignment = qr/\b\w+ = .+?/x;
my @array = $string =~ /$assignment (?= \s+ (?: $ | $assignment))/gx;

for ( my $i = 0; $i < scalar( @array ); $i++ )
{
  print $i.": ".$array[$i]."\n";
}

Output:

0: var1=100
1: var2=90
2: var5=hello
3: var3="a, b, c"
4: var7=test
5: var3=hello
吃不饱 2024-09-21 02:09:51

我并不是说这是您应该做的,但是您想要做的是编写语法。现在,您的示例对于语法来说非常简单,但是Damian Conway的模块 Regexp::Grammars 在这方面真的很棒。如果你必须种植它,你会发现它会让你的生活变得更轻松。我在这里经常使用它 - 它有点类似于 perl6。

use Regexp::Grammars;
use Data::Dumper;
use strict;
use warnings;

my $parser = qr{
    <[pair]>+
    <rule: pair>     <key>=(?:"<list>"|<value=literal>)
    <token: key>     var\d+
    <rule: list>     <[MATCH=literal]> ** (,)
    <token: literal> \S+

}xms;

q[var1=100 var2=90 var5=hello var3="a, b, c" var7=test var3=hello] =~ $parser;
die Dumper {%/};

输出:

$VAR1 = {
          '' => 'var1=100 var2=90 var5=hello var3="a, b, c" var7=test var3=hello',
          'pair' => [
                      {
                        '' => 'var1=100',
                        'value' => '100',
                        'key' => 'var1'
                      },
                      {
                        '' => 'var2=90',
                        'value' => '90',
                        'key' => 'var2'
                      },
                      {
                        '' => 'var5=hello',
                        'value' => 'hello',
                        'key' => 'var5'
                      },
                      {
                        '' => 'var3="a, b, c"',
                        'key' => 'var3',
                        'list' => [
                                    'a',
                                    'b',
                                    'c'
                                  ]
                      },
                      {
                        '' => 'var7=test',
                        'value' => 'test',
                        'key' => 'var7'
                      },
                      {
                        '' => 'var3=hello',
                        'value' => 'hello',
                        'key' => 'var3'
                      }
                    ]

I'm not saying this is what you should do, but what you're trying to do is write a Grammar. Now your example is very simple for a Grammar, but Damian Conway's module Regexp::Grammars is really great at this. If you have to grow this at all, you'll find it will make your life much easier. I use it quite a bit here - it is kind of perl6-ish.

use Regexp::Grammars;
use Data::Dumper;
use strict;
use warnings;

my $parser = qr{
    <[pair]>+
    <rule: pair>     <key>=(?:"<list>"|<value=literal>)
    <token: key>     var\d+
    <rule: list>     <[MATCH=literal]> ** (,)
    <token: literal> \S+

}xms;

q[var1=100 var2=90 var5=hello var3="a, b, c" var7=test var3=hello] =~ $parser;
die Dumper {%/};

Output:

$VAR1 = {
          '' => 'var1=100 var2=90 var5=hello var3="a, b, c" var7=test var3=hello',
          'pair' => [
                      {
                        '' => 'var1=100',
                        'value' => '100',
                        'key' => 'var1'
                      },
                      {
                        '' => 'var2=90',
                        'value' => '90',
                        'key' => 'var2'
                      },
                      {
                        '' => 'var5=hello',
                        'value' => 'hello',
                        'key' => 'var5'
                      },
                      {
                        '' => 'var3="a, b, c"',
                        'key' => 'var3',
                        'list' => [
                                    'a',
                                    'b',
                                    'c'
                                  ]
                      },
                      {
                        '' => 'var7=test',
                        'value' => 'test',
                        'key' => 'var7'
                      },
                      {
                        '' => 'var3=hello',
                        'value' => 'hello',
                        'key' => 'var3'
                      }
                    ]
意中人 2024-09-21 02:09:51

也许有点过头了,但这是我研究 http://p3rl.org/Parse 的借口: :RecDescent。制作一个解析器怎么样?

#!/usr/bin/perl

use strict;
use warnings;

use Parse::RecDescent;

use Regexp::Common;

my $grammar = <<'_EOGRAMMAR_'
INTEGER: /[-+]?\d+/
STRING: /\S+/
QSTRING: /$Regexp::Common::RE{quoted}/

VARIABLE: /var\d+/
VALUE: ( QSTRING | STRING | INTEGER )

assignment: VARIABLE "=" VALUE /[\s]*/ { print "$item{VARIABLE} => $item{VALUE}\n"; }

startrule: assignment(s)
_EOGRAMMAR_
;

$Parse::RecDescent::skip = '';
my $parser = Parse::RecDescent->new($grammar);

my $code = q{var1=100 var2=90 var5=hello var3="a, b, c" var7=test var8=" haha \" heh " var3=hello};
$parser->startrule($code);

产量:

var1 => 100
var2 => 90
var5 => hello
var3 => "a, b, c"
var7 => test
var8 => " haha \" heh "
var3 => hello

PS。请注意 double var3,如果您希望后一个分配覆盖第一个分配,您可以使用哈希来存储值,然后在以后使用它们。

聚苯硫醚。我的第一个想法是在“=”上进行分割,但是如果字符串包含“=”,那么就会失败,并且由于正则表达式几乎总是不利于解析,所以我最终尝试了一下并且它有效。

编辑:添加了对带引号的字符串内转义引号的支持。

A bit over the top maybe, but an excuse for me to look into http://p3rl.org/Parse::RecDescent. How about making a parser?

#!/usr/bin/perl

use strict;
use warnings;

use Parse::RecDescent;

use Regexp::Common;

my $grammar = <<'_EOGRAMMAR_'
INTEGER: /[-+]?\d+/
STRING: /\S+/
QSTRING: /$Regexp::Common::RE{quoted}/

VARIABLE: /var\d+/
VALUE: ( QSTRING | STRING | INTEGER )

assignment: VARIABLE "=" VALUE /[\s]*/ { print "$item{VARIABLE} => $item{VALUE}\n"; }

startrule: assignment(s)
_EOGRAMMAR_
;

$Parse::RecDescent::skip = '';
my $parser = Parse::RecDescent->new($grammar);

my $code = q{var1=100 var2=90 var5=hello var3="a, b, c" var7=test var8=" haha \" heh " var3=hello};
$parser->startrule($code);

yields:

var1 => 100
var2 => 90
var5 => hello
var3 => "a, b, c"
var7 => test
var8 => " haha \" heh "
var3 => hello

PS. Note the double var3, if you want the latter assignment to overwrite the first one you can use a hash to store the values, and then use them later.

PPS. My first thought was to split on '=' but that would fail if a string contained '=' and since regexps are almost always bad for parsing, well I ended up trying it out and it works.

Edit: Added support for escaped quotes inside quoted strings.

风渺 2024-09-21 02:09:51

我最近不得不解析 x509 证书的“主题”行。它们的形式与您提供的形式类似:

echo 'Subject: C=HU, L=Budapest, O=Microsec Ltd., CN=Microsec e-Szigno Root CA 2009/[email protected]' | \
  perl -wne 'my @a = m/(\w+\=.+?)(?=(?:, \w+\=|$))/g; print "$_\n" foreach @a;'

C=HU
L=Budapest
O=Microsec Ltd.
CN=Microsec e-Szigno Root CA 2009/[email protected]

正则表达式的简短描述:

(\w+\=.+?) - 在非贪婪模式下捕获后跟“=”的单词以及任何后续符号< br>
(?=(?:, \w+\=|$)) - 后面跟着另一个 、KEY=val 或行尾。

使用的正则表达式有趣的部分是:

  • .+? - 非贪婪模式
  • (?:pattern) - 非捕获模式
  • (?=pattern) 零宽度正前瞻断言

I've recently had to parse x509 certificates "Subject" lines. They had similar form to the one you have provided:

echo 'Subject: C=HU, L=Budapest, O=Microsec Ltd., CN=Microsec e-Szigno Root CA 2009/[email protected]' | \
  perl -wne 'my @a = m/(\w+\=.+?)(?=(?:, \w+\=|$))/g; print "$_\n" foreach @a;'

C=HU
L=Budapest
O=Microsec Ltd.
CN=Microsec e-Szigno Root CA 2009/[email protected]

Short description of the regex:

(\w+\=.+?) - capture words followed by '=' and any subsequent symbols in non greedy mode
(?=(?:, \w+\=|$)) - which are followed by either another , KEY=val or end of line.

The interesting part of the regex used are:

  • .+? - Non greedy mode
  • (?:pattern) - Non capturing mode
  • (?=pattern) zero-width positive look-ahead assertion
林空鹿饮溪 2024-09-21 02:09:51

这将为您提供双引号中的常见转义,例如 var3="a, \"b, c"。

@a = /(\w+=(?:\w+|"(?:[^\\"]*(?:\\.[^\\"]*)*)*"))/g;

实际操作:

echo 'var1=100 var2=90 var42="foo\"bar\\" var5=hello var3="a, b, c" var7=test var3=hello' |
perl -nle '@a = /(\w+=(?:\w+|"(?:[^\\"]*(?:\\.[^\\"]*)*)*"))/g; $,=","; print @a'
var1=100,var2=90,var42="foo\"bar\\",var5=hello,var3="a, b, c",var7=test,var3=hello

This one will provide you also common escaping in double-quotes as for example var3="a, \"b, c".

@a = /(\w+=(?:\w+|"(?:[^\\"]*(?:\\.[^\\"]*)*)*"))/g;

In action:

echo 'var1=100 var2=90 var42="foo\"bar\\" var5=hello var3="a, b, c" var7=test var3=hello' |
perl -nle '@a = /(\w+=(?:\w+|"(?:[^\\"]*(?:\\.[^\\"]*)*)*"))/g; $,=","; print @a'
var1=100,var2=90,var42="foo\"bar\\",var5=hello,var3="a, b, c",var7=test,var3=hello
楠木可依 2024-09-21 02:09:51

您要求提供正则表达式解决方案或其他代码。这是一个(大部分)仅使用核心模块的非正则表达式解决方案。唯一的正则表达式是 \s+ 来确定分隔符;在这种情况下,有一个或多个空格。

use strict; use warnings;
use Text::ParseWords;
my $string="var1=100 var2=90 var5=hello var3=\"a, b, c\" var7=test var3=hello";  

my @array = quotewords('\s+', 0, $string);

for ( my $i = 0; $i < scalar( @array ); $i++ )
{
    print $i.": ".$array[$i]."\n";
}

或者您可以执行代码 HERE

输出为:

0: var1=100
1: var2=90
2: var5=hello
3: var3=a, b, c
4: var7=test
5: var3=hello

如果您确实想要正则表达式解决方案,Alan Moore 的 评论链接到他在 IDEone 上的代码真是太棒了!

You asked for a RegEx solution or other code. Here is a (mostly) non regex solution using only core modules. The only regex is \s+ to determine the delimiter; in this case one or more spaces.

use strict; use warnings;
use Text::ParseWords;
my $string="var1=100 var2=90 var5=hello var3=\"a, b, c\" var7=test var3=hello";  

my @array = quotewords('\s+', 0, $string);

for ( my $i = 0; $i < scalar( @array ); $i++ )
{
    print $i.": ".$array[$i]."\n";
}

Or you can execute the code HERE

The output is:

0: var1=100
1: var2=90
2: var5=hello
3: var3=a, b, c
4: var7=test
5: var3=hello

If you really want a regex solution, Alan Moore's comment linking to his code on IDEone is the gas!

書生途 2024-09-21 02:09:51
#!/usr/bin/perl

use strict; use warnings;

use Text::ParseWords;
use YAML;

my $string =
    "var1=100 var2=90 var5=hello var3=\"a, b, c\" var7=test var3=hello";

my @parts = shellwords $string;
print Dump \@parts;

@parts = map { { split /=/ } } @parts;

print Dump \@parts;
#!/usr/bin/perl

use strict; use warnings;

use Text::ParseWords;
use YAML;

my $string =
    "var1=100 var2=90 var5=hello var3=\"a, b, c\" var7=test var3=hello";

my @parts = shellwords $string;
print Dump \@parts;

@parts = map { { split /=/ } } @parts;

print Dump \@parts;
一世旳自豪 2024-09-21 02:09:51

可以使用正则表达式来做到这一点,但它很脆弱。

my $string = "var1=100 var2=90 var5=hello var3=\"a, b, c\" var7=test var3=hello";

my $regexp = qr/( (?:\w+=[\w\,]+) | (?:\w+=\"[^\"]*\") )/x;
my @matches = $string =~ /$regexp/g;

It is possible to do this with regexes, however it's fragile.

my $string = "var1=100 var2=90 var5=hello var3=\"a, b, c\" var7=test var3=hello";

my $regexp = qr/( (?:\w+=[\w\,]+) | (?:\w+=\"[^\"]*\") )/x;
my @matches = $string =~ /$regexp/g;
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文