Retaining captures with the Perl substitution operator

Posted 2024-10-28 05:14:52

Can someone explain why the following code...

#!/opt/local/bin/perl
use strict;
use warnings;

my $string;

$string = "\t\t\tEntry";
print "String: >$string<\n";

$string =~ s/^(\t*)//gi;

print "\$1: >$1<\n";
print "String: >$string<\n";
print "\n";

$string = "\t\t\tEntry";

$string =~ s/^(\t*)([^\t]+)/$2/gi;

print "\$1: >$1<\n";
print "String: >$string<\n";
print "\n";

exit 0;

...produces the following output...

String: >           Entry<
Use of uninitialized value in concatenation (.) or string at ~/sandbox.pl line 12.
$1: ><
String: >Entry<

$1: >           <
String: >Entry<

...or more directly: Why is the matched value in the first substitution not retained in $1?

Comments (2)

白龙吟 2024-11-04 05:14:52

I tried this on two implementations of Perl 5.12 and did not encounter the problem. Perl 5.8 did show it.

Because you have the g option, perl keeps trying to match the pattern until it fails. In 5.8, that final failed attempt appears to be what leaves $1 undefined. See the debug output below.

So the original code doesn't work in Perl 5.8, but this does:

my $c1;
$string =~ s/^(\t*)/$c1=$1;''/ge;

Thus each time it matches, it saves it to $c1.
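
For reference, the workaround above can be run as a complete script. This is only a minimal sketch: the variable names follow the snippet, and the two print statements are added purely for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $string = "\t\t\tEntry";
my $c1;

# With /e the replacement is evaluated as Perl code on each successful match:
# copy $1 into a lexical, then return '' as the replacement text.
$string =~ s/^(\t*)/$c1 = $1; ''/ge;

printf "captured %d tab(s)\n", length $c1;
print "String: >$string<\n";
```

Because the copy happens while the match is still current, a later failed attempt under /g has no chance to affect the saved value.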

This is what use re 'debug' tells me:

Compiling REx `^(\t*)'
size 9 Got 76 bytes for offset annotations.
first at 2
   1: BOL(2)
   2: OPEN1(4)
   4:   STAR(7)
   5:     EXACT <\t>(0)
   7: CLOSE1(9)
   9: END(0)
anchored(BOL) minlen 0
Offsets: [9]
        1[1] 2[1] 0[0] 5[1] 3[1] 0[0] 6[1] 0[0] 7[0]
Compiling REx `^(\t*)([^\t]+)'
size 25 Got 204 bytes for offset annotations.
first at 2
   1: BOL(2)
   2: OPEN1(4)
   4:   STAR(7)
   5:     EXACTF <\t>(0)
   7: CLOSE1(9)
   9: OPEN2(11)
  11:   PLUS(23)
  12:     ANYOF[\0-\10\12-\377{unicode_all}](0)
  23: CLOSE2(25)
  25: END(0)
anchored(BOL) minlen 1
Offsets: [25]
        1[1] 2[1] 0[0] 5[1] 3[1] 0[0] 6[1] 0[0] 7[1] 0[0] 13[1] 8[5] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 0[0] 14[1] 0[0] 15[0]
String: >                       Entry<
Matching REx `^(\t*)' against `                 Entry'
  Setting an EVAL scope, savestack=5
   0 <> <                       Entry>        |  1:  BOL
   0 <> <                       Entry>        |  2:  OPEN1
   0 <> <                       Entry>        |  4:  STAR
                           EXACT <\t> can match 3 times out of 2147483647...
  Setting an EVAL scope, savestack=5
   3 <                  > <Entry>        |  7:    CLOSE1
   3 <                  > <Entry>        |  9:    END
Match successful!
match pos=0
Use of uninitialized value in substitution iterator at - line 11.
Matching REx `^(\t*)' against `Entry'
  Setting an EVAL scope, savestack=5
   3 <                  > <Entry>        |  1:  BOL
                            failed...
Match failed
Freeing REx: `"^(\\t*)"'
Freeing REx: `"^(\\t*)([^\\t]+)"'

Because you are trying to match whitespace at the beginning of the line, you need neither the g nor the i. So it might be a case where you're trying to do something else.
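
A sketch of that suggestion, assuming the goal is simply to strip the leading tabs and keep them for later: without /g there is only one match attempt, so there is no subsequent failed attempt that could disturb $1. Checking the return value of s/// before reading $1 is a general precaution. The name $indent is made up for this example.

```perl
use strict;
use warnings;

my $string = "\t\t\tEntry";
my $indent = '';

# No /g: a single match attempt. No /i: tabs have no case.
if ($string =~ s/^(\t*)//) {
    $indent = $1;   # copy right away, before any other regex runs
}

printf "removed %d tab(s)\n", length $indent;
print "String: >$string<\n";
```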

铃予 2024-11-04 05:14:52

I think that in version 5.10 and beyond, the capture buffers are only affected if there was a match.
The interesting thing in your example is that $string =~ s/^(\t*)([^\t]+)/$2/gi;
didn't reset the capture buffers. That appears to be because of a preamble that estimates
whether the match should be tried at all. In this case, ([^\t]+) consumed the entire string in the first
match, so a "String too short" condition occurred and the buffers were never reset.

I can't test it, but $string =~ s/^(\t*)([^\t])//gi should give the same warning.
With if ( s///g ) {}, the capture buffers tested afterwards are not certain to contain
anything. This was the case in version 5.8. Even in later versions it's really just a debug feature.

Edit @theracoon - on your comment: "I'm reasonably certain that ([^\t]+) did not actually consume the entire string. The output definitely does not reflect that."

This is proof that your regex consumed the entire string on the first match (Pass 1).
There is nothing left to match on the second pass. That is how the /g modifier works:
it tries to match the entire regex again, at the position in the string where the last match left off.

use re 'debug';
$string = "\t\t\tEntry";
$string =~ s/^(\t*)([^\t]+)/$2/gi;
print "'$string'\n";

Pass 1 ..
Matching REx "^(\t*)([^\t]+)" against "%t%t%tEntry"
8 <%t%t%tEntry> <>
Match successful!

Pass 2 ..
Matching REx "^(\t*)([^\t]+)" against ""
(Nope, nothing left to match)
String too short [regexec_flags]...
Match failed
'Entry'
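
The "resume where the last match left off" behaviour of /g can also be observed without the debugger. The following is a small illustrative sketch, not part of the original test: it uses a plain match in scalar context rather than a substitution, because pos() is easy to inspect there. The sample text and the @resume_at array are invented for the example.

```perl
use strict;
use warnings;

my $text = "ab cd";
my @resume_at;

# In scalar context, each /g match starts where the previous one ended;
# pos() reports that offset after every successful match.
while ($text =~ /(\w+)/g) {
    push @resume_at, pos($text);
    print "matched '$1', next attempt starts at offset $resume_at[-1]\n";
}
```

The recorded offsets are 2 and 5. The attempt made from offset 5 finds nothing left to match and ends the loop, which corresponds to Pass 2 in the trace above.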
