Why is s/^\s+|\s+$//g; so much slower than two separate substitutions?

Posted 2024-08-22 22:34:03 · 327 words · 7 views · 0 comments

The Perl FAQ entry How do I strip blank space from the beginning/end of a string? states that using

s/^\s+|\s+$//g;

is slower than doing it in two steps:

s/^\s+//;
s/\s+$//;

Why is this combined statement noticeably slower than the separate ones (for any input string)?

Comments (4)

残月升风 2024-08-29 22:34:04

Since the two methods are logically equivalent, there's no inherent reason for them to differ in evaluation performance. In practice, however, some engines won't be able to spot optimizations in more complex regexes.

In this case, the combined regex as a whole is unanchored, so it could potentially match at any point in the string. By contrast, ^\s+ is anchored at the start, so it is trivial to match. \s+$ is anchored at the end and is just a single character class tested against each character from the end backwards, so a well-optimized engine can recognize that and match in reverse, which makes it as trivial as a ^\s+ match on the reversed input.
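To make the "logically equivalent" claim concrete, here is a minimal sketch that runs both forms over a few sample strings and reports whether the results agree. The helper names trim_alt and trim_sep and the sample inputs are made up for this illustration; it only compares results and says nothing about speed.

```perl
use strict;
use warnings;

# Hypothetical helper names, used only for this illustration.
sub trim_alt {
    my ($s) = @_;
    $s =~ s/^\s+|\s+$//g;    # single substitution; /g is needed to hit both ends
    return $s;
}

sub trim_sep {
    my ($s) = @_;
    $s =~ s/^\s+//;          # anchored at the start of the string
    $s =~ s/\s+$//;          # anchored at the end of the string
    return $s;
}

# Sample inputs (assumed, not from the original post).
for my $in ("  foo  ", "\t\nfoo bar\n", "foo", "   ", "") {
    my ($alt, $sep) = (trim_alt($in), trim_sep($in));
    printf "%-10s [%s]\n", ($alt eq $sep ? 'same' : 'DIFFERENT'), $alt;
}
```

Since the results are the same, any speed difference has to come from how the engine executes each pattern, not from what the patterns match.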

超可爱的懒熊 2024-08-29 22:34:04

If this is indeed the case, then it would be because the regex engine is able to optimize better for the individual regexes than for the combined one.

What do you mean by "noticeably slower"?

许仙没带伞 2024-08-29 22:34:03

The Perl regex runtime runs much quicker when working with 'fixed' or 'anchored' substrings rather than 'floating' ones. A substring is fixed when you can lock it to a certain place in the source string. Both '^' and '$' provide that anchoring. However, when you use alternation '|', the compiler doesn't recognize the choices as fixed, so it uses less optimized code to scan the whole string. In the end, looking for fixed strings twice is much, much faster than looking for a floating string once. On a related note, reading perl's regcomp.c will make you go blind.

Update:
Here are some additional details. If your perl was compiled with debugging support, you can run it with the '-Dr' flag and it will dump the regex compilation data. Here's what you get:

~# debugperl -Dr -e 's/^\s+//g'
Compiling REx `^\s+'
size 4 Got 36 bytes for offset annotations.
first at 2
synthetic stclass "ANYOF[\11\12\14\15 {unicode_all}]".
   1: BOL(2)
   2: PLUS(4)
   3:   SPACE(0)
   4: END(0)
stclass "ANYOF[\11\12\14\15 {unicode_all}]" anchored(BOL) minlen 1

~# debugperl -Dr -e 's/^\s+|\s+$//g'
Compiling REx `^\s+|\s+$'
size 9 Got 76 bytes for offset annotations.
   1: BRANCH(5)
   2:   BOL(3)
   3:   PLUS(9)
   4:     SPACE(0)
   5: BRANCH(9)
   6:   PLUS(8)
   7:     SPACE(0)
   8:   EOL(9)
   9: END(0)
minlen 1

Note the word 'anchored' in the first dump.
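If your perl was not built with debugging support, the core re pragma can produce a similar dump: `use re 'debug'` prints regex compilation (and execution) details to STDERR. The sketch below is only an illustration. The exact output format varies between perl versions, so it will probably not match the dumps above line for line, and the sample string is made up.

```perl
use strict;
use warnings;
use re 'debug';    # core pragma; debug output goes to STDERR

my $sep = "  hello  ";    # sample input, assumed for this sketch
$sep =~ s/^\s+//;         # compare the dump for this pattern ...
$sep =~ s/\s+$//;

my $alt = "  hello  ";
$alt =~ s/^\s+|\s+$//g;   # ... with the dump for the alternation

print "sep=[$sep] alt=[$alt]\n";
```

The same thing can be done from the command line with `perl -Mre=debug -e 's/^\s+|\s+$//g'`. In the output, check whether the pattern is reported as anchored.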

清旖 2024-08-29 22:34:03

Other answers have indicated that fully anchored regexes let the engine optimize the search, focusing on just the beginning or the end of the string. You can see the effect of this optimization by comparing the speed of the two approaches on strings of various lengths. As the string gets longer, the "floating" regex (the one using alternation) suffers more and more.

use strict;
use warnings;
use Benchmark qw(cmpthese);

my $ws = "   \t\t\n";

for my $sz (1, 10, 100, 1000){
    my $str = $ws . ('Z' x $sz) . $ws;
    cmpthese(-2, {
        "alt_$sz" => sub { $_ = $str; s/^\s+|\s+$//g },
        "sep_$sz" => sub { $_ = $str; s/^\s+//; s/\s+$// },
    });
}

           Rate alt_1 sep_1
alt_1  870578/s    --  -16%
sep_1 1032017/s   19%    --

            Rate alt_10 sep_10
alt_10  384391/s     --   -62%
sep_10 1010017/s   163%     --

            Rate alt_100 sep_100
alt_100  61179/s      --    -92%
sep_100 806840/s   1219%      --

             Rate alt_1000 sep_1000
alt_1000   6612/s       --     -97%
sep_1000 261102/s    3849%       --