In Perl, is substr or splitting into an array faster for accessing individual characters of a string?

Posted on 2024-09-28 13:04:11

I'm writing a Perl script in which I need to loop over each character of a string. There are a lot of strings, and each is 100 characters long (they're short DNA sequences, in case you're wondering).

So, is it faster to use substr to extract each character one at a time, or is it faster to split the string into an array and then iterate over the array?

While I'm waiting for an answer, I suppose I'll go read up on how to benchmark things in Perl.

Comments (3)

无风消散 2024-10-05 13:04:11

It really depends on exactly what you're doing with your data -- but hey, you're headed the right way with your last question! Don't guess, benchmark.

Perl provides the Benchmark module for exactly this kind of thing, and using it is really pretty straightforward. Here's a little sample code to get started with:

#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $dna;
$dna .= [qw(G A T C)]->[rand 4] for 1 .. 100;

sub frequency_substr {
  my $length = length $dna;
  my %hist;

  for my $pos (0 .. $length - 1) {  # last valid index is length - 1
    $hist{$pos}{substr $dna, $pos, 1} ++;
  }

  \%hist;
}

sub frequency_split {
  my %hist;
  my $pos = 0;
  for my $char (split //, $dna) {
    $hist{$pos ++}{$char} ++;
  }

  \%hist;
}

sub frequency_regmatch {
  my %hist;

  while ($dna =~ /(.)/g) {
    # pos() is the offset just past the match, so these keys run 1..100, not 0..99
    $hist{pos($dna)}{$1} ++;
  }

  \%hist;
}


cmpthese(-5, # Run each for at least 5 seconds
  { 
    substr => \&frequency_substr,
    split => \&frequency_split,
    regex => \&frequency_regmatch
  }
);

And a sample result:

         Rate  regex  split substr
regex  6254/s     --   -26%   -32%
split  8421/s    35%     --    -9%
substr 9240/s    48%    10%     --

Turns out substr is surprisingly fast. :)
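
Before trusting numbers like these, it's worth checking that the variants being timed actually compute the same thing. Here's a minimal sketch of such a check. The helper names (hist_substr, hist_split, hist_regex, flatten) are made up for illustration, and I've assumed 0-based positions everywhere, which means subtracting 1 from pos() in the regex version.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helpers: each returns { position => { base => count } },
# using 0-based positions so the three results are directly comparable.
sub hist_substr {
    my ($dna) = @_;
    my %hist;
    for my $pos ( 0 .. length($dna) - 1 ) {
        $hist{$pos}{ substr( $dna, $pos, 1 ) }++;
    }
    return \%hist;
}

sub hist_split {
    my ($dna) = @_;
    my %hist;
    my $pos = 0;
    for my $char ( split //, $dna ) {
        $hist{ $pos++ }{$char}++;
    }
    return \%hist;
}

sub hist_regex {
    my ($dna) = @_;
    my %hist;
    while ( $dna =~ /(.)/g ) {
        # pos() is the offset just past the match, so subtract 1
        $hist{ pos($dna) - 1 }{$1}++;
    }
    return \%hist;
}

# Flatten a histogram into a sorted string so two can be compared with eq.
sub flatten {
    my ($hist) = @_;
    my @parts;
    for my $pos ( sort { $a <=> $b } keys %$hist ) {
        for my $base ( sort keys %{ $hist->{$pos} } ) {
            push @parts, sprintf( '%d:%s=%d', $pos, $base, $hist->{$pos}{$base} );
        }
    }
    return join ',', @parts;
}

my $dna  = 'GATTACA';
my @flat = map { my $f = $_; flatten( $f->($dna) ) }
    ( \&hist_substr, \&hist_split, \&hist_regex );

print $flat[0] eq $flat[1] && $flat[1] eq $flat[2]
    ? "all three agree\n"
    : "results differ\n";
```

If the variants disagree, the benchmark is comparing different amounts of work, so fix that before reading anything into the rates.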

半枫 2024-10-05 13:04:11

Here is what I would do instead of first trying to choose between substr and split:

#!/usr/bin/perl

use strict; use warnings;

my %dist;
while ( my $s = <> ) {
    while ( $s =~ /(.)/g ) {
        ++ $dist{ pos($s) }{ $1 };
    }
}

Update:

My curiosity got the best of me. Here is a benchmark:

#!/usr/bin/perl

use strict; use warnings;
use Benchmark qw( cmpthese );

my @chars = qw(A C G T);
my @to_split = my @to_substr = my @to_match = map {
    join '', map $chars[rand @chars], 1 .. 100
} 1 .. 1_000;

cmpthese -1, {
    'split'  => \&bench_split,
    'substr' => \&bench_substr,
    'match'  => \&bench_match,
};

sub bench_split {
    my %dist;
    for my $s ( @to_split ) {
        my @s = split //, $s;
        for my $i ( 0 .. $#s ) {
            ++ $dist{ $i }{ $s[$i] };
        }
    }
}

sub bench_substr {
    my %dist;
    for my $s ( @to_substr ) {
        my $u = length($s) - 1;
        for my $i (0 .. $u) {
            ++ $dist{ $i }{ substr($s, $i, 1) };
        }
    }
}

sub bench_match {
    my %dist;
    for my $s ( @to_match ) {
        while ( $s =~ /(.)/g ) {
            ++ $dist{ pos($s) }{ $1 };
        }
    }
}

Output:

         Rate  split  match substr
split  4.93/s     --   -31%   -65%
match  7.11/s    44%     --   -49%
substr 14.0/s   184%    97%     --

萌梦深 2024-10-05 13:04:11

I have an example in Mastering Perl dealing with this problem. Do you want to create a bunch of individual scalars, each of which carries around the memory overhead of a Perl scalar, or store everything in a single string to reduce memory but maybe do more work? You say that you have a lot of these, so leaving them as single strings might work out much better for you if you are worried about memory.

Mastering Perl also has a couple of chapters dealing with benchmarking and profiling, if you're curious about those.

Ether says to get it working first and worry about the rest later. Part of that is hiding the operations behind a task-oriented interface. A nice object-oriented module can do that for you. If you don't like the implementation, you change it. However, the programs at the higher level don't have to change because the interface stays the same.
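
As a rough illustration of what such an interface might look like, here is a small sketch. The class name DNA::Sequence and its methods size and each_base are hypothetical, invented for this example and not taken from Mastering Perl or any existing module. The sequence is kept as a single string internally, and callers only ever see positions and bases.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical class: the name and methods are illustrative only.
package DNA::Sequence;

sub new {
    my ( $class, $string ) = @_;
    # Stored as one string to keep memory low; callers never see this detail.
    return bless { seq => $string }, $class;
}

sub size {
    my ($self) = @_;
    return length $self->{seq};
}

# Call $code->($position, $base) for every base, in order (0-based positions).
sub each_base {
    my ( $self, $code ) = @_;
    # Implementation detail: substr today; split or a regex could replace it
    # later without any caller changing.
    for my $pos ( 0 .. $self->size() - 1 ) {
        $code->( $pos, substr( $self->{seq}, $pos, 1 ) );
    }
    return;
}

package main;

my %count;
DNA::Sequence->new('GATTACA')->each_base(
    sub {
        my ( $pos, $base ) = @_;
        $count{$base}++;
    }
);

# prints "A=3 C=1 G=1 T=2"
print join( ' ', map { sprintf '%s=%d', $_, $count{$_} } sort keys %count ), "\n";
```

If profiling later shows that each_base is the bottleneck, only its body has to change.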
