Find the number of matching two characters in a string in Perl

Posted on 2024-12-16 22:58:40

Is there a method in Perl (not BioPerl) to find the number of occurrences of each pair of consecutive letters?

I.e., number of AA, AC, AG, AT, CC, CA, ... in a sequence like this:

$sequence = 'AACGTACTGACGTACTGGTTGGTACGA'

PS: We can do it manually with a regular expression, e.g. $GC = ($sequence =~ s/GC/GC/g), which returns the number of GC occurrences in the sequence.

I need an automated and generic way.

娜些时光，永不杰束 2024-12-23 22:58:40

You had me confused for a while, but I take it you want to count the dinucleotides in a given string.

Code:

my @dinucs = qw(AA AC AG CC CA CG);
my %count;
my $sequence = 'AACGTACTGACGTACTGGTTGGTACGA';

for my $dinuc (@dinucs) {
    $count{$dinuc} = ($sequence =~ s/\Q$dinuc\E/$dinuc/g);
}

Output from Data::Dumper:

$VAR1 = {
          "AC" => 5,
          "CC" => "",
          "AG" => "",
          "AA" => 1,
          "CG" => 3,
          "CA" => ""
        };
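
The list of pairs above is typed by hand, and pairs that never occur come back as an empty string rather than 0. As a minimal sketch (assuming the alphabet is just A, C, G and T, which the question implies but does not state), the pairs could be generated and the empty results normalised like this:

```perl
use strict;
use warnings;

my $sequence = 'AACGTACTGACGTACTGGTTGGTACGA';
my @bases    = qw(A C G T);    # assumed alphabet

# Build all 16 two-letter combinations: AA, AC, AG, AT, CA, ...
my @dinucs = map { my $first = $_; map { "$first$_" } @bases } @bases;

my %count;
for my $dinuc (@dinucs) {
    # s///g returns the number of substitutions, or an empty string when
    # nothing matched; "|| 0" turns that empty string into a numeric zero.
    $count{$dinuc} = ($sequence =~ s/\Q$dinuc\E/$dinuc/g) || 0;
}

printf "%s => %d\n", $_, $count{$_} for @dinucs;
```

This keeps the same substitution trick, so it shares its limitation: matches are non-overlapping.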

故事与诗 2024-12-23 22:58:40

Close to TLP's answer, but without substitution:

my $sequence = 'AACGTACTGACGTACTGGTTGGTACGA';
my @dinucs = qw(AA AC AG AT CC CG);
my %count = map{$_ => 0}@dinucs;

for my $dinuc (@dinucs) {
    while($sequence=~/$dinuc/g) {
        $count{$dinuc}++;
    }
}
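
A possible shorter variant of the same match-only idea, sketched here rather than taken from the answer, lets Perl count the matches itself by evaluating the match in list context:

```perl
use strict;
use warnings;

my $sequence = 'AACGTACTGACGTACTGGTTGGTACGA';
my @dinucs   = qw(AA AC AG AT CC CG);
my %count;

for my $dinuc (@dinucs) {
    # In list context, /g returns every match; assigning that list to the
    # empty list "()" and then to a scalar yields the number of matches.
    $count{$dinuc} = () = $sequence =~ /\Q$dinuc\E/g;
}

print "$_ => $count{$_}\n" for @dinucs;
```

Pairs that do not occur come out as 0 here, so no pre-initialisation of the hash is needed.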

Benchmark:

use Benchmark qw(cmpthese);

my $sequence = 'AACGTACTGACGTACTGGTTGGTACGA';
my @dinucs = qw(AA AC AG AT CC CG);
my %count = map{$_ => 0}@dinucs;

my $count = -3;
my $r = cmpthese($count, {
        'match' => sub {
            for my $dinuc (@dinucs) {
               while($sequence=~/$dinuc/g) {
                    $count{$dinuc}++;
               }
            }
        },
        'substitute' => sub {
            for my $dinuc (@dinucs) {
                $count{$dinuc} = ($sequence =~ s/\Q$dinuc\E/$dinuc/g);
            }
         }
});

Output:

              Rate substitute      Match
Substitute 13897/s         --       -11%
Match      15622/s        12%         --

小兔几 2024-12-23 22:58:40

Regex works if you're careful, but there's a simple solution using substr that will be faster and more flexible.

(As of this posting, the regex solution marked as accepted will fail to correctly count dinucleotides in repeated regions like 'AAAA...', of which there are many in naturally occurring sequences.

Once you match 'AA', the regex search resumes on the third character, skipping the middle 'AA' dinucleotide. This doesn't affect the other dinucleotides since if you have 'AC' at one position, you're guaranteed not to have it in the next base, naturally. The particular sequence given in the question will not suffer from this problem since no base appears three times in a row.)
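
To make the skipped-match problem concrete, here is a small sketch using a made-up four-letter string (not the sequence from the question). It compares a plain match with the look-ahead form:

```perl
use strict;
use warnings;

my $seq = 'AAAA';    # hypothetical test string with a repeated base

# A plain match consumes both letters, so the search resumes at the third
# character and only two non-overlapping 'AA' matches are found.
my $plain = () = $seq =~ /AA/g;

# A look-ahead consumes only the first letter, so every start position counts.
my $overlapping = () = $seq =~ /A(?=A)/g;

print "plain: $plain, overlapping: $overlapping\n";    # plain: 2, overlapping: 3
```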

The method I suggest is more flexible in that it can count words of any length; extending the regex method to longer words is complicated since you have to do even more gymnastics with your regex to get an accurate count.

sub substrWise {
    my ($seq, $wordLength) = @_;

    my $cnt = {};

    my $w;
    for my $i (0 .. length($seq) - $wordLength) {
        $w = substr($seq, $i, $wordLength);
        $cnt->{$w}++;
    }

    return $cnt;
}

sub regexWise {
    my ($seq, $dinucs) = @_;

    my $cnt = {};
    for my $d (@$dinucs) {
        if (substr($d, 0,1) eq substr($d, 1,1) ) {
            my $n = substr($d, 0,1);
            $cnt->{$d} = ($seq =~ s/$n(?=$n)/$n/g); # use look-ahead
        } else {
            $cnt->{$d} = ($seq =~ s/$d/$d/g);
        }
    }

    return $cnt;
}


my @dinucs = qw(AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT);

my $sequence = 'AACGTACTGACGTACTGGTTGGTACGA';

use Test::More tests => 1;
my $rWise = regexWise($sequence, \@dinucs);
my $sWise = substrWise($sequence, 2);
$sWise->{$_} //= '' for @dinucs; # substrWise will not create keys for words not found
# this seems like desirable behavior IMO,
# but i'm adding '' to show that the counts match
is_deeply($rWise, $sWise, 'verify equivalence');

use Benchmark qw(:all);
cmpthese(100000, {
    'regex' => sub {
        regexWise($sequence, \@dinucs);
    },
    'substr' => sub {
        substrWise($sequence, 2);
    }
});

Output:

1..1
ok 1 - verify equivalence
          Rate  regex substr
regex  11834/s     --   -85%
substr 76923/s   550%     --

For longer sequences (10-100 kbase), the advantage is not as pronounced, but it still wins by about 70%.
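
The word-length flexibility mentioned earlier can be tried with a rough, self-contained sketch. The helper name count_words is hypothetical; it follows the same sliding-window idea as substrWise, applied here to three-letter words:

```perl
use strict;
use warnings;

# Hypothetical helper, same idea as substrWise: slide a window of $k
# letters over the sequence and tally each word that appears.
sub count_words {
    my ($seq, $k) = @_;
    my %cnt;
    $cnt{ substr($seq, $_, $k) }++ for 0 .. length($seq) - $k;
    return \%cnt;
}

my $tri = count_words('AACGTACTGACGTACTGGTTGGTACGA', 3);
print "$_ => $tri->{$_}\n" for sort keys %$tri;
```

Only words that actually occur get a key, and overlapping occurrences are counted because the window advances one letter at a time.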
