Finding the overlapping base count and internal gaps in two strings

Posted on 2024-10-05 15:00:08

I have two strings of equal length that I need to compare.
I want to find the overlapping bases (.) and the internal gaps (*). Below is an example:

------ACTAAAAATACAAAAA--TTAGCCAGGCGTGGTGGCAC
-----TACTAAAAATACAAAAAAATTAGCCAGGTGTGGTGG---
      ................**.................

Number of overlapping bases = 33.
Number of internal gaps = 2.

I have no problem counting the overlap, but I can't work out how to
find the internal gaps. Below is my current code. It is horribly slow,
and in principle I need to process millions of such pairs.

#!/usr/bin/perl
use strict;
use warnings;

my $s1 = "------ACTAAAAATACAAAAA--TTAGCCAGGCGTGGTGGCAC";
my $s2 = "-----TACTAAAAATACAAAAAAATTAGCCAGGTGTGGTGG---";

print "$s1\n";
print "$s2\n";

my %base = ("A" => 1, "T" => 1, "C" => 1, "G" => 1);

my $ovlp_basecount = 0;
my $internal_gap = 0;

# Walk both strings position by position (the last valid index is length - 1)
foreach my $si ( 0 .. length($s1) - 1 ) {
    my $base1 = substr($s1, $si, 1);
    my $base2 = substr($s2, $si, 1);

    # Overlap: both strings have a base at this position
    if ( $base{$base1} && $base{$base2} ) {
        $ovlp_basecount++;
    }

    # Not sure how to compute internal gap
}

print "TOTAL OVERLAP BASE = $ovlp_basecount\n";
print "TOTAL Internal Gap ?\n";

Please advise how I can find the internal gaps and the overlap efficiently.

Comments (2)

缪败 2024-10-12 15:00:08

You can use a bitwise OR on the strings, after replacing each '-' with a space, to find the areas in one string that line up with blank areas in the other. This also converts the non-overlapping characters to lower case, which makes the overlap easy to find too:

#!/usr/bin/perl

use strict;
use warnings;

my $s1 = "------ACTAAAAATACAAAAA--TTAGCCAGGCGTGGTGGCAC";
my $s2 = "-----TACTAAAAATACAAAAAAATTAGCCAGGTGTGGTGG---";

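# Turn gaps into spaces (0x20) so that OR-ing a space with a base lowercases it
$s1 =~ tr/-/\x20/;
$s2 =~ tr/-/\x20/;
my $or = $s1 | $s2;
# Capture a lowercase run flanked by uppercase bases (finds a single gap only)
(my $gap) = $or =~ m/^.*[ACTG]([actg]+)[ACTG].*$/;
# Keep only uppercase characters: positions where both strings have a base
(my $overlap = $or) =~ s/[^A-Z]//g;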

print "s1:      '$s1'\n";
print "s2:      '$s2'\n";
print "OR:      '$or'\n";
printf "Gap:     '%s' (%d)\n", $gap,     length $gap;
printf "Overlap  '%s' (%d)\n", $overlap, length $overlap;

Prints:

s1:      '      ACTAAAAATACAAAAA  TTAGCCAGGCGTGGTGGCAC'
s2:      '     TACTAAAAATACAAAAAAATTAGCCAGGTGTGGTGG   '
OR:      '     tACTAAAAATACAAAAAaaTTAGCCAGGWGTGGTGGcac'
Gap:     'aa' (2)
Overlap  'ACTAAAAATACAAAAATTAGCCAGGWGTGGTGG' (33)
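
The lowercase letters and the W in the OR line follow directly from the ASCII codes. As a small illustration (not part of the original answer), OR-ing a base with a space sets bit 0x20 and lowercases it, while OR-ing two different bases gives some other uppercase letter, which is why a mismatched position still counts towards the overlap:

```perl
use strict;
use warnings;

# String bitwise OR works byte by byte on the ASCII codes.
print "A" | " ", "\n";   # 0x41 | 0x20 = 0x61, prints "a"
print "C" | "T", "\n";   # 0x43 | 0x54 = 0x57, prints "W"
print "G" | "G", "\n";   # identical bases are unchanged, prints "G"
```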

For more information on string bitwise operations:

http://teaching.idallen.com/cst8214/08w/notes/bit_operations.txt
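
Since the question mentions millions of pairs, the same idea could be wrapped in a function that only counts, without building the intermediate strings. The following is a rough sketch under a few assumptions: the name overlap_and_gaps is made up for illustration, an internal gap is taken to be any lowercase run in the OR string with an uppercase character on both sides, and all such runs are summed instead of capturing just one. Matching [A-Z] and [a-z] rather than [ACTG] also covers gaps that sit next to a mismatch such as W.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the answer above): returns
# (overlap count, total internal gap length) for two aligned strings
# of equal length that use '-' as the gap character.
sub overlap_and_gaps {
    my ($s1, $s2) = @_;
    tr/-/ / for $s1, $s2;    # '-' becomes a space so OR lowercases unpaired bases
    my $or = $s1 | $s2;      # uppercase = base in both, lowercase = base in one

    # tr with an empty replacement list counts characters without changing $or.
    my $overlap = ($or =~ tr/A-Z//);

    # Sum every lowercase run flanked by uppercase on both sides. The
    # lookarounds do not consume the flanking bases, so adjacent gaps
    # separated by a single base are all counted.
    my $gaps = 0;
    $gaps += length($1) while $or =~ /(?<=[A-Z])([a-z]+)(?=[A-Z])/g;

    return ($overlap, $gaps);
}

my ($ov, $gap) = overlap_and_gaps(
    "------ACTAAAAATACAAAAA--TTAGCCAGGCGTGGTGGCAC",
    "-----TACTAAAAATACAAAAAAATTAGCCAGGTGTGGTGG---",
);
print "overlap=$ov gaps=$gap\n";    # overlap=33 gaps=2
```

A position where both strings have a gap stays a space in the OR string, so under this definition it ends a run instead of being counted.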

So要识趣 2024-10-12 15:00:08

Assuming the gaps never overlap, you can solve this using regular expressions. Here's an answer for your s1.

echo '------ACTAAAAATACAAAAA--TTAGCCAGGCGTGGTGGCAC' | perl -ne '$s = 0; $s += length($_) for /(?<=[GTAC])(-+)(?=[GTAC])/g; print "$s\n";'
2
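
To cover both strings with the regex approach, one possible sketch is a small helper applied to each string in turn. The name internal_gap_len is hypothetical, and the caveat from the answer still holds: this counts '-' runs inside each sequence separately, so it assumes the gaps of the two strings never overlap and that every such run falls inside the overlapping region.

```perl
use strict;
use warnings;

# Hypothetical helper: total length of all '-' runs that have a base
# on both sides within a single string.
sub internal_gap_len {
    my ($s) = @_;
    my $n = 0;
    $n += length($1) while $s =~ /(?<=[ACGT])(-+)(?=[ACGT])/g;
    return $n;
}

my $s1 = "------ACTAAAAATACAAAAA--TTAGCCAGGCGTGGTGGCAC";
my $s2 = "-----TACTAAAAATACAAAAAAATTAGCCAGGTGTGGTGG---";

my $total = 0;
$total += internal_gap_len($_) for $s1, $s2;
print "$total\n";    # 2
```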