如何在 Perl 中从 DNA 序列中提取起始密码子和终止密码子？

发布于 2024-08-07 11:28:21 字数 3333 浏览 3 评论 0原文

我有下面的代码尝试识别给定 DNA 序列的起始密码子和终止密码子的位置。我们将起始密码子定义为ATG序列，将结束密码子定义为TGA、TAA、TAG序列。

我遇到的问题是下面的代码仅适用于前两个序列（DM208659 和 AF038953），但不适用于其余序列。

我下面的方法有什么问题吗？

可以从此处复制粘贴此代码。

      #!/usr/bin/perl -w


while (<DATA>) {
    chomp;
    print "$_\n";
    my ($id,$rna_sq) = split(/\s+/,$_);

    local $_ = $rna_sq;
    while (/atg/g) {
        my $start = pos() - 2;

        if (/tga|taa|tag/g) {

            my $stop    = pos();
            my $gene    = substr( $_, $start - 1, $stop - $start + 1 ),$/;
            my $genelen = length($gene);
            my $ct      = "$id $start $stop $gene $genelen";
            print "\t$ct\n";

        }

    }

}

__DATA__
DM208659    gtgggcctcaaatgtggagcactattctgatgtccaagtggaaagtgctgcgacatttgagcgtcac
AF038953    gatcccagacctcggcttgcagtagtgttagactgaagataaagtaagtgctgtttgggctaacaggatctcctcttgcagtctgcagcccaggacgctgattccagcagcgccttaccgcgcagcccgaagattcactatggtgaaaatcgccttcaatacccctaccgccgtgcaaaaggaggaggcgcggcaagacgtggaggccctcctgagccgcacggtcagaactcagatactgaccggcaaggagctccgagttgccacccaggaaaaagagggctcctctgggagatgtatgcttactctcttaggcctttcattcatcttggcaggacttattgttggtggagcctgcatttacaagtacttcatgcccaagagcaccatttaccgtggagagatgtgcttttttgattctgaggatcctgcaaattcccttcgtggaggagagcctaacttcctgcctgtgactgaggaggctgacattcgtgaggatgacaacattgcaatcattgatgtgcctgtccccagtttctctgatagtgaccctgcagcaattattcatgactttgaaaagggaatgactgcttacctggacttgttgctggggaactgctatctgatgcccctcaatacttctattgttatgcctccaaaaaatctggtagagctctttggcaaactggcgagtggcagatatctgcctcaaacttatgtggttcgagaagacctagttgctgtggaggaaattcgtgatgttagtaaccttggcatctttatttaccaactttgcaataacagaaagtccttccgccttcgtcgcagagacctcttgctgggtttcaacaaacgtgccattgataaatgctggaagattagacacttccccaacgaatttattgttgagaccaagatctgtcaagagtaagaggcaacagatagagtgtccttggtaataagaagtcagagatttacaatatgactttaacattaaggtttatgggatactcaagatatttactcatgcatttactctattgcttatgccgtaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
BC021011    ggggagtccggggcggcgcctggaggcggagccgcccgctgggctaaatggggcagaggccgggaggggtgggggttccccgcgccgcagccatggagcagcttcgcgccgccgcccgtctgcagattgttctg
DM208660    gggatactcaaaatgggggcgctttcctttttgtctgtactgggaagtgcttcgattttggggtgtccc
AF038954    ggacccaagggggccttcgaggtgccttaggccgcttgccttgctctcagaatcgctgccgccatggctagtcagtctcaggggattcagcagctgctgcaggccgagaagcgggcagccgagaaggtgtccgaggcccgcaaaagaaagaaccggaggctgaagcaggccaaagaagaagctcaggctgaaattgaacagtaccgcctgcagagggagaaagaattcaaggccaaggaagctgcggcattgggatcccgtggcagttgcagcactgaagtggagaaggagacccaggagaagatgaccatcctccagacatacttccggcagaacagggatgaagtcttggacaacctcttggcttttgtctgtgacattcggccagaaatccatgaaaactaccgcataaatggatagaagagagaagcacctgtgctgtggagtggcattttagatgccctcacgaatatggaagcttagcacagctctagttacattcttaggagatggccattaaattatttccatatattataagagaggtccttccactttttggagagtagccaatctagctttttggtaacagacttagaaattagcaaagatgtccagctttttaccacagattcctgagggattttagatgggtaaatagagtcagactttgaccaggttttgggcaaagcacatgtatatcagtgtggacttttcctttcttagatctagtttaaaaaaaaaaaccccttaccattctttgaagaaaggaggggattaaataattttttcccctaacactttcttgaaggtcaggggctttatctatgaaaagttagtaaatagttctttgtaacctgtgtgaagcagcagccagccttaaagtagtccattcttgctaatggttagaacagtgaatactagtggaattgtttgggctgcttttagtttctcttaatcaaaattactagatgatagaattcaagaacttgttacatgtattacttggtgtatcgataatcatttaaaagtaaagactctgtcatgcaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

原文

I have a code below that try to identify the position of start and end codon of the given DNA sequences.
We define start codon as a ATG sequence and end codon as TGA,TAA,TAG sequences.

The problem I have is that the code below works only for first two sequences (DM208659 and AF038953) but not the rest.

What's wrong with my approach below?

This code can be copy-pasted from here.

      #!/usr/bin/perl -w


while (<DATA>) {
    chomp;
    print "$_\n";
    my ($id,$rna_sq) = split(/\s+/,$_);

    local $_ = $rna_sq;
    while (/atg/g) {
        my $start = pos() - 2;

        if (/tga|taa|tag/g) {

            my $stop    = pos();
            my $gene    = substr( $_, $start - 1, $stop - $start + 1 ),$/;
            my $genelen = length($gene);
            my $ct      = "$id $start $stop $gene $genelen";
            print "\t$ct\n";

        }

    }

}

__DATA__
DM208659    gtgggcctcaaatgtggagcactattctgatgtccaagtggaaagtgctgcgacatttgagcgtcac
AF038953    gatcccagacctcggcttgcagtagtgttagactgaagataaagtaagtgctgtttgggctaacaggatctcctcttgcagtctgcagcccaggacgctgattccagcagcgccttaccgcgcagcccgaagattcactatggtgaaaatcgccttcaatacccctaccgccgtgcaaaaggaggaggcgcggcaagacgtggaggccctcctgagccgcacggtcagaactcagatactgaccggcaaggagctccgagttgccacccaggaaaaagagggctcctctgggagatgtatgcttactctcttaggcctttcattcatcttggcaggacttattgttggtggagcctgcatttacaagtacttcatgcccaagagcaccatttaccgtggagagatgtgcttttttgattctgaggatcctgcaaattcccttcgtggaggagagcctaacttcctgcctgtgactgaggaggctgacattcgtgaggatgacaacattgcaatcattgatgtgcctgtccccagtttctctgatagtgaccctgcagcaattattcatgactttgaaaagggaatgactgcttacctggacttgttgctggggaactgctatctgatgcccctcaatacttctattgttatgcctccaaaaaatctggtagagctctttggcaaactggcgagtggcagatatctgcctcaaacttatgtggttcgagaagacctagttgctgtggaggaaattcgtgatgttagtaaccttggcatctttatttaccaactttgcaataacagaaagtccttccgccttcgtcgcagagacctcttgctgggtttcaacaaacgtgccattgataaatgctggaagattagacacttccccaacgaatttattgttgagaccaagatctgtcaagagtaagaggcaacagatagagtgtccttggtaataagaagtcagagatttacaatatgactttaacattaaggtttatgggatactcaagatatttactcatgcatttactctattgcttatgccgtaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
BC021011    ggggagtccggggcggcgcctggaggcggagccgcccgctgggctaaatggggcagaggccgggaggggtgggggttccccgcgccgcagccatggagcagcttcgcgccgccgcccgtctgcagattgttctg
DM208660    gggatactcaaaatgggggcgctttcctttttgtctgtactgggaagtgcttcgattttggggtgtccc
AF038954    ggacccaagggggccttcgaggtgccttaggccgcttgccttgctctcagaatcgctgccgccatggctagtcagtctcaggggattcagcagctgctgcaggccgagaagcgggcagccgagaaggtgtccgaggcccgcaaaagaaagaaccggaggctgaagcaggccaaagaagaagctcaggctgaaattgaacagtaccgcctgcagagggagaaagaattcaaggccaaggaagctgcggcattgggatcccgtggcagttgcagcactgaagtggagaaggagacccaggagaagatgaccatcctccagacatacttccggcagaacagggatgaagtcttggacaacctcttggcttttgtctgtgacattcggccagaaatccatgaaaactaccgcataaatggatagaagagagaagcacctgtgctgtggagtggcattttagatgccctcacgaatatggaagcttagcacagctctagttacattcttaggagatggccattaaattatttccatatattataagagaggtccttccactttttggagagtagccaatctagctttttggtaacagacttagaaattagcaaagatgtccagctttttaccacagattcctgagggattttagatgggtaaatagagtcagactttgaccaggttttgggcaaagcacatgtatatcagtgtggacttttcctttcttagatctagtttaaaaaaaaaaaccccttaccattctttgaagaaaggaggggattaaataattttttcccctaacactttcttgaaggtcaggggctttatctatgaaaagttagtaaatagttctttgtaacctgtgtgaagcagcagccagccttaaagtagtccattcttgctaatggttagaacagtgaatactagtggaattgtttgggctgcttttagtttctcttaatcaaaattactagatgatagaattcaagaacttgttacatgtattacttggtgtatcgataatcatttaaaagtaaagactctgtcatgcaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

戒ㄋ 2024-08-14 11:28:22

我删除了 $_ 的使用（当你对它进行 local 化时，我尤其感到不寒而栗——你这样做是正确的，但为什么要强迫自己担心其他函数是否会clobber $_，而不是使用已经可用的 $rna_sq ？

另外我更正了 $start 和 $stop成为字符串中基于 0 的索引（这使得其余的数学运算更加直接），并提前计算了 $genelen，以便可以直接在 substr 中使用（或者，您可以将 $[ 本地化为 1 以使用基于 1 的数组索引，请参阅 perldoc perlvar。）

use strict;
use warnings;
while (my $line = <DATA>) {
    chomp $line;
    print "processing $line\n";
    my ($id, $rna_sq) = split(/\s+/, $line);

    while ($rna_sq =~ /atg/g) {
        # $start and $stop are 0-based indexes
        my $start = pos($rna_sq) - 3; # back up to include the start sequence

        # discard remnant if no stop sequence can be found
        last unless $rna_sq =~ /tga|taa|tag/g;

        my $stop    = pos($rna_sq);
        my $genelen = $stop - $start;

        my $gene    = substr($rna_sq, $start, $genelen);
        print "\t" . join(' ', $id, $start+1, $stop, $gene, $genelen) . "\n";
    }
}

I removed the use of $_ (I especially shuddered when you localized it -- you did so correctly, but why force yourself to worry if some other function is going to clobber $_, rather than use $rna_sq which is already available?

Additionally I corrected $start and $stop to be 0-based indexes into the string (which made the rest of the math more straight-forward), and calculated $genelen early so it could be used directly in the substr operation. (Alternatively, you could localize $[ to 1 to use 1-based array indexes, see perldoc perlvar.)

use strict;
use warnings;
while (my $line = <DATA>) {
    chomp $line;
    print "processing $line\n";
    my ($id, $rna_sq) = split(/\s+/, $line);

    while ($rna_sq =~ /atg/g) {
        # $start and $stop are 0-based indexes
        my $start = pos($rna_sq) - 3; # back up to include the start sequence

        # discard remnant if no stop sequence can be found
        last unless $rna_sq =~ /tga|taa|tag/g;

        my $stop    = pos($rna_sq);
        my $genelen = $stop - $start;

        my $gene    = substr($rna_sq, $start, $genelen);
        print "\t" . join(' ', $id, $start+1, $stop, $gene, $genelen) . "\n";
    }
}

回复收藏 0 原文

魂归处 2024-08-14 11:28:22

当 if (/tga|taa|tag/g) 找不到结束密码子时，它永远不会跳出内部循环。它不断重复匹配/atg/g，永远不会进一步前进。您可以将其从内循环中强制弹出：

if (/tga|taa|tag/g) {
    ...
}
else {
    last;
}

It's never breaking out of your inner loop when the if (/tga|taa|tag/g) fails to find an end codon. It keeps matching /atg/g repeatedly, never advancing any further. You could forcibly eject it from the inner loop:

if (/tga|taa|tag/g) {
    ...
}
else {
    last;
}

回复收藏 0 原文

々眼睛长脚气 2024-08-14 11:28:22

这完全取决于您是否想要生成可能重叠的序列。例如，序列 AF038954 包含 atgaccatcctccagacatacttccggcagaacagggatga，其末端与 atgaagtcttggacaacctcttggcttttgtctgtga 重叠。你想举报他们两个吗？

如果您不想报告重叠的序列，这是一个非常简单的问题，您可以使用单个正则表达式来解决：

while (<DATA>) {
    chomp;
    print "processing $_\n";
    my ($id, $rna_sq) = split;

    while ($rna_sq =~ /(atg.*?(?:tga|taa|tag))/g) {
      printf "\t%8s %4i %4i %s %i\n",
             $id,
             pos($rna_sq) - length($1) + 1,
             pos($rna_sq),
             $1,
             length($1);
      }
}

正则表达式 (atg.*?(?:tga|taa|tag)) 匹配您所需的开始，然后尽可能少地匹配接下来的内容（即 ? 来阻止 .* 变得“贪婪”），然后是您所需的结束。使用 while 循环对其进行迭代会在本次匹配后重新启动，这满足了不查找重叠的要求。

如果您确实想要报告重叠序列，则确实需要一个两阶段的过程：找到开始，找到结束，然后找到另一个开始，从上次停下来的地方继续寻找开始时间。但您仍然可以使用第二个正则表达式来完成更简单的工作：

while (<DATA>) {
    chomp;
    print "processing $_\n";
    my ($id, $rna_sq) = split;

    while ($rna_sq =~ /atg/g) {
      if (
这里我们使用（通常不推荐）$'特殊变量，其中包含匹配后的内容。我们在此查找序列的末尾并输出详细信息。因为我们针对 $rna_seq 的主要全局匹配不包含该序列（如上面所示），所以我们重新开始搜索上一个搜索停止的位置，即我们找到的起始点之后。这样我们就可以包含重叠的序列。
 =~ /(.*?(?:tga|taa|tag))/) {
        my $match = "atg$1";
        printf "\t%8s %4i %4i %s %i\n",
               $id,
               pos($rna_sq) - 2,
               pos($rna_sq) - 3 + length($match),
               $match,
               length($match);
      }
    }
}

这里我们使用（通常不推荐）$'特殊变量，其中包含匹配后的内容。我们在此查找序列的末尾并输出详细信息。因为我们针对 $rna_seq 的主要全局匹配不包含该序列（如上面所示），所以我们重新开始搜索上一个搜索停止的位置，即我们找到的起始点之后。这样我们就可以包含重叠的序列。

It all depends on whether you want to generate sequences which could overlap. For example, sequence AF038954 contains atgaccatcctccagacatacttccggcagaacagggatga, the end of which overlaps with atgaagtcttggacaacctcttggcttttgtctgtga. Do you want to report them both?

If you don't want to report sequences which overlap, this is a very simple problem, which you can solve with a single regexp:

while (<DATA>) {
    chomp;
    print "processing $_\n";
    my ($id, $rna_sq) = split;

    while ($rna_sq =~ /(atg.*?(?:tga|taa|tag))/g) {
      printf "\t%8s %4i %4i %s %i\n",
             $id,
             pos($rna_sq) - length($1) + 1,
             pos($rna_sq),
             $1,
             length($1);
      }
}

The regexp (atg.*?(?:tga|taa|tag)) matches your required start, then as little as possible of what comes next (that's the ? to stop the .* being "greedy") then your required end. Iterating over it with the while loop restarts after this match, which meets the requirement of not looking for overlaps.

If you do want overlapping sequences reported, you do need a two-stage process: find the start, find the end, and then find another start, picking up where you left off looking for the start the last time. But you can still do a simpler job using a second regexp:

while (<DATA>) {
    chomp;
    print "processing $_\n";
    my ($id, $rna_sq) = split;

    while ($rna_sq =~ /atg/g) {
      if (
Here we use the (generally non-recommended) $' special variable, which contains the content after the match. We look in this to find the end of the sequence and output the details. Because our main global match against $rna_seq doesn't include the sequence (as it does above) we restart the search for a start where the previous search left off, that is just after the start we found. This way we do include overlapping sequences.
 =~ /(.*?(?:tga|taa|tag))/) {
        my $match = "atg$1";
        printf "\t%8s %4i %4i %s %i\n",
               $id,
               pos($rna_sq) - 2,
               pos($rna_sq) - 3 + length($match),
               $match,
               length($match);
      }
    }
}

Here we use the (generally non-recommended) $' special variable, which contains the content after the match. We look in this to find the end of the sequence and output the details. Because our main global match against $rna_seq doesn't include the sequence (as it does above) we restart the search for a start where the previous search left off, that is just after the start we found. This way we do include overlapping sequences.

回复收藏 0 原文

~没有更多了~