使用 Perl 查找最大的开放阅读框

发布于 2024-12-17 10:00:49 字数 2547 浏览 1 评论 0原文

我创建了一个程序，可以读取 DNA 序列，生成互补链，并进一步翻译为 mRNA。然而，我必须找到该 DNA 的最长开放阅读框。我已经编写了一些代码，但是当它打印出语句时，我没有得到答案。帮助？

这就是我所拥有的；

# Search for the longest open reading frame for this DNA.
print "\nHere is the largest ORF, from 5' to 3':\n" ;
local $_ = $RNA_seq ;
while ( / AUG /g ) {
    my $start = pos () - 2 ;
    if ( / UGA|UAA|UAG /g ) {
        my $stop = pos ;
        $gene = substr ( $_ , $start - 1 , $stop - $start + 1 ), $/ ;
        print "$gene" ;
    }
}

# The next set of commands translates the ORF found above for an amino acid seq.
print "\nThe largest reading Frame is:\t\t\t" . $protein { "gene" } . "\n" ;
sub translate {
    my ( $gene , $reading_frame ) = @_ ;
    my %protein = ();
    for ( $i = $reading_frame ; $i < length ( $gene ); $i += 3 ) {
        $codon = substr ( $gene , $i , 3 );
        $amino_acid = translate_codon( $codon );
        $protein { $amino_acid }++;
        $protein { "gene" } .= $amino_acid ;
    }
    return %protein ;
}

sub translate_codon {
if ( $_ [ 0 ] =~ / GC[AGCU] /i )             { return A;} # Alanine;
if ( $_ [ 0 ] =~ / UGC|UGU /i )              { return C;} # Cysteine
if ( $_ [ 0 ] =~ / GAC|GAU /i )              { return D;} # Aspartic Acid;
if ( $_ [ 0 ] =~ / GAA|GAG /i )              { return Q;} # Glutamine;
if ( $_ [ 0 ] =~ / UUC|UUU /i )              { return F;} # Phenylalanine;
if ( $_ [ 0 ] =~ / GG[AGCU] /i )             { return G;} # Glycine;
if ( $_ [ 0 ] =~ / CAC|CAU /i )              { return His;} # Histine (start codon);
if ( $_ [ 0 ] =~ / AU[AUC] /i )              { return I;} # Isoleucine;
if ( $_ [ 0 ] =~ / AAA|AAG /i )              { return K;} # Lysine;
if ( $_ [ 0 ] =~ / UUA|UUG|CU[AGCU] /i )     { return Leu;} # Leucine;
if ( $_ [ 0 ] =~ / AUG /i )                  { return M;} # Methionine;
if ( $_ [ 0 ] =~ / AAC|AAU /i )              { return N;} # Asparagine;
if ( $_ [ 0 ] =~ / CC[AGCU] /i )             { return P;} # Proline;
if ( $_ [ 0 ] =~ / CAA|CAG /i )              { return G;} # Glutamine;
if ( $_ [ 0 ] =~ / AGA|AGG|CG[AGCU] /i )     { return R;} # Arginine;
if ( $_ [ 0 ] =~ / AGC|AGU|UC[AGCU] /i )     { return S;} # Serine;
if ( $_ [ 0 ] =~ / AC[AGCU] /i )             { return T;} # Threonine;
if ( $_ [ 0 ] =~ / GU[AGCU] /i )             { return V;} # Valine;
if ( $_ [ 0 ] =~ / UGG /i )                  { return W;} # Tryptophan;
if ( $_ [ 0 ] =~ / UAC|UAU /i )              { return Y;} # Tyrosine;
if ( $_ [ 0 ] =~ / UAA|UGA|UAG /i )          { return "***" ;} # Stop Codons;
}

我错过了什么吗？

原文

I've created a program that can read DNA sequences that can generate the complementary strand, which further translates to an mRNA. However, I have to find the longest possible open reading frame for that DNA. I've coded something, but when it prints out the statement, I get no answer. Help?

This is what I have;

# Search for the longest open reading frame for this DNA.
print "\nHere is the largest ORF, from 5' to 3':\n" ;
local $_ = $RNA_seq ;
while ( / AUG /g ) {
    my $start = pos () - 2 ;
    if ( / UGA|UAA|UAG /g ) {
        my $stop = pos ;
        $gene = substr ( $_ , $start - 1 , $stop - $start + 1 ), $/ ;
        print "$gene" ;
    }
}

# The next set of commands translates the ORF found above for an amino acid seq.
print "\nThe largest reading Frame is:\t\t\t" . $protein { "gene" } . "\n" ;
sub translate {
    my ( $gene , $reading_frame ) = @_ ;
    my %protein = ();
    for ( $i = $reading_frame ; $i < length ( $gene ); $i += 3 ) {
        $codon = substr ( $gene , $i , 3 );
        $amino_acid = translate_codon( $codon );
        $protein { $amino_acid }++;
        $protein { "gene" } .= $amino_acid ;
    }
    return %protein ;
}

sub translate_codon {
if ( $_ [ 0 ] =~ / GC[AGCU] /i )             { return A;} # Alanine;
if ( $_ [ 0 ] =~ / UGC|UGU /i )              { return C;} # Cysteine
if ( $_ [ 0 ] =~ / GAC|GAU /i )              { return D;} # Aspartic Acid;
if ( $_ [ 0 ] =~ / GAA|GAG /i )              { return Q;} # Glutamine;
if ( $_ [ 0 ] =~ / UUC|UUU /i )              { return F;} # Phenylalanine;
if ( $_ [ 0 ] =~ / GG[AGCU] /i )             { return G;} # Glycine;
if ( $_ [ 0 ] =~ / CAC|CAU /i )              { return His;} # Histine (start codon);
if ( $_ [ 0 ] =~ / AU[AUC] /i )              { return I;} # Isoleucine;
if ( $_ [ 0 ] =~ / AAA|AAG /i )              { return K;} # Lysine;
if ( $_ [ 0 ] =~ / UUA|UUG|CU[AGCU] /i )     { return Leu;} # Leucine;
if ( $_ [ 0 ] =~ / AUG /i )                  { return M;} # Methionine;
if ( $_ [ 0 ] =~ / AAC|AAU /i )              { return N;} # Asparagine;
if ( $_ [ 0 ] =~ / CC[AGCU] /i )             { return P;} # Proline;
if ( $_ [ 0 ] =~ / CAA|CAG /i )              { return G;} # Glutamine;
if ( $_ [ 0 ] =~ / AGA|AGG|CG[AGCU] /i )     { return R;} # Arginine;
if ( $_ [ 0 ] =~ / AGC|AGU|UC[AGCU] /i )     { return S;} # Serine;
if ( $_ [ 0 ] =~ / AC[AGCU] /i )             { return T;} # Threonine;
if ( $_ [ 0 ] =~ / GU[AGCU] /i )             { return V;} # Valine;
if ( $_ [ 0 ] =~ / UGG /i )                  { return W;} # Tryptophan;
if ( $_ [ 0 ] =~ / UAC|UAU /i )              { return Y;} # Tyrosine;
if ( $_ [ 0 ] =~ / UAA|UGA|UAG /i )          { return "***" ;} # Stop Codons;
}

Am I missing something?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

初见你 2024-12-24 10:00:49

放在

use strict;
use warnings;

代码的开头：

它将有助于检测通常不返回字符串而是文件句柄的问题

return A;

如果您想返回字符串，

return 'A';

Put

use strict;
use warnings;

at the beginning of your code: it will help to detect problems

generally with

return A;

you are not returning a string but a file handle, if you want to return a string

return 'A';

回复收藏 0 原文

小梨窩很甜 2024-12-24 10:00:49

如果您想在正则表达式中添加空格以提高可读性，则需要使用正则表达式 x 修饰符：

if ( $_ [ 0 ] =~ / GC[AGCU] /xi )

然后，如果您需要表示空格（您不应该在这里），只需使用 <代码>\s
顺便说一句，不要使用 $_ [0] 考虑：

sub translate_codon {
    my $codon = shift;
    return q(A) if $codon =~ m/ GC[AGCU] /xi;
    return q(F) if $codon =~ m/ UU[CU]   /xi;
}

If you want to put whitespace in your regular expressions for readability, you need to use the regular expression x modifier as:

if ( $_ [ 0 ] =~ / GC[AGCU] /xi )

Then if you need to represent whitespace (which you shouldn't here) , simply use \s
As an aside, rather than use $_ [0] consider:

sub translate_codon {
    my $codon = shift;
    return q(A) if $codon =~ m/ GC[AGCU] /xi;
    return q(F) if $codon =~ m/ UU[CU]   /xi;
}

回复收藏 0 原文

独闯女儿国 2024-12-24 10:00:49

您是否检查过 BioPerl ？我还没有检查过，但也许它已经包含了您需要的功能。或者您目前正在进行一项必须自己编写程序的作业？

编辑

我不太确定您发布的代码的第一部分。例如，您的正则表达式中有空格。您尝试匹配的字符串实际上包含这些空格，还是密码子全部写在一起，如 AUGCCGGAUGA 中所示？在后一种情况下，即使存在您正在寻找的密码子，也根本不匹配（我可能会告诉您您知道的事情......我只是想指出这一点，以防万一:) ）。

另外，你的 pos 函数的代码是什么？

还有一件事：您不必设置 $_，您可以简单地将 $RNA_seq 与模式匹配，如下所示：

if ($RNA_seq =~ m/UGA/) { # ...

EDIT 2

我对第一部分进行了更多思考，我认为这里建议使用 index 函数，因为它会立即为您提供序列中的位置：

#!/usr/bin/perl

use strict;
use warnings;
use List::Util qw( min );

my $string = 'UGAAUGGGCCAUAUUUGACUGAGUUUCAGAUGCCAUUGGCGAUUAG';
# the genes:     *-------------*           *---------------*

my $prev = -1;
my @genes;

while (1) {
   my $start = index($string, 'AUG', $prev+1);
   my $stop  = min grep $_ > -1, (index($string, 'UGA', $start+1), index($string, 'UAA', $start+1),
                                  index($string, 'UAG', $start+1));

   # exit the loop if index is -1, i.e. if there was no more match
   last if ($start < 0 or $stop < 0); 

   # using +1 so that 'AUGA' will not count as a gene
   if ($stop > $start+1) { 
      push @genes, substr($string, $start, $stop);
   }

   $prev = $stop; # I'm assuming that no second AUG will come between AUG and a stop codon
}

print "@genes\n";

这给了您

AUGGGCCAUAUUUGA AUGCCAUUGGCGAUUAG

我想说它可能需要一些改进，但我希望这对大家有所帮助相同的。

Have you checked out BioPerl at all? I haven't checked, but perhaps it already includes the functionality you require. Or are you currently working on an assignment for which you have to write the program yourself?

EDIT

I'm not quite sure about the first section of the code you have posted. For example, there are blanks within your regexes. Does the string you are trying to match actually include those blanks, or are the codons all written together, as in AUGCCGGAUGA? In the latter case there'd simply be no match, even though the codons you're looking for are present (I'm probably telling you things you know though... I just wanted to point it out, just in case :) ).

Also, what's the code of your pos function?

One more thing: you don't have to set $_, you can simply match $RNA_seq against a pattern, like so:

if ($RNA_seq =~ m/UGA/) { # ...

EDIT 2

I've thought a bit more about the first section, I think using the index function would be advisable here, as that immediately gives you the position within the sequence:

#!/usr/bin/perl

use strict;
use warnings;
use List::Util qw( min );

my $string = 'UGAAUGGGCCAUAUUUGACUGAGUUUCAGAUGCCAUUGGCGAUUAG';
# the genes:     *-------------*           *---------------*

my $prev = -1;
my @genes;

while (1) {
   my $start = index($string, 'AUG', $prev+1);
   my $stop  = min grep $_ > -1, (index($string, 'UGA', $start+1), index($string, 'UAA', $start+1),
                                  index($string, 'UAG', $start+1));

   # exit the loop if index is -1, i.e. if there was no more match
   last if ($start < 0 or $stop < 0); 

   # using +1 so that 'AUGA' will not count as a gene
   if ($stop > $start+1) { 
      push @genes, substr($string, $start, $stop);
   }

   $prev = $stop; # I'm assuming that no second AUG will come between AUG and a stop codon
}

print "@genes\n";

This gives you