使用 Perl 查找最大的开放阅读框
我创建了一个程序,可以读取 DNA 序列,生成互补链,并进一步翻译为 mRNA。然而,我必须找到该 DNA 的最长开放阅读框。我已经编写了一些代码,但是当它打印出语句时,我没有得到答案。帮助?
这就是我所拥有的;
# Search for the longest open reading frame for this DNA.
print "\nHere is the largest ORF, from 5' to 3':\n" ;
local $_ = $RNA_seq ;
while ( / AUG /g ) {
my $start = pos () - 2 ;
if ( / UGA|UAA|UAG /g ) {
my $stop = pos ;
$gene = substr ( $_ , $start - 1 , $stop - $start + 1 ), $/ ;
print "$gene" ;
}
}
# The next set of commands translates the ORF found above for an amino acid seq.
print "\nThe largest reading Frame is:\t\t\t" . $protein { "gene" } . "\n" ;
sub translate {
my ( $gene , $reading_frame ) = @_ ;
my %protein = ();
for ( $i = $reading_frame ; $i < length ( $gene ); $i += 3 ) {
$codon = substr ( $gene , $i , 3 );
$amino_acid = translate_codon( $codon );
$protein { $amino_acid }++;
$protein { "gene" } .= $amino_acid ;
}
return %protein ;
}
sub translate_codon {
if ( $_ [ 0 ] =~ / GC[AGCU] /i ) { return A;} # Alanine;
if ( $_ [ 0 ] =~ / UGC|UGU /i ) { return C;} # Cysteine
if ( $_ [ 0 ] =~ / GAC|GAU /i ) { return D;} # Aspartic Acid;
if ( $_ [ 0 ] =~ / GAA|GAG /i ) { return Q;} # Glutamine;
if ( $_ [ 0 ] =~ / UUC|UUU /i ) { return F;} # Phenylalanine;
if ( $_ [ 0 ] =~ / GG[AGCU] /i ) { return G;} # Glycine;
if ( $_ [ 0 ] =~ / CAC|CAU /i ) { return His;} # Histine (start codon);
if ( $_ [ 0 ] =~ / AU[AUC] /i ) { return I;} # Isoleucine;
if ( $_ [ 0 ] =~ / AAA|AAG /i ) { return K;} # Lysine;
if ( $_ [ 0 ] =~ / UUA|UUG|CU[AGCU] /i ) { return Leu;} # Leucine;
if ( $_ [ 0 ] =~ / AUG /i ) { return M;} # Methionine;
if ( $_ [ 0 ] =~ / AAC|AAU /i ) { return N;} # Asparagine;
if ( $_ [ 0 ] =~ / CC[AGCU] /i ) { return P;} # Proline;
if ( $_ [ 0 ] =~ / CAA|CAG /i ) { return G;} # Glutamine;
if ( $_ [ 0 ] =~ / AGA|AGG|CG[AGCU] /i ) { return R;} # Arginine;
if ( $_ [ 0 ] =~ / AGC|AGU|UC[AGCU] /i ) { return S;} # Serine;
if ( $_ [ 0 ] =~ / AC[AGCU] /i ) { return T;} # Threonine;
if ( $_ [ 0 ] =~ / GU[AGCU] /i ) { return V;} # Valine;
if ( $_ [ 0 ] =~ / UGG /i ) { return W;} # Tryptophan;
if ( $_ [ 0 ] =~ / UAC|UAU /i ) { return Y;} # Tyrosine;
if ( $_ [ 0 ] =~ / UAA|UGA|UAG /i ) { return "***" ;} # Stop Codons;
}
我错过了什么吗?
I've created a program that can read DNA sequences that can generate the complementary strand, which further translates to an mRNA. However, I have to find the longest possible open reading frame for that DNA. I've coded something, but when it prints out the statement, I get no answer. Help?
This is what I have;
# Search for the longest open reading frame for this DNA.
print "\nHere is the largest ORF, from 5' to 3':\n" ;
local $_ = $RNA_seq ;
while ( / AUG /g ) {
my $start = pos () - 2 ;
if ( / UGA|UAA|UAG /g ) {
my $stop = pos ;
$gene = substr ( $_ , $start - 1 , $stop - $start + 1 ), $/ ;
print "$gene" ;
}
}
# The next set of commands translates the ORF found above for an amino acid seq.
print "\nThe largest reading Frame is:\t\t\t" . $protein { "gene" } . "\n" ;
sub translate {
my ( $gene , $reading_frame ) = @_ ;
my %protein = ();
for ( $i = $reading_frame ; $i < length ( $gene ); $i += 3 ) {
$codon = substr ( $gene , $i , 3 );
$amino_acid = translate_codon( $codon );
$protein { $amino_acid }++;
$protein { "gene" } .= $amino_acid ;
}
return %protein ;
}
sub translate_codon {
if ( $_ [ 0 ] =~ / GC[AGCU] /i ) { return A;} # Alanine;
if ( $_ [ 0 ] =~ / UGC|UGU /i ) { return C;} # Cysteine
if ( $_ [ 0 ] =~ / GAC|GAU /i ) { return D;} # Aspartic Acid;
if ( $_ [ 0 ] =~ / GAA|GAG /i ) { return Q;} # Glutamine;
if ( $_ [ 0 ] =~ / UUC|UUU /i ) { return F;} # Phenylalanine;
if ( $_ [ 0 ] =~ / GG[AGCU] /i ) { return G;} # Glycine;
if ( $_ [ 0 ] =~ / CAC|CAU /i ) { return His;} # Histine (start codon);
if ( $_ [ 0 ] =~ / AU[AUC] /i ) { return I;} # Isoleucine;
if ( $_ [ 0 ] =~ / AAA|AAG /i ) { return K;} # Lysine;
if ( $_ [ 0 ] =~ / UUA|UUG|CU[AGCU] /i ) { return Leu;} # Leucine;
if ( $_ [ 0 ] =~ / AUG /i ) { return M;} # Methionine;
if ( $_ [ 0 ] =~ / AAC|AAU /i ) { return N;} # Asparagine;
if ( $_ [ 0 ] =~ / CC[AGCU] /i ) { return P;} # Proline;
if ( $_ [ 0 ] =~ / CAA|CAG /i ) { return G;} # Glutamine;
if ( $_ [ 0 ] =~ / AGA|AGG|CG[AGCU] /i ) { return R;} # Arginine;
if ( $_ [ 0 ] =~ / AGC|AGU|UC[AGCU] /i ) { return S;} # Serine;
if ( $_ [ 0 ] =~ / AC[AGCU] /i ) { return T;} # Threonine;
if ( $_ [ 0 ] =~ / GU[AGCU] /i ) { return V;} # Valine;
if ( $_ [ 0 ] =~ / UGG /i ) { return W;} # Tryptophan;
if ( $_ [ 0 ] =~ / UAC|UAU /i ) { return Y;} # Tyrosine;
if ( $_ [ 0 ] =~ / UAA|UGA|UAG /i ) { return "***" ;} # Stop Codons;
}
Am I missing something?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
放在
代码的开头:
它将有助于检测通常不返回字符串而是文件句柄的问题
如果您想返回字符串,
Put
at the beginning of your code: it will help to detect problems
generally with
you are not returning a string but a file handle, if you want to return a string
如果您想在正则表达式中添加空格以提高可读性,则需要使用正则表达式
x
修饰符:然后,如果您需要表示空格(您不应该在这里),只需使用 <代码>\s
顺便说一句,不要使用
$_ [0]
考虑:If you want to put whitespace in your regular expressions for readability, you need to use the regular expression
x
modifier as:Then if you need to represent whitespace (which you shouldn't here) , simply use
\s
As an aside, rather than use
$_ [0]
consider:您是否检查过 BioPerl ?我还没有检查过,但也许它已经包含了您需要的功能。或者您目前正在进行一项必须自己编写程序的作业?
编辑
我不太确定您发布的代码的第一部分。例如,您的正则表达式中有空格。您尝试匹配的字符串实际上包含这些空格,还是密码子全部写在一起,如
AUGCCGGAUGA
中所示?在后一种情况下,即使存在您正在寻找的密码子,也根本不匹配(我可能会告诉您您知道的事情......我只是想指出这一点,以防万一:) )。另外,你的 pos 函数的代码是什么?
还有一件事:您不必设置
$_
,您可以简单地将$RNA_seq
与模式匹配,如下所示:EDIT 2
我对第一部分进行了更多思考,我认为这里建议使用
index
函数,因为它会立即为您提供序列中的位置:这给了您
我想说它可能需要一些改进,但我希望这对大家有所帮助 相同的。
Have you checked out BioPerl at all? I haven't checked, but perhaps it already includes the functionality you require. Or are you currently working on an assignment for which you have to write the program yourself?
EDIT
I'm not quite sure about the first section of the code you have posted. For example, there are blanks within your regexes. Does the string you are trying to match actually include those blanks, or are the codons all written together, as in
AUGCCGGAUGA
? In the latter case there'd simply be no match, even though the codons you're looking for are present (I'm probably telling you things you know though... I just wanted to point it out, just in case :) ).Also, what's the code of your
pos
function?One more thing: you don't have to set
$_
, you can simply match$RNA_seq
against a pattern, like so:EDIT 2
I've thought a bit more about the first section, I think using the
index
function would be advisable here, as that immediately gives you the position within the sequence:This gives you
I would say it might need some refinement, but I hope it'll be helpful all the same.
从头开始基因预测有多种应用。如果您仍然想从头开始编码,我建议您查看动态编程和用于局部序列比对的 Smith–Waterman 算法。但请注意,对于真核生物,您还应该考虑 RNA 剪接。
There are several applications for ab-initio gene predictions. If you still want to code it from scratch I recommend looking into dynamic programming and the Smith–Waterman algorithm for local sequence alignment. Note, however, that for eukaryote organisms you should also account for RNA splicing.