How can I rewrite this Perl code as a one-liner (or in fewer lines on the command line)?

Posted on 2024-10-25 19:12:51

I have code like this:

#!/usr/bin/perl
use strict;
use warnings;      
my %proteins = qw/
    UUU F UUC F UUA L UUG L UCU S UCC S UCA S UCG S UAU Y UAC Y UGU C UGC C UGG W
    CUU L CUC L CUA L CUG L CCU P CCC P CCA P CCG P CAU H CAC H CAA Q CAG Q CGU R CGC R CGA R CGG R
    AUU I AUC I AUA I AUG M ACU T ACC T ACA T ACG T AAU N AAC N AAA K AAG K AGU S AGC S AGA R AGG R
    GUU V GUC V GUA V GUG V GCU A GCC A GCA A GCG A GAU D GAC D GAA E GAG E GGU G GGC G GGA G GGG G
    /;
open(INPUT,"<dna.txt");
while (<INPUT>) {    
    tr/[a,c,g,t]/[A,C,G,T]/;
    y/GCTA/CGAU/;    
    foreach my $protein (/(...)/g) {
        if (defined $proteins{$protein}) {
        print $proteins{$protein};
        }
}
}
close(INPUT);

This code is related to my other question's answer: DNA to RNA and Getting Proteins with Perl

The output of the program is:

SIMQNISGREAT

How can I rewrite this code in Perl so that it runs from the command line with less code (a one-liner, if possible)?

PS 1: dna.txt looks like this:

TCATAATACGTTTTGTATTCGCCAGCGCTTCGGTGT

PS 2: If it makes the code shorter, it is acceptable to move the contents of the my %proteins variable into a file.

Comments (5)

鱼窥荷 2024-11-01 19:12:51

The only change I would recommend is simplifying your while loop:

while (<INPUT>) {
    tr/acgt/ACGT/;
    tr/GCTA/CGAU/;
    foreach my $protein (/(...)/g) {
        if (defined $proteins{$protein}) {
            print $proteins{$protein};
        }
    }
}

Since y and tr are synonyms, you should use only one of them. I think tr reads better than y, so I picked tr. Further, you were calling them very differently; this version has the same effect and mentions only the letters you actually change. (In the original, the brackets and commas were being transliterated to themselves, which makes it much harder to see what is actually being changed.)

You might want to remove the open(INPUT,"<dna.txt"); and corresponding close(INPUT); lines, as they make it much harder to use your program in shell pipelines or with different input files. But that's up to you: if the input file will always be dna.txt and never anything different, this is all right.
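
To illustrate the point about not hard-coding the input file, here is a minimal sketch that keeps the simplified loop but wraps it in a function taking any filehandle. The name translate_fh and the in-memory sample are assumptions made for this example, not part of the original program.

use strict;
use warnings;

my %proteins = qw/
    UUU F UUC F UUA L UUG L UCU S UCC S UCA S UCG S UAU Y UAC Y UGU C UGC C UGG W
    CUU L CUC L CUA L CUG L CCU P CCC P CCA P CCG P CAU H CAC H CAA Q CAG Q CGU R CGC R CGA R CGG R
    AUU I AUC I AUA I AUG M ACU T ACC T ACA T ACG T AAU N AAC N AAA K AAG K AGU S AGC S AGA R AGG R
    GUU V GUC V GUA V GUG V GCU A GCC A GCA A GCG A GAU D GAC D GAA E GAG E GGU G GGC G GGA G GGG G
    /;

# Hypothetical helper: translate everything readable from $fh
# and return the resulting protein string.
sub translate_fh {
    my ($fh) = @_;
    my $out = '';
    while (my $line = <$fh>) {
        $line =~ tr/acgt/ACGT/;   # upper-case the bases
        $line =~ tr/GCTA/CGAU/;   # complement and transcribe to RNA
        for my $codon ($line =~ /(...)/g) {
            $out .= $proteins{$codon} if defined $proteins{$codon};
        }
    }
    return $out;
}

# Demonstration on an in-memory handle holding the sample from the question.
my $sample = "TCATAATACGTTTTGTATTCGCCAGCGCTTCGGTGT\n";
open my $fh, '<', \$sample or die "open: $!";
print translate_fh($fh), "\n";   # prints SIMQNISGREAT
close $fh;

Passing \*STDIN instead of the in-memory handle would let the same function read from a shell pipeline.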

简单爱 2024-11-01 19:12:51

Somebody (@kamaci) called my name in another thread. This is the best I can come up with while keeping the protein table on the command line:

perl -nE'say+map+substr("FYVDINLHL%VEMKLQL%VEIKLQFYVDINLHCSGASTRPWSGARTRP%SGARTRPCSGASTR",(s/GGG/GGC/i,vec($_,0,32)&101058048)%63,1),/.../g' dna.txt

(The quoting is for a Unix shell; on Windows, swap the ' and " characters.) This version marks invalid codons with %; you can probably fix that by adding =~y/%//d at an appropriate spot.

Hint: this picks out 6 bits from the raw ASCII encoding of an RNA triple, giving 64 codes between 0 and 101058048. To get a string index, I reduce the result modulo 63, but this creates one double mapping, which regrettably had to code for two different proteins. The s/GGG/GGC/i maps one of them to another triple that codes for the right protein.

Also note the parentheses before the % operator, which both isolate the , operator from the argument list of substr and fix the precedence of & vs %. If you ever use that in production code, you're a bad, bad person.
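
The index arithmetic in the hint can be checked with a short sketch. It is not the one-liner itself: the names $MASK and codon_index are made up for this example, and unpack 'N' on a zero-padded triple is swapped in for vec($_,0,32), on the assumption that both read the same 32-bit big-endian value. The script enumerates all 64 input triples and reports which ones share an index.

use strict;
use warnings;

my $MASK = 101058048;   # 0x06060600: bits 1 and 2 of each of the first three bytes

# Hypothetical helper mirroring the index expression of the one-liner.
# The triple is padded to four bytes so that unpack 'N' can read it
# as a single 32-bit big-endian integer.
sub codon_index {
    my ($triple) = @_;
    return (unpack('N', $triple . "\0") & $MASK) % 63;
}

# Group all 64 triples by index to expose any double mapping.
my %by_index;
for my $x (qw(A C G T)) {
    for my $y (qw(A C G T)) {
        for my $z (qw(A C G T)) {
            push @{ $by_index{ codon_index("$x$y$z") } }, "$x$y$z";
        }
    }
}
my @collisions = grep { @{ $by_index{$_} } > 1 } sort { $a <=> $b } keys %by_index;
print "index $_: @{ $by_index{$_} }\n" for @collisions;   # prints: index 0: AAA GGG

The only collision is between AAA and GGG, which is the double mapping that s/GGG/GGC/i works around. T and U differ only in bit 0, which the mask discards, so they produce the same index.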

梦里南柯 2024-11-01 19:12:51
#!/usr/bin/perl
%p=qw/UUU F UUC F UUA L UUG L UCU S UCC S UCA S UCG S UAU Y UAC Y UGU C UGC C UGG W
CUU L CUC L CUA L CUG L CCU P CCC P CCA P CCG P CAU H CAC H CAA Q CAG Q CGU R CGC R CGA R CGG R
AUU I AUC I AUA I AUG M ACU T ACC T ACA T ACG T AAU N AAC N AAA K AAG K AGU S AGC S AGA R AGG R
GUU V GUC V GUA V GUG V GCU A GCC A GCA A GCG A GAU D GAC D GAA E GAG E GGU G GGC G GGA G GGG G/;
$_=uc<DATA>;y/GCTA/CGAU/;map{print if$_=$p{$_}}/(...)/g
__DATA__
TCATAATACGTTTTGTATTCGCCAGCGCTTCGGTGT

Phew. Best I can come up with, at least this quickly. If you're sure the input is always already in uppercase, you can also drop the uc, saving another two characters. Or, if the input is always the same, you could assign it to $_ straight away instead of reading it from anywhere.

I guess I don't need to say that this code should not be used in production environments or anywhere else other than pure fun. When doing actual programming, readability almost always wins over compactness.

A few other versions I mentioned in the comments:

Reading %p and the DNA from files:

#!/usr/bin/perl
open A,"<p.txt";map{map{/(...)/;$p{$1}=chop}/(... .)/g}<A>;
open B,"<dna.txt";$_=uc<B>;y/GCTA/CGAU/;map{print if$_=$p{$_}}/(...)/g

From shell with perl -e:

perl -e 'open A,"<p.txt";map{map{/(...)/;$p{$1}=chop}/(... .)/g}<A>;open B,"<dna.txt";$_=uc<B>;y/GCTA/CGAU/;map{print if$_=$p{$_}}/(...)/g'
绿光 2024-11-01 19:12:51

Most things have already been pointed out, especially that readability matters. I wouldn't try to reduce the program more than what follows.

use strict;
use warnings;
# http://stackoverflow.com/questions/5402405/
my $fnprot = shift || 'proteins.txt';
my $fndna  = shift || 'dna.txt';
# build protein table
open my $fhprot, '<', $fnprot or die "open $fnprot: $!";
my %proteins = split /\s+/, do { local $/; <$fhprot> };
close $fhprot;
# process dna data
my @result;
open my $fhdna, '<', $fndna or die "open $fndna: $!";
while (<$fhdna>) {
    tr/acgt/ACGT/;
    tr/GCTA/CGAU/;
    push @result, map $proteins{$_}, grep defined $proteins{$_}, m/(...)/g;
}
close $fhdna;
# check correctness of result (given input as per original post)
my $expected = 'SIMQNISGREAT';
my $got = join '', @result;
die "@result is not expected" if $got ne $expected;
print "@result - $got\n";

The only "one-liner" thing I added is the push map grep m//g in the while loop. Note that Perl 5.10 adds the "defined or" operator - // - which allows you to write:

push @result, map $proteins{$_} // (), m/(...)/g;

Ah okay, the open do local $/ file slurp idiom is handy for slurping small files into memory. Hope you find it a bit inspiring. :-)
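
As a small illustration of the two idioms mentioned here, the sketch below slurps a table from an in-memory handle and uses // () to drop unknown codons. The four-entry table and the RNA string are made-up stand-ins, not the real proteins.txt, and the // operator assumes Perl 5.10 or later.

use strict;
use warnings;

# Stand-in for the contents of proteins.txt (hypothetical, four entries only).
my $table_text = "AGU S AUU I\nAUG M CAA Q\n";

# Slurp the whole "file" in one read by locally undefining the record separator.
open my $fh, '<', \$table_text or die "open: $!";
my %proteins = split ' ', do { local $/; <$fh> };
close $fh;

# Codons missing from the table yield an empty list and simply vanish.
my @result = map { $proteins{$_} // () } 'AGUAUUXXXAUGCAA' =~ /(...)/g;
print join('', @result), "\n";   # prints SIMQ

The XXX triple has no entry in the table, so it contributes nothing to @result and no separate grep is needed.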

情定在深秋 2024-11-01 19:12:51

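If you write the protein data to a separate file, keep it space-delimited on a single line, with no line breaks. That way you can import the whole table with a single read.

#!/usr/bin/perl
use strict;
use warnings;

open(INPUT, "<mydata.txt") or die "open mydata.txt: $!";
open(DATA, "<proteins.txt") or die "open proteins.txt: $!";
my %proteins = split(" ", <DATA>);   # one line: codon, amino acid, codon, ...

while (<INPUT>) {
    tr/GCTAgcta/CGAUCGAU/;   # complement and transcribe, folding lowercase to uppercase
    while (/(\w{3})/g) { print $proteins{$1} if exists $proteins{$1} }
}
close(INPUT);
close(DATA);

The separate tr/[a,c,g,t]/[A,C,G,T]/ line from the question is gone. Note that the i option on the match would not make the lookup case-insensitive, because \w{3} already matches letters of either case and the hash keys are uppercase, so the lowercase bases are folded into the single tr instead. The original foreach loop is replaced by a while loop over the match; $1 holds the text captured by the parentheses in /(\w{3})/g.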
