How can I merge two FASTA files in Perl (one file has line breaks)?
I have the following two FASTA files:
file1.fasta
>0
GAATAGATGTTTCAAATGTACCAATTTCTTTCGATT
>1
GTTAAGTTATATCAAACTAAATATACATACTATAAA
>2
GGGGCTGTGGATAAAGATAATTCCGGGTTCGAATAC
file2.qual
>0
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
40 40 40 40 40 40 40 40 15 40 40
>1
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 40 40 40
40 40 40 40 40 40 40 40 40 40 40
>2
40 40 40 40 7 40 40 5 40 40 40 40 40 40 40 40 37 13 31 20 15 40 10 11 4
40 8 3 29 10 19 18 40 19 15 5
Note the line break inside each entry of the "qual" file. Each entry starts with a FASTA header, marked with ">".
The number of headers ('>') is the same in both files, and the number of numerical quality values equals the sequence length.
What I want to do is join these two files, yielding:
GAATAGATGTTTCAAATGTACCAATTTCTTTCGATT 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 15 40 40
GTTAAGTTATATCAAACTAAATATACATACTATAAA 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40
GGGGCTGTGGATAAAGATAATTCCGGGTTCGAATAC 40 40 40 40 7 40 40 5 40 40 40 40 40 40 40 40 37 13 31 20 15 40 10 11 4 40 8 3 29 10 19 18 40 19 15 5
But somehow my code below fails to do this correctly. In particular, the second line of each entry in the 'qual' file doesn't get printed.
use strict;
use Data::Dumper;
use Carp;
use File::Basename;

my $fastafile = $ARGV[0] || "reads/2039F.2.fasta";
my $base      = basename( $fastafile, ".fasta" );
my $qualfile  = "reads/" . $base . ".qual";
print "$qualfile\n";

open SEQ, '<', $fastafile or die $!;    # Seq
open PRB, '<', $qualfile  or die $!;    # quality

while ( my $seq = <SEQ> ) {
    my $qual = <PRB>;
    chomp($seq);
    chomp($qual);
    if ( $seq =~ /^>/ || $qual =~ /^>/ ) {
        next;
    }
    else {
        print "$seq\t$qual\n";
    }
}
What's the correct way to do it?
Comments (3)
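The problem is that you are advancing through both files in parallel, so when a line is a ">" header in one file, the line at the same position in the other file might not be.
You are reading the data in pairs: each pass through the loop takes one line from each file, so line N of the FASTA file is always matched with line N of the qual file.
Applied to the sample data, your looping rules pair the second quality line of entry 0 with the ">1" header, and the two files stay out of step from there on.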
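To make the misalignment concrete, here is a small sketch that replays the question's loop rules over the layout of the sample files. The helper name `trace_lockstep` and the placeholder strings (`seq0`, `q0 line 1`, and so on, standing in for the real sample lines) are assumptions made for illustration. It needs Perl 5.10 or later for the `//` operator.

```perl
use strict;
use warnings;

# Hypothetical helper: replay the question's loop over two lists of lines
# and record what happens to each pair.
sub trace_lockstep {
    my ($seq_lines, $qual_lines) = @_;
    my @trace;
    for my $i (0 .. $#$seq_lines) {           # the loop ends when SEQ runs out
        my $seq  = $seq_lines->[$i];
        my $qual = $qual_lines->[$i] // '';    # <PRB> would return undef at EOF
        my $action = ($seq =~ /^>/ || $qual =~ /^>/) ? 'skip' : 'print';
        push @trace, "$seq | $qual => $action";
    }
    return @trace;
}

# Placeholders with the same line layout as the sample files.
my @seq_lines  = ('>0', 'seq0', '>1', 'seq1', '>2', 'seq2');
my @qual_lines = ('>0', 'q0 line 1', 'q0 line 2',
                  '>1', 'q1 line 1', 'q1 line 2',
                  '>2', 'q2 line 1', 'q2 line 2');

print "$_\n" for trace_lockstep(\@seq_lines, \@qual_lines);
```

In this trace only two pairs reach the print branch, and the second one combines the sequence of entry 2 with the second quality line of entry 1.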
So you need to either separate the looping logic out or find a way to make the files match.
Here is an attempt at separating the seeking, but I haven't tested it.
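One possible way to separate the looping logic is to read each file on its own and only pair things up afterwards. This is a sketch, not the code from this answer. The helper name `read_records`, the in-memory filehandles, and the shortened sample data are all assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return one joined string per ">" record read from $fh.
# $sep is placed between the lines that belong to the same record.
sub read_records {
    my ($fh, $sep) = @_;
    my @records;
    while (my $line = <$fh>) {
        $line =~ s/\s+$//;                 # drop newline and trailing blanks
        if ($line =~ /^>/) {
            push @records, '';             # a header starts a new record
        }
        elsif (@records && length $line) {
            $records[-1] .= ($records[-1] eq '' ? '' : $sep) . $line;
        }
    }
    return @records;
}

# Shortened stand-ins for file1.fasta and file2.qual.
my $fasta_text = ">0\nGAAT\n>1\nGTTA\n";
my $qual_text  = ">0\n40 40\n40 15\n>1\n40 20\n40 40\n";

open my $seq_fh, '<', \$fasta_text or die "cannot open fasta data: $!";
open my $prb_fh, '<', \$qual_text  or die "cannot open qual data: $!";

my @seqs  = read_records($seq_fh, '');     # sequence lines join without a gap
my @quals = read_records($prb_fh, ' ');    # quality lines join with a space

print "$seqs[$_]\t$quals[$_]\n" for 0 .. $#seqs;
```

This version holds both files in memory, which is fine for small inputs but may not suit very large read sets.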
Update
I refactored the above code into a single function that reads a chunk from an arbitrary file handle as needed, and it seems to work. Note that I experimented a little here with a trick I've been meaning to use for something practical.
And the above code, tested, does exactly what you want to do.
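A function of that kind could look roughly like the following. This is a sketch under stated assumptions rather than the answer's original code: the name `read_chunk`, its three-argument signature, and the use of `tell`/`seek` to step back when the next header is reached are all choices made here, and the `seek` approach assumes seekable input (regular files or in-memory handles, not pipes).

```perl
use strict;
use warnings;

# Hypothetical helper: read one ">" record from $fh and store its joined
# lines in the scalar referenced by $out. Returns 1 on success, 0 at EOF.
sub read_chunk {
    my ($fh, $out, $sep) = @_;
    my @parts;
    my $seen_header = 0;
    while (1) {
        my $pos  = tell $fh;               # remember where this line starts
        my $line = <$fh>;
        last unless defined $line;
        $line =~ s/\s+$//;
        if ($line =~ /^>/) {
            if ($seen_header) {            # next record begins: step back, stop
                seek $fh, $pos, 0;
                last;
            }
            $seen_header = 1;
            next;
        }
        push @parts, $line if $seen_header && length $line;
    }
    return 0 unless $seen_header;
    $$out = join $sep, @parts;
    return 1;
}

# Shortened stand-ins for the two sample files.
my $fasta_text = ">0\nGAAT\n>1\nGTTA\n";
my $qual_text  = ">0\n40 40\n40 15\n>1\n40 20\n40 40\n";
open my $seq_fh, '<', \$fasta_text or die "cannot open fasta data: $!";
open my $prb_fh, '<', \$qual_text  or die "cannot open qual data: $!";

# Each handle advances by one whole record per iteration.
while (read_chunk($seq_fh, \my $seq, '') && read_chunk($prb_fh, \my $qual, ' ')) {
    print "$seq\t$qual\n";
}
```

Because each call consumes exactly one record, the two files stay aligned no matter how many lines a record spans.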
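A note on the \my construct: passing \my $var in a call (with $var standing for any variable name) is fundamentally the same as declaring my $var first and then passing \$var, except that the former creates a new scalar every time, guaranteeing that the same value won't be visible to a successive loop iteration. So it behaves more like a variable declared inside the loop body.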
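The two spellings can be compared side by side in a few lines. The helper `fill` and the variable names are hypothetical, and the sketch only shows that both forms hand the callee a reference it can write through; it does not try to demonstrate the per-iteration behaviour described above.

```perl
use strict;
use warnings;

# Hypothetical helper: store a value through a scalar reference.
sub fill {
    my ($ref, $value) = @_;
    $$ref = $value;
    return 1;
}

# Form 1: declare the variable and take a reference to it in one expression.
fill(\my $inline, 'a');

# Form 2: declare first, then pass a reference to the existing variable.
my $declared;
fill(\$declared, 'b');

print "$inline $declared\n";
```

Note that a variable declared with `\my` only becomes usable by name in the statements after the one that declares it, as with any other `my` declaration.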
Here is a solution not using perl, but plain shell commands:
I searched many years for the paste command (knowing "this is a super basic operation, someone must already have implemented something to solve this problem").
The second command line first translates all newlines to spaces. The echo command adds a final newline to the input, because sed will ignore lines lacking EOL. This joins all the input lines into one single line, which the sed command then splits up again. Portability note: not all sed programs will work with arbitrary line lengths, but GNU sed does.
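For readers who would rather stay in Perl, the same join-then-split idea can be approximated as follows. This is not the shell pipeline from this answer, only a sketch of the technique it describes. The helper name `join_then_split` and the shortened sample data are assumptions, and the code assumes each header is a single token with no spaces, as in the sample files.

```perl
use strict;
use warnings;

# Hypothetical helper: flatten a file's text into one line, then split it
# back into one string per ">" record.
sub join_then_split {
    my ($text, $sep) = @_;
    (my $flat = $text) =~ tr/\n/ /;           # every newline becomes a space
    my @records;
    for my $chunk (split />/, $flat) {
        my ($header, @fields) = split ' ', $chunk;   # first token is the header
        next unless defined $header;          # skip the empty chunk before the first ">"
        push @records, join $sep, @fields;
    }
    return @records;
}

# Shortened stand-ins for the two sample files.
my @seqs  = join_then_split(">0\nGAAT\n>1\nGTTA\n", '');
my @quals = join_then_split(">0\n40 40\n40 15\n>1\n40 20\n40 40\n", ' ');

# Side-by-side output, the job paste does in the shell version.
print "$seqs[$_]\t$quals[$_]\n" for 0 .. $#seqs;
```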
You're missing the 2nd (and every subsequent) line of the quality scores and would also miss additional sequence lines. For this and code re-use purposes, the way to handle FASTA sequences is as whole entries/records:
You could also easily capture the FASTA header in the first replace.
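A record-at-a-time reader along these lines might look like the sketch below. It is not necessarily the code this answer originally showed. The helper name `read_entries`, the choice of `"\n>"` as the input record separator, and the shortened sample data are assumptions. The first substitution both removes and captures the header line, and the captured header is then used to check that the two files agree.

```perl
use strict;
use warnings;

# Hypothetical helper: read whole ">" entries from $fh and return a list of
# [header, body] pairs, with the body's lines joined by $sep.
sub read_entries {
    my ($fh, $sep) = @_;
    local $/ = "\n>";                     # an entry ends where the next header starts
    my @entries;
    while (my $entry = <$fh>) {
        chomp $entry;                     # strips a trailing "\n>" if present
        $entry =~ s/^>?([^\n]*)\n?//;     # first replace: drop and capture the header
        my $header = $1;
        $entry =~ s/\s+$//;               # trailing newline of the last entry
        $entry =~ s/\s*\n\s*/$sep/g;      # second replace: join the remaining lines
        push @entries, [$header, $entry];
    }
    return @entries;
}

# Shortened stand-ins for the two sample files.
my $fasta_text = ">0\nGAAT\n>1\nGTTA\n";
my $qual_text  = ">0\n40 40\n40 15\n>1\n40 20\n40 40\n";
open my $seq_fh, '<', \$fasta_text or die "cannot open fasta data: $!";
open my $prb_fh, '<', \$qual_text  or die "cannot open qual data: $!";

my @seqs  = read_entries($seq_fh, '');
my @quals = read_entries($prb_fh, ' ');
die "entry counts differ\n" unless @seqs == @quals;

for my $i (0 .. $#seqs) {
    my ($seq_id,  $seq)  = @{ $seqs[$i] };
    my ($qual_id, $qual) = @{ $quals[$i] };
    die "header mismatch: $seq_id vs $qual_id\n" unless $seq_id eq $qual_id;
    print "$seq\t$qual\n";
}
```

Treating each entry as one record means extra sequence lines and extra quality lines are handled the same way, which is the reuse benefit the answer points to.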