How to compare two text files by column and output the number of column matches
I have two tab-delimited genome sequence files (SAM format), and I would like to compare them to see how many times certain sequencing reads (which comprise a single line) are present in each.
Here is an example of the input file format:
HWI-AT555:86:D0:6:2208:13551:55125 122 chr1 77028 255 94M555N7M * 0 0 GTGCCTTCCAATTTTGTGAGTGGAGNACAAGTTCGCTAAAGCTAATGAATGATCTACCACCATGATTGAGTGTCTGAGTCGAATCAAGTGAATTGCTGTTAG &&&(((((*****++++++++++++!!&)*++++)+++++++++++++++++++++++++*++++++++*****((((((''''''&&&&'''&&&&&&&& NM:i:3 XS:A:+ NH:i:1
The important part is the sequence read ID, which is the first column (i.e. HWI-....55125). This is what I want to use to compare the two files, so that I can count the number of duplicates/copies.
Here is what I have so far:
unless (@ARGV == 2) {
    print "Use as follows: perl program.pl in1.file in2.file\n";
    die;
}
my $in1 = $ARGV[0];
my $in2 = $ARGV[1];
open ONE, $in1;
open TWO, $in2;
my %hash1;
my @hit;
while (<ONE>){
    chomp;
    my @hit = split(/\t/, $_);
    $hash1{$hit[0]}=1;
}
close ONE;
my @col;
while (<TWO>){
    chomp;
    my @col = split(/\t/, $_);
    if ($col[0] =~ /^H/){ #only valid sequence read lines start with "H"
        print "$col[0]\n" if defined($hash1{$_});
    }
}
close TWO;
So far it looks for a match in hash1 while going through the second file line by line and prints out any matches. What I would like it to do is count how many times it finds a match and then print out the number of times that happens for each sequence id and a total number of matches.
I am new to programming and I am quite stuck with how I can keep a count when there are matches while going through a loop. Any help would be appreciated. Let me know if I didn't make something clear enough.
Comments (1)
Initialize your %hash1 with zeros instead of ones, i.e. store 0 rather than 1 for each read ID in the first loop. Then, in your second loop, you can increment $hash1{$col[0]} every time that ID turns up again.
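A minimal sketch of those two loops, assuming the file names come from @ARGV as in the question. The sub names load_ids and count_matches are made up for illustration, and skipping lines that start with @ (SAM headers) is an added assumption, not something the question's code does.

```perl
use strict;
use warnings;

# First loop: register every read ID (column 1) of file one with a count of 0.
sub load_ids {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!\n";
    my %count;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^\@/;            # assumption: skip SAM header lines
        my ($id) = split /\t/, $line, 2;   # only the first field matters
        next unless defined $id;           # blank line
        $count{$id} = 0;
    }
    close $fh;
    return \%count;
}

# Second loop: bump the counter for every ID that file one already registered.
sub count_matches {
    my ($count, $path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!\n";
    while (my $line = <$fh>) {
        chomp $line;
        my ($id) = split /\t/, $line, 2;
        next unless defined $id;           # blank line
        $count->{$id}++ if exists $count->{$id};
    }
    close $fh;
    return $count;
}

# Runs only when arguments are given, so the subs can also be loaded on their own.
if (@ARGV) {
    die "Use as follows: perl program.pl in1.file in2.file\n" unless @ARGV == 2;
    my $count = count_matches(load_ids($ARGV[0]), $ARGV[1]);
    my $total = 0;
    $total += $_ for values %$count;
    print "Total matches: $total\n";
}
```

The exists test keeps IDs that only occur in the second file out of the hash, so every key still corresponds to a read from the first file.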
There's no need to check $col[0] =~ /^H/, since %hash1 will only have entries for valid sequences, so you can just do an exists check on the hash. (That holds as long as the first file contains nothing but read lines; if it also carries SAM header lines, which start with @, their first fields land in %hash1 too and you would want to skip them in the first loop.) And you want to look at $hash1{$col[0]} rather than $hash1{$_}: you're only storing the first part of each line in your first loop, whereas $_ holds the whole line, so a lookup keyed on $_ will not find the stored IDs. Furthermore, if you're just grabbing the first field of each line you don't need the chomp calls, but they do no harm, so you can keep them if you want.
This leaves you with all the repeated entries in %hash1 as entries with non-zero values. You can grep those out and then display them with their counts.
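One way the grep-and-display step could look, as a sketch rather than the only way to do it. The subs matched_ids and report_lines and the read_1/read_2/read_3 counts are hypothetical stand-ins for whatever %hash1 holds after the second loop.

```perl
use strict;
use warnings;

# Return the IDs whose count is non-zero, sorted so the output order is stable.
sub matched_ids {
    my ($count) = @_;
    my @ids    = grep { $count->{$_} > 0 } keys %$count;
    my @sorted = sort @ids;
    return @sorted;
}

# Build one "ID<TAB>count" line per matched ID, followed by a total line.
sub report_lines {
    my ($count) = @_;
    my @lines;
    my $total = 0;
    for my $id (matched_ids($count)) {
        push @lines, "$id\t$count->{$id}";
        $total += $count->{$id};
    }
    push @lines, "Total matches: $total";
    return @lines;
}

# Hypothetical counts, standing in for %hash1 after the second loop.
my %hash1 = (read_1 => 2, read_2 => 0, read_3 => 1);
print "$_\n" for report_lines(\%hash1);
```

With the sample counts above this prints read_1 with 2, read_3 with 1, and a total of 3; read_2 is dropped because its count stayed at zero.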
You could also check the counts while displaying the matches, skipping any ID whose count is still zero.
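A sketch of that single-pass variant, with no separate grep: the loop simply skips zero counts as it prints. The sub name print_matches, its optional filehandle argument, and the sample counts are assumptions made for illustration.

```perl
use strict;
use warnings;

# Print "ID<TAB>count" for every ID with a non-zero count, then the total.
# Returns the total so the caller can reuse it.
sub print_matches {
    my ($count, $out) = @_;
    $out ||= \*STDOUT;                 # default to standard output
    my $total = 0;
    for my $id (sort keys %$count) {
        next unless $count->{$id};     # zero: never seen in the second file
        print {$out} "$id\t$count->{$id}\n";
        $total += $count->{$id};
    }
    print {$out} "Total matches: $total\n";
    return $total;
}

# Hypothetical counts, standing in for %hash1 after the second loop.
my %hash1 = (read_1 => 2, read_2 => 0, read_3 => 1);
print_matches(\%hash1);
```

This saves building an intermediate list, at the cost of mixing the filtering into the output loop.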