How to compare two text files by column and output the number of column matches

Posted on 2024-12-16 13:22:39

I have two tab-delimited genome sequence files (SAM format), and I would like to compare them to see how many times certain sequencing reads (each of which occupies a single line) are present in each.

Here is an example of input file format:

HWI-AT555:86:D0:6:2208:13551:55125       122     chr1    77028   255     94M555N7M       *       0       0       GTGCCTTCCAATTTTGTGAGTGGAGNACAAGTTCGCTAAAGCTAATGAATGATCTACCACCATGATTGAGTGTCTGAGTCGAATCAAGTGAATTGCTGTTAG   &&&(((((*****++++++++++++!!&)*++++)+++++++++++++++++++++++++*++++++++*****((((((''''''&&&&'''&&&&&&&&   NM:i:3  XS:A:+    NH:i:1

The important part is the sequence read ID, which is the first column (i.e. HWI-....55125). This is what I want to use to compare the two files so that I can count the number of duplicates/copies.

Here is what I have so far:

unless (@ARGV == 2) {
    print "Use as follows: perl program.pl in1.file in2.file\n";
    die;
}

my $in1 = $ARGV[0];
my $in2 = $ARGV[1];

open ONE, $in1;
open TWO, $in2;

my %hash1;
my @hit;    

while (<ONE>){
    chomp;
    my @hit = split(/\t/, $_);
    $hash1{$hit[0]}=1;
}
close ONE;

my @col;

while (<TWO>){
    chomp;
    my @col = split(/\t/, $_);
    if ($col[0] =~ /^H/){  #only valid sequence read lines start with "H"
        print "$col[0]\n" if defined($hash1{$_});   

    }
}
close TWO;

So far it looks for a match in hash1 while going through the second file line by line and prints out any matches. What I would like it to do is count how many times it finds a match and then print out the number of times that happens for each sequence id and a total number of matches.

I am new to programming and I am quite stuck with how I can keep a count when there are matches while going through a loop. Any help would be appreciated. Let me know if I didn't make something clear enough.


Comments (1)

天煞孤星 2024-12-23 13:22:39

Initialize your %hash1 with zeros instead of ones:

while (<ONE>){
    chomp;
    my @hit = split(/\t/, $_);
    # Start them as "0" for "no duplicates".
    $hash1{$hit[0]} = 0;
}

Then, in your second loop, you can increment $hash1{$col[0]}:

while (<TWO>){
    chomp;
    my @col = split(/\t/, $_);
    # Increment the counter if %hash1 has what we're looking for.
    ++$hash1{$col[0]} if(exists($hash1{$col[0]}));
}

There's no need to check $col[0] =~ /^H/: %hash1 only has entries for IDs read from the first file, so an exists check on the hash is enough (this assumes the first file holds only valid read lines; otherwise apply the /^H/ filter while filling the hash). You also want to look at $hash1{$col[0]} rather than $hash1{$_}: the first loop stores only the first field of each line as a key, whereas $_ holds the whole line. Finally, if you're only grabbing the first field of each line you don't need the chomp calls, but they do no harm, so you can keep them if you want.
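
To make the $hash1{$_} versus $hash1{$col[0]} point concrete, here is a minimal sketch. The sample line is a shortened, hypothetical record (only the first three fields of the example read), and %seen stands in for %hash1.

```perl
use strict;
use warnings;

# Hypothetical sample: just the first three tab-separated fields of a read line.
my $line = "HWI-AT555:86:D0:6:2208:13551:55125\t122\tchr1";

# Stand-in for %hash1: keyed by read ID, counter starts at 0.
my %seen = ('HWI-AT555:86:D0:6:2208:13551:55125' => 0);

# Take only the first field, as the loops in the answer do via $col[0].
my ($id) = split /\t/, $line;

# The whole line is never a key in the hash; the read ID is.
my $whole_line_found = exists $seen{$line} ? 1 : 0;
my $id_found         = exists $seen{$id}   ? 1 : 0;

print "whole line: $whole_line_found, id: $id_found\n";
# prints "whole line: 0, id: 1"
```

The lookup with the whole line fails silently, which is why the original script printed nothing useful.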

This leaves you with all the repeated entries in %hash1 as entries with non-zero values, and you can grep those out:

my @dups = grep { $hash1{$_} > 0 } keys %hash1;

And then display them with their counts:

for my $k (sort @dups) {
    print "$k\t$hash1{$k}\n";
}

You could also check the counts while displaying the matches:

for my $k (sort keys %hash1) {
    print "$k\t$hash1{$k}\n" if($hash1{$k} > 0);
}
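
The question also asks for a total number of matches, which the snippets above do not print. One way to put the pieces together is sketched below; it is not the only way to do it. The names read_id and count_matches are made up for this illustration, the three-argument open with lexical filehandles and error checks replaces the bareword handles from the question, and skipping lines that start with "@" assumes SAM header lines should be ignored.

```perl
use strict;
use warnings;
use List::Util qw(sum0);

# Return the first tab-separated field of a line, or nothing for empty
# lines and SAM header lines (assumed here to start with "@").
sub read_id {
    my ($line) = @_;
    chomp $line;
    return if $line eq '' || $line =~ /^\@/;
    my ($id) = split /\t/, $line;
    return $id;
}

# Count how often each read ID seen on the first handle occurs on the second.
sub count_matches {
    my ($fh1, $fh2) = @_;
    my %count;
    while (my $line = <$fh1>) {
        my $id = read_id($line);
        $count{$id} = 0 if defined $id;    # 0 means "no match yet"
    }
    while (my $line = <$fh2>) {
        my $id = read_id($line);
        ++$count{$id} if defined $id && exists $count{$id};
    }
    return \%count;
}

# Only run the command-line part when two file names are given.
if (@ARGV == 2) {
    open my $one, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    open my $two, '<', $ARGV[1] or die "Cannot open $ARGV[1]: $!";
    my $count = count_matches($one, $two);
    close $one;
    close $two;

    # Per-ID counts first, then the overall total.
    for my $id (sort keys %$count) {
        print "$id\t$count->{$id}\n" if $count->{$id} > 0;
    }
    print "Total matches: ", sum0(values %$count), "\n";
}
```

Run as perl program.pl in1.file in2.file, this prints one line per matched read ID followed by the total. sum0 returns 0 for an empty list, so the total is still printed when nothing matches.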