Need to compare repeated values in the first column of a file

Posted on 2024-09-14 23:21:07

So my data sample is in the following format.

jgi|Xentr4|100164|gw1.1441.2.1  scaffold_1441   18150   18354
jgi|Xentr4|100164|gw1.1441.2.1  scaffold_1441   19856   19974
jgi|Xentr4|100164|gw1.1441.2.1  scaffold_1441   21455   21638
jgi|Xentr4|100164|gw1.1441.2.1  scaffold_1441   21727   21897
jgi|Xentr4|100164|gw1.1441.2.1  scaffold_1441   21980   22063
jgi|Xentr4|100164|gw1.1441.2.1  scaffold_1441   24670   24811
jgi|Xentr4|100164|gw1.1441.2.1  scaffold_1441   34741   34902
jgi|Xentr4|100164|gw1.1441.2.1  scaffold_1441   3649    3836
jgi|Xentr4|100164|gw1.1441.2.1  scaffold_1441   59253   59409
jgi|Xentr4|100173|gw1.779.90.1  scaffold_779    101746  101969
jgi|Xentr4|100173|gw1.779.90.1  scaffold_779    106436  107233

What I am attempting to do is, for each unique name in the first column, retrieve the minimum value of column 3 and the maximum value of column 4. So the final output will look the same, a tab-delimited file, except that it will have the first two columns for each unique name, and the third and fourth columns will be the min and max values mentioned above. I'm fairly new to programming and attempted to do this using hashes but failed miserably. I am now trying with arrays and regular expressions, as seen below.

open (IN, "POS2") || die "nope\n";
my $prev_qn = super;
my $prev_sn = ultra;
my $prev_start = non;
my $prev_end = nono;
while (<IN>) {
    chomp;
    push (@list, "$_");
}
close (IN);
foreach $v (@list) {
    $info = $v;
    ($query_name, $scaf_num, $start, $end) = split(/\t/, $info);
    unless ($info =~ m/^$prev_qn/) {
        push @ready, $info;
        $prev_qn = $query_name;
        $prev_sn = $scaf_num;
        $prev_start = $start;
        $prev_end = $end;
    }
    else {
        if ($start < $prev_start) {
            splice(@ready,2,1,$start);
        }
        if ($end > $prev_end) {
            splice(@ready,3,1,$end);
        }
        $prev_qn = $query_name;
        $prev_sn = $scaf_num;
        $prev_start = $start;
        $prev_end = $end;
    }

    foreach $z (@ready) {
        print "$z\n";
    }
}

The output this returns is below.

jgi|Xentr4|100164|gw1.1441.2.1  scaffold_1441   18150   18354
jgi|Xentr4|100164|gw1.1441.2.1  scaffold_1441   18150   18354
19974
jgi|Xentr4|100164|gw1.1441.2.1  scaffold_1441   18150   18354
19974
21638
jgi|Xentr4|100164|gw1.1441.2.1  scaffold_1441   18150   18354
19974
21638
21897
jgi|Xentr4|100164|gw1.1441.2.1  scaffold_1441   18150   18354
19974
21638
22063
jgi|Xentr4|100164|gw1.1441.2.1  scaffold_1441   18150   18354
19974
21638
24811
jgi|Xentr4|100164|gw1.1441.2.1  scaffold_1441   18150   18354
19974
21638
34902
jgi|Xentr4|100164|gw1.1441.2.1  scaffold_1441   18150   18354
19974
3649
34902
jgi|Xentr4|100164|gw1.1441.2.1  scaffold_1441   18150   18354
19974
3649
59409
jgi|Xentr4|100164|gw1.1441.2.1  scaffold_1441   18150   18354
19974
3649
101969

So it seems clear that the script is doing the comparison fine, but it is not replacing the elements in the array as expected; it simply appends the new values beneath the first line and then replaces those. Additionally, it never prints past the first unique name. Any suggestions?

剪不断理还乱 2024-09-21 23:21:07

Here's one way to do it. Just supply the input file name as a command-line argument. The <> operator will open the file and supply the lines to your script.

use strict;
use warnings;

my %h;

while (my $line = <>){
    chomp $line;
    my ($k, $scaff, $mn, $mx) = split /\t/, $line;

    $h{$k} = { min => 9e99, max => -9e99 } unless exists $h{$k};

    $h{$k}{min} = $mn if $mn < $h{$k}{min};
    $h{$k}{max} = $mx if $mx > $h{$k}{max};
}

for my $k (sort keys %h){
    print join("\t", $k, $h{$k}{min}, $h{$k}{max}), "\n";
}

I use a hash-of-hashes to store the min and max information, because it makes the code more declarative and because it's flexible. For example, suppose you decide that the output needs to preserve the order of the first appearance of any name from column 1. Just add another element to the hash-of-hashes structure to keep track of input line number whenever a name first appears:

$h{$k} = { min => 9e99, max => -9e99, line_n => $. } unless exists $h{$k};

Then use that new piece of info when sorting the output:

for my $k (sort { $h{$a}{line_n} <=> $h{$b}{line_n} } keys %h){
    # Same as above.
}
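
Putting the two pieces together, a minimal sketch of the complete order-preserving version might look like the following. It also carries the second column through to the output, which the question asks for and the snippet above leaves out. The subroutine name `summarize` and the `scaff` key are illustrative choices, not part of the original answer, and the `//=` operator assumes Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: takes a reference to an array of tab-separated lines
# and returns one summary line per unique name in column 1, in order of
# first appearance.
sub summarize {
    my ($lines) = @_;
    my %h;
    my $n = 0;

    for my $line (@$lines) {
        $n++;
        chomp(my $copy = $line);
        next unless length $copy;    # skip empty lines
        my ($k, $scaff, $mn, $mx) = split /\t/, $copy;

        # First time this name is seen: seed min/max from this row and
        # remember the line number of its first appearance.
        $h{$k} //= { scaff => $scaff, min => $mn, max => $mx, line_n => $n };

        $h{$k}{min} = $mn if $mn < $h{$k}{min};
        $h{$k}{max} = $mx if $mx > $h{$k}{max};
    }

    return map  { join "\t", $_, @{ $h{$_} }{qw(scaff min max)} }
           sort { $h{$a}{line_n} <=> $h{$b}{line_n} }
           keys %h;
}

# Run only when input files are named on the command line.
if (@ARGV) {
    my @lines = <>;
    print "$_\n" for summarize(\@lines);
}
```

Seeding `min` and `max` from the first row of each name, instead of from `9e99` and `-9e99`, is a small design change that avoids relying on sentinel values.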

夏日落 2024-09-21 23:21:07

Does this do what you are looking for?

use strict;
use warnings;

open(my $in, '<', "POS2") or die "Cannot open POS2: $!\n";
my %data;

# Read data line by line
while (<$in>)
{
    chomp;
    my @fields = split /\t/;

    # Note $fields[0] is the name by which we want to group.
    if (defined $data{$fields[0]})
    {
        # If there is already an entry for this name, update it
        $data{$fields[0]} = [
            $fields[1],
            $data{$fields[0]}[1] < $fields[2] ? $data{$fields[0]}[1] : $fields[2],
            $data{$fields[0]}[2] > $fields[3] ? $data{$fields[0]}[2] : $fields[3]
        ];
    }
    else
    {
        # Otherwise, create a new one
        $data{$fields[0]} = [ $fields[1], $fields[2], $fields[3] ];
    }
}
close($in);

# Output one row for each group
foreach my $name (keys %data)
{
    my ($stuff, $min, $max) = @{$data{$name}};
    print "$name\t$stuff\t$min\t$max\n";
}

I tried this and it outputs this:

jgi|Xentr4|100173|gw1.779.90.1  scaffold_779    101746  107233
jgi|Xentr4|100164|gw1.1441.2.1  scaffold_1441   3649    59409

Is that what you wanted?

无力看清 2024-09-21 23:21:07

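You can do the following:

use strict;
use warnings;
use FileHandle;

# Read the whole input file into memory.
my $file = FileHandle->new("input_file") or die "Cannot open input_file: $!\n";
my @array = <$file>;
close $file;

my %seen;
my @newarray;

# Keep only the first line seen for each name in column 1.
# Note that this de-duplicates rows; it does not compute the min/max.
foreach (@array) {
    my ($col1) = split /\s+/, $_;
    push(@newarray, $_) unless $seen{$col1}++;
}
print @newarray;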
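
The `%seen` loop above keeps only the first row for each name. If the minimum of column 3 and the maximum of column 4 are wanted as well, one option is to carry running values while reading. This is a minimal sketch, assuming that rows sharing a name are adjacent in the input (as they are in the sample data); `collapse_groups` is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper: collapses runs of adjacent rows sharing column 1
# into one row holding min(column 3) and max(column 4).
sub collapse_groups {
    my ($lines) = @_;
    my (@out, @cur);    # @cur = (name, scaffold, min, max) for the open group

    for my $line (@$lines) {
        # split ' ' splits on any whitespace and ignores the trailing newline.
        my ($name, $scaf, $start, $end) = split ' ', $line;
        next unless defined $end;    # skip blank or short lines

        if (@cur && $cur[0] eq $name) {
            # Same group: widen the range if needed.
            $cur[2] = $start if $start < $cur[2];
            $cur[3] = $end   if $end   > $cur[3];
        }
        else {
            # New group: emit the finished one, then start over.
            push @out, join("\t", @cur) if @cur;
            @cur = ($name, $scaf, $start, $end);
        }
    }
    push @out, join("\t", @cur) if @cur;    # emit the last group
    return @out;
}

# Run only when input files are named on the command line.
if (@ARGV) {
    my @lines = <>;
    print "$_\n" for collapse_groups(\@lines);
}
```

Because only the current group is held in memory, this works on large files, but if the same name can reappear later in the file it will be reported more than once; a hash keyed on the name avoids that.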
