Need to compare values in a file where the first column repeats
So my data sample is in the following format.
jgi|Xentr4|100164|gw1.1441.2.1 scaffold_1441 18150 18354
jgi|Xentr4|100164|gw1.1441.2.1 scaffold_1441 19856 19974
jgi|Xentr4|100164|gw1.1441.2.1 scaffold_1441 21455 21638
jgi|Xentr4|100164|gw1.1441.2.1 scaffold_1441 21727 21897
jgi|Xentr4|100164|gw1.1441.2.1 scaffold_1441 21980 22063
jgi|Xentr4|100164|gw1.1441.2.1 scaffold_1441 24670 24811
jgi|Xentr4|100164|gw1.1441.2.1 scaffold_1441 34741 34902
jgi|Xentr4|100164|gw1.1441.2.1 scaffold_1441 3649 3836
jgi|Xentr4|100164|gw1.1441.2.1 scaffold_1441 59253 59409
jgi|Xentr4|100173|gw1.779.90.1 scaffold_779 101746 101969
jgi|Xentr4|100173|gw1.779.90.1 scaffold_779 106436 107233
What I am attempting to do is, for each unique name in the first column, retrieve the minimum value of column 3 and the maximum value of column 4. The final output should look the same as the input, a tab-delimited file, except that it will have the first two columns for each unique name, followed by the min and max values mentioned above as the third and fourth columns. I'm fairly new to programming and attempted to do this using hashes but failed miserably. I am now trying with arrays and regular expressions, as seen below.
open (IN, "POS2") || die "nope\n";
my $prev_qn = super;
my $prev_sn = ultra;
my $prev_start = non;
my $prev_end = nono;
while (<IN>) {
    chomp;
    push (@list, "$_");
}
close (IN);
foreach $v (@list) {
    $info = $v;
    ($query_name, $scaf_num, $start, $end) = split(/\t/, $info);
    unless ($info =~ m/^$prev_qn/) {
        push @ready, $info;
        $prev_qn = $query_name;
        $prev_sn = $scaf_num;
        $prev_start = $start;
        $prev_end = $end;
    }
    else {
        if ($start < $prev_start) {
            splice(@ready,2,1,$start);
        }
        if ($end > $prev_end) {
            splice(@ready,3,1,$end);
        }
        $prev_qn = $query_name;
        $prev_sn = $scaf_num;
        $prev_start = $start;
        $prev_end = $end;
    }
    foreach $z (@ready) {
        print "$z\n";
    }
}
The output this returns is below.
jgi|Xentr4|100164|gw1.1441.2.1 scaffold_1441 18150 18354
jgi|Xentr4|100164|gw1.1441.2.1 scaffold_1441 18150 18354
19974
jgi|Xentr4|100164|gw1.1441.2.1 scaffold_1441 18150 18354
19974
21638
jgi|Xentr4|100164|gw1.1441.2.1 scaffold_1441 18150 18354
19974
21638
21897
jgi|Xentr4|100164|gw1.1441.2.1 scaffold_1441 18150 18354
19974
21638
22063
jgi|Xentr4|100164|gw1.1441.2.1 scaffold_1441 18150 18354
19974
21638
24811
jgi|Xentr4|100164|gw1.1441.2.1 scaffold_1441 18150 18354
19974
21638
34902
jgi|Xentr4|100164|gw1.1441.2.1 scaffold_1441 18150 18354
19974
3649
34902
jgi|Xentr4|100164|gw1.1441.2.1 scaffold_1441 18150 18354
19974
3649
59409
jgi|Xentr4|100164|gw1.1441.2.1 scaffold_1441 18150 18354
19974
3649
101969
So it seems clear that the file is doing the comparison fine, but it is not replacing the elements in the array as expected, simply appending them beneath and replacing those. Additionally it never prints past the first unique name. Any suggestions?
Comments (3)
Here's one way to do it. Just supply the input file name as a command-line argument. The <> operator will open the file and supply the lines to your script.

I use a hash-of-hashes to store the min and max information, because it makes the code more declarative and because it's flexible. For example, suppose you decide that the output needs to preserve the order of the first appearance of any name from column 1. Just add another element to the hash-of-hashes structure to keep track of the input line number whenever a name first appears:
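The code that originally accompanied this answer is not shown here, so the following is only a minimal sketch of the idea it describes, not the answerer's own script. The helper name collect_ranges and the hash keys scaffold, min, max, and first are assumptions made for illustration. The sample rows come from the question and are inlined so the sketch runs on its own; with a real file, the list passed to the helper could instead come from <>.

```perl
use strict;
use warnings;

# Hypothetical helper: fold tab-delimited lines into a hash-of-hashes
# keyed by the name in column 1.
sub collect_ranges {
    my @lines = @_;
    my %range;
    my $line_no = 0;
    for my $line (@lines) {
        $line_no++;
        chomp $line;
        next unless length $line;
        my ($name, $scaffold, $start, $end) = split /\t/, $line;
        if (!exists $range{$name}) {
            # First time this name is seen: record its values and
            # the input line number of that first appearance.
            $range{$name} = {
                scaffold => $scaffold,
                min      => $start,
                max      => $end,
                first    => $line_no,
            };
            next;
        }
        my $r = $range{$name};
        $r->{min} = $start if $start < $r->{min};   # smallest column 3
        $r->{max} = $end   if $end   > $r->{max};   # largest column 4
    }
    return \%range;
}

# A few rows from the question (tab-delimited). To read a file named on
# the command line instead, pass <> to collect_ranges.
my @sample = (
    "jgi|Xentr4|100164|gw1.1441.2.1\tscaffold_1441\t18150\t18354\n",
    "jgi|Xentr4|100164|gw1.1441.2.1\tscaffold_1441\t3649\t3836\n",
    "jgi|Xentr4|100164|gw1.1441.2.1\tscaffold_1441\t59253\t59409\n",
    "jgi|Xentr4|100173|gw1.779.90.1\tscaffold_779\t101746\t101969\n",
    "jgi|Xentr4|100173|gw1.779.90.1\tscaffold_779\t106436\t107233\n",
);

my $range = collect_ranges(@sample);
for my $name (sort keys %$range) {
    my $r = $range->{$name};
    print join("\t", $name, @{$r}{qw(scaffold min max)}), "\n";
}
```

Because each name maps to its own small hash, adding the first element did not require changing how min and max are tracked.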
Then use that new piece of info when sorting the output:
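As a rough, self-contained sketch of that sorting step (again an illustration under assumptions, not the answerer's original code): the helper name summarize_in_input_order and the key first are hypothetical, and the sample rows are reordered from the question so that input order differs from alphabetical order. It needs Perl 5.10 or later for the //= operator.

```perl
use strict;
use warnings;

# Hypothetical helper: return one summary line per name, ordered by
# where each name first appeared in the input.
sub summarize_in_input_order {
    my @lines = @_;
    my %range;
    my $line_no = 0;
    for my $line (@lines) {
        $line_no++;
        chomp $line;
        next unless length $line;
        my ($name, $scaffold, $start, $end) = split /\t/, $line;
        # Create the inner hash on first sight, otherwise reuse it.
        my $r = $range{$name} //= {
            scaffold => $scaffold,
            min      => $start,
            max      => $end,
            first    => $line_no,
        };
        $r->{min} = $start if $start < $r->{min};
        $r->{max} = $end   if $end   > $r->{max};
    }
    # Sort names numerically by the line number of first appearance.
    my @names = sort { $range{$a}{first} <=> $range{$b}{first} } keys %range;
    return map { join "\t", $_, @{ $range{$_} }{qw(scaffold min max)} } @names;
}

# Rows from the question, with the 100173 name deliberately placed first.
my @sample = (
    "jgi|Xentr4|100173|gw1.779.90.1\tscaffold_779\t101746\t101969\n",
    "jgi|Xentr4|100164|gw1.1441.2.1\tscaffold_1441\t18150\t18354\n",
    "jgi|Xentr4|100164|gw1.1441.2.1\tscaffold_1441\t3649\t3836\n",
    "jgi|Xentr4|100173|gw1.779.90.1\tscaffold_779\t106436\t107233\n",
);

print "$_\n" for summarize_in_input_order(@sample);
```

With this sample, the 100173 line is printed before the 100164 line, whereas a plain sort keys would put 100164 first.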
Does this do what you are looking for?
I tried this and it outputs this:
Is that what you wanted?
Can do the following: