Awk：有条件删除重复行

发布于 2024-12-07 09:36:08 字数 1010 浏览 2 评论 0原文

我有一个包含 8 列的制表符分隔文本文件：

Erythropoietin Receptor Integrin Beta 4 11.7    9.7 164 195 19  3.2
Erythropoietin Receptor Receptor Tyrosine Phosphatase F 10.8    2.6 97  107 15  3.2
Erythropoietin Receptor Leukemia Inhibitory Factor Receptor 12.0    3.6 171 479 14  3.2
Erythropoietin Receptor Immunoglobulin 9    10.4    3.1 100 108 24  3.3
Erythropoietin Receptor Collagen Alpha 1 Xx 10.7    2.7 93  105 18  3.3
Tumor Necrosis Factor Receptor  Tumor Necrosis Factor Receptor 5    11.4    3.2 114 114 25  1.7
Tumor Necrosis Factor Receptor  Tumor Necrosis Factor Receptor 14   11.1    2.1 99  100 28  1.8
Tumor Necrosis Factor Receptor  Tumor Necrosis Factor Receptor 1B   10.9    4.9 133 162 29  1.9
Tumor Necrosis Factor Receptor  Tumor Necrosis Factor Receptor 11A  11.5    5.1 130 166 25  1.9

第一列和第二列包含蛋白质名称，第八列包含每个蛋白质对之间的“距离”分数。我想删除包含重复蛋白质对的行，并仅保留距离最小的对（第 8 列中的最低值）。这意味着对于蛋白质 A-蛋白质对，BI 希望删除除距离分数最低的那一个之外的所有出现的情况。即使蛋白质名称交换（在不同的列中），该对也被视为重复。这意味着蛋白质 A 蛋白质 B 与蛋白质 B 蛋白质 A 相同。

原文

I have a tab-delimited text file with 8 columns:

Erythropoietin Receptor Integrin Beta 4 11.7    9.7 164 195 19  3.2
Erythropoietin Receptor Receptor Tyrosine Phosphatase F 10.8    2.6 97  107 15  3.2
Erythropoietin Receptor Leukemia Inhibitory Factor Receptor 12.0    3.6 171 479 14  3.2
Erythropoietin Receptor Immunoglobulin 9    10.4    3.1 100 108 24  3.3
Erythropoietin Receptor Collagen Alpha 1 Xx 10.7    2.7 93  105 18  3.3
Tumor Necrosis Factor Receptor  Tumor Necrosis Factor Receptor 5    11.4    3.2 114 114 25  1.7
Tumor Necrosis Factor Receptor  Tumor Necrosis Factor Receptor 14   11.1    2.1 99  100 28  1.8
Tumor Necrosis Factor Receptor  Tumor Necrosis Factor Receptor 1B   10.9    4.9 133 162 29  1.9
Tumor Necrosis Factor Receptor  Tumor Necrosis Factor Receptor 11A  11.5    5.1 130 166 25  1.9

The first and second column contain protein names and the 8th column contains the "distance" score between each protein pair. I would like to remove the lines containing duplicate protein pairs and keep only the pair with the lowest distance (the lowest value in the 8th column). This means that for the pair Protein A-Protein B I would like to remove all occurrences except the one with the lowest distance score. The pair is considered duplicate even if the protein names are swapped (in different columns). This means that Protein A Protein B is the same as Protein B Protein A.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

不一样的天空 2024-12-14 09:36:08

像这样的东西（未经测试）：

awk -F'\t' 'END {
  for (r in rec) print rec[r] 
  }
{
  if (mina[$1, $2] < $NF || minb[$2, $1] < $NF) {
    mina[$1, $2] = $NF; minb[$2, $1] = $NF
    rec[$1, $2] = $0
    }  
  }' infile

Something like this (untested):

awk -F'\t' 'END {
  for (r in rec) print rec[r] 
  }
{
  if (mina[$1, $2] < $NF || minb[$2, $1] < $NF) {
    mina[$1, $2] = $NF; minb[$2, $1] = $NF
    rec[$1, $2] = $0
    }  
  }' infile

回复收藏 0 原文

风吹短裙飘 2024-12-14 09:36:08

我希望这是最终更新^_^

kent$  awk -F'\t' '{if($1$2 in a){
                if($8<a[$1$2]){
                        a[$1$2]=$8;r[$1$2]=$0;
                }
        }else if ($2$1 in a){
                if($8<a[$2$1]){
                        a[$2$1] = $8;r[$2$1] = $0;
                }
        }else{
                a[$1$2]=$8; r[$1$2]=$0;
        }
} END{for(x in r)print r[x]}' yourFile

I hope this would be the final update ^_^

kent$  awk -F'\t' '{if($1$2 in a){
                if($8<a[$1$2]){
                        a[$1$2]=$8;r[$1$2]=$0;
                }
        }else if ($2$1 in a){
                if($8<a[$2$1]){
                        a[$2$1] = $8;r[$2$1] = $0;
                }
        }else{
                a[$1$2]=$8; r[$1$2]=$0;
        }
} END{for(x in r)print r[x]}' yourFile

回复收藏 0 原文

~没有更多了~