Awk: conditionally removing duplicate lines
I have a tab-delimited text file with 8 columns:
Erythropoietin Receptor Integrin Beta 4 11.7 9.7 164 195 19 3.2
Erythropoietin Receptor Receptor Tyrosine Phosphatase F 10.8 2.6 97 107 15 3.2
Erythropoietin Receptor Leukemia Inhibitory Factor Receptor 12.0 3.6 171 479 14 3.2
Erythropoietin Receptor Immunoglobulin 9 10.4 3.1 100 108 24 3.3
Erythropoietin Receptor Collagen Alpha 1 Xx 10.7 2.7 93 105 18 3.3
Tumor Necrosis Factor Receptor Tumor Necrosis Factor Receptor 5 11.4 3.2 114 114 25 1.7
Tumor Necrosis Factor Receptor Tumor Necrosis Factor Receptor 14 11.1 2.1 99 100 28 1.8
Tumor Necrosis Factor Receptor Tumor Necrosis Factor Receptor 1B 10.9 4.9 133 162 29 1.9
Tumor Necrosis Factor Receptor Tumor Necrosis Factor Receptor 11A 11.5 5.1 130 166 25 1.9
The first and second columns contain protein names, and the 8th column contains the "distance" score between each protein pair. I would like to remove the lines containing duplicate protein pairs and keep only the line with the lowest distance (the lowest value in the 8th column). In other words, for the pair Protein A-Protein B, I would like to remove all occurrences except the one with the lowest distance score. A pair counts as a duplicate even if the protein names are swapped between the two columns: Protein A Protein B is the same pair as Protein B Protein A.
Comments (2)
Something like this (untested):
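The snippet that went with this answer is not preserved here. As a stand-in, here is a minimal sketch of one way to do what the question asks. It is not the original answerer's code. It assumes the file really is tab-delimited with the two protein names in columns 1 and 2 and a numeric distance in column 8. The function name `dedup_pairs` and the array names are made up for illustration. The idea is to build a key from the two names in sorted order, so that swapped pairs collide, and to remember the row with the smallest distance for each key.

```shell
# dedup_pairs: hypothetical helper name, not from the original answer.
# Reads tab-delimited rows from the given files (or stdin) and prints
# one row per unordered protein pair: the row with the lowest column 8.
dedup_pairs() {
  awk -F'\t' '
    {
      # Append "" to force string comparison even if a name looks numeric.
      a = $1 ""; b = $2 ""
      # Order-independent key: "A<TAB>B" and "B<TAB>A" map to the same key.
      key = (a < b) ? a SUBSEP b : b SUBSEP a
      dist = $8 + 0
      if (!(key in best)) {
        # First time this pair is seen: record its position and its row.
        order[++n] = key
        best[key] = dist
        row[key] = $0
      } else if (dist < best[key]) {
        # A strictly lower distance replaces the stored row (ties keep the first).
        best[key] = dist
        row[key] = $0
      }
    }
    # Print the surviving rows in the order each pair first appeared.
    END { for (i = 1; i <= n; i++) print row[order[i]] }
  ' "$@"
}
```

Usage would be along the lines of `dedup_pairs proteins.tsv > proteins.dedup.tsv` (both file names are placeholders). Holding one row per pair in memory keeps the original ordering without a separate sort step, at the cost of memory proportional to the number of distinct pairs.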
I hope this would be the final update ^_^