How do I compare 3 files (to see what they have in common)?
I want to compare three files to see how much of their information is shared. The file format looks like this:
Chr11 447 . A C 74 . DP=22;AF1=1;CI95=1,1;DP4=0,0,9,8;MQ=15;FQ=-78 GT:PL:GQ 1/1:107,51,0:99
Chr10 449 . G C 35 . DP=26;AF1=0.5;CI95=0.5,0.5;DP4=5,0,7,8;MQ=20;FQ=11.3;PV4=0.055,0.0083,0.028,1 GT:PL:GQ 0/1:65,0,38:41
Chr12 517 . G A 222 . DP=122;AF1=1;CI95=1,1;DP4=0,0,77,40;MQ=23;FQ=-282 GT:PL:GQ 1/1:255,255,0:99
Chr10 761 . G A 41 . DP=93;AF1=0.5;CI95=0.5,0.5;DP4=11,34,6,35;MQ=19;FQ=44;PV4=0.29,1.8e-35,1,1 GT:PL:GQ 0/1:71,0,116:74
I'm only interested in the first two columns: if they match, I consider the lines equal. This is the command I use to compare two files:
awk 'FILENAME==ARGV[1] {pair[$1 " " $2]; next} ($1 " " $2 in pair)' file1 file2 | wc -l
I would like to stick with awk, since my files are really big and awk handles them well, but I couldn't figure out how to extend this to three files.
I don't intend to start an editor war, but I am familiar with vi, and vimdiff and its variants show the differences between multiple files side by side, which I find very handy. Simply call it with the files as arguments:
vimdiff file1 file2 file3
If you simply want to print the pairs (column 1 + column 2) that are common to all 3 files, and you can rely on a pair being unique within each file, a pipeline of awk, sort and uniq will do it.
This works for an arbitrary number of files as long as you change the parameter of the last command.
Here's what it does:
First, it prints the first two columns of all files and sorts them (awk '{print $1" "$2}' a b c | sort).
Next, it counts the duplicate entries (uniq -c).
The last command then keeps only the pairs whose count equals the number of files; that number is the parameter to change.
If you're doing this often, you can express it as a bash function (and drop it in your .bashrc) that takes the file count as a parameter. Call it with any number of files you want:
common_pairs file1 file2 file3 fileN
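The commands themselves did not survive in this copy of the answer. Below is a minimal sketch of the function described above, assuming whitespace-separated input and pairs that are unique within each file. The final awk filter and the variable name numf are reconstructed from the description, not taken from the original post.

```shell
# Sketch: print the "col1 col2" pairs that occur in every file given.
# Assumes a pair appears at most once per file, so a count equal to the
# number of files means the pair is present in all of them.
common_pairs() {
    # uniq -c prefixes each distinct pair with its count, so the last awk
    # sees: $1 = count, $2 = column 1, $3 = column 2.
    awk '{print $1" "$2}' "$@" | sort | uniq -c |
        awk -v numf="$#" '$1 == numf {print $2" "$3}'
}
```

Appending | wc -l to the call gives the number of shared pairs, matching the output of the two-file command in the question.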
For this I'd use the commands cut, sort and comm:
With cut, strip the fields that aren't needed.
Sort the result, since comm expects sorted input.
Use comm to get the lines that are in both file1 and file2.
Use comm again to keep the lines that are also in file3.
A script would run these four steps in order, writing each intermediate result to a temporary file.
(Of course one could use extended shell syntax to avoid temporary files, but I don't want to hide the idea behind complex syntax.)
The file tmp.1+2+3 should then contain the keys present in all three files. If you're interested in the whole lines, you can use the join command in combination with a sorted version of any of the three input files.
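The script is missing from this copy of the answer. Below is a minimal sketch of the four steps, assuming fields separated by single spaces (for tab-separated files, drop -d' ', since cut defaults to tab). The function name common_keys3 and the intermediate names tmp.1, tmp.2, tmp.3 and tmp.1+2 are hypothetical; only tmp.1+2+3 is named in the answer.

```shell
# Sketch: keys (columns 1-2) common to three files, via temporary files
# written to the current directory.
common_keys3() {
    # Steps 1 and 2: keep only the first two fields, then sort for comm.
    # sort -u also drops pairs repeated within a single file.
    cut -d' ' -f1,2 "$1" | sort -u > tmp.1
    cut -d' ' -f1,2 "$2" | sort -u > tmp.2
    cut -d' ' -f1,2 "$3" | sort -u > tmp.3
    # Step 3: comm -12 hides lines unique to either input,
    # leaving only the lines present in both.
    comm -12 tmp.1 tmp.2 > tmp.1+2
    # Step 4: intersect that result with the third file.
    comm -12 tmp.1+2 tmp.3 > tmp.1+2+3
}
```

After calling common_keys3 file1 file2 file3, running wc -l tmp.1+2+3 reports how many keys all three files share.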
I just read your last comment: you want the files joined, but with duplicates removed?