How do I compare 3 files (to see what they have in common)?

Posted 2024-12-13 07:46:42

I want to compare 3 files to see how much of the information in the files is the same. The file format is something like this:

Chr11   447     .       A       C       74      .       DP=22;AF1=1;CI95=1,1;DP4=0,0,9,8;MQ=15;FQ=-78   GT:PL:GQ        1/1:107,51,0:99
Chr10   449     .       G       C       35      .       DP=26;AF1=0.5;CI95=0.5,0.5;DP4=5,0,7,8;MQ=20;FQ=11.3;PV4=0.055,0.0083,0.028,1   GT:PL:GQ        0/1:65,0,38:41
Chr12   517     .       G       A       222     .       DP=122;AF1=1;CI95=1,1;DP4=0,0,77,40;MQ=23;FQ=-282       GT:PL:GQ        1/1:255,255,0:99
Chr10   761     .       G       A       41      .       DP=93;AF1=0.5;CI95=0.5,0.5;DP4=11,34,6,35;MQ=19;FQ=44;PV4=0.29,1.8e-35,1,1      GT:PL:GQ        0/1:71,0,116:74

I'm only interested in the first two columns (if the first two columns are the same, then I consider the lines equal). This is the command that I use for comparing two files:

awk 'FILENAME==ARGV[1] {pair[$1 " " $2]; next} ($1 " " $2 in pair)'  file1 file2 | wc -l

I would like to use the awk command since my files are really big and awk handles them really well! But I couldn't figure out how to use it for 3 files.
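
One way the same idea might extend to three files (a sketch: it keeps the question's file1/file2 names, adds a hypothetical file3, assumes file1 is non-empty, and relies on the FNR==NR idiom to tell the inputs apart) is to collect the pairs of file1, keep those that also appear in file2, and then count the lines of file3 whose pair survived both:

awk 'FNR==NR           {pair[$1 " " $2]; next}      # file1: remember every pair
     FILENAME==ARGV[2] {if (($1 " " $2) in pair) both[$1 " " $2]; next}  # file2: keep pairs also seen in file1
     ($1 " " $2) in both' file1 file2 file3 | wc -l  # file3: count lines whose pair is in all three

Like the two-file version, this counts matching lines of the last file, so a pair that occurs twice in file3 is counted twice.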

Comments (4)

∞觅青森が 2024-12-20 07:46:42

Not intending to start an editor war, but I am familiar with VI, and vimdiff and its variants show the comparison between multiple files in a parallel view, which I find very handy. You can simply call it with

$ vimdiff <filelist>
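
For the three files from the question, that would presumably be:

$ vimdiff file1 file2 file3
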
清引 2024-12-20 07:46:42

If it's simply to print out the pairs (column1 + column2) that are common to all 3 files, making use of the fact that a pair is unique within a file, you could do it this way:

awk '{print $1" "$2}' a b c | sort | uniq -c | awk '{if ($1==3){print $2" "$3}}'

This works with an arbitrary number of files, as long as you adjust the parameter of the last command (the 3 must equal the number of files).

Here's what it does:

  1. Print and sort the first 2 columns of all files (awk '{print $1" "$2}' a b c | sort)
  2. Count the number of duplicate entries (uniq -c)
  3. If the duplicate count equals the number of files, we have found a match; print it.

If you're doing this often, you can express it as a bash function (and drop it in your .bashrc) which parametrises the file count.

function common_pairs { 
    awk '{print $1" "$2}' "$@" | sort | uniq -c | awk -v numf=$# '{if ($1==numf){print $2" "$3}}'; 
}

Call it with any number of files you want: common_pairs file1 file2 file3 fileN
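
If a pair can repeat within a single file, a minimal variant of the same pipeline (using the same placeholder file names a, b, c as above) deduplicates each file's keys first, so the count reflects the number of files a pair appears in rather than the number of lines:

for f in a b c; do
    awk '{print $1" "$2}' "$f" | sort -u    # at most one copy of each pair per file
done | sort | uniq -c | awk '{if ($1==3){print $2" "$3}}'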

等待圉鍢 2024-12-20 07:46:42

For this I'd use the commands cut, sort and comm.

  1. With cut, cut away the fields that are not needed.

  2. sort the outcome since comm expects sorted input.

  3. Use comm to get the lines which are in file1 and file2.

  4. Use comm again to get the lines that are also in file3.

A script could look like this:

 for i in 1 2 3
  do
   # options to cut may have to be adjusted for your input files
   cut -c1-15 file$i | sort > tmp.$i
  done

 comm -12 tmp.1 tmp.2   > tmp.1+2
 comm -12 tmp.3 tmp.1+2 > tmp.1+2+3

(Of course one may use extended shell syntax to avoid the temporary files, but I don't want to hide the idea behind complex syntax expressions.)

In file tmp.1+2+3 you should now have the keys present in all three files. If you're interested in the whole lines, you may use the command join in combination with a sorted version of any of the three input files.
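
For reference, a minimal sketch of the extended-shell-syntax variant alluded to above, assuming bash; awk is used here instead of a fixed-width cut, so the key is the first two whitespace-separated fields:

comm -12 <(comm -12 <(awk '{print $1, $2}' file1 | sort) \
                    <(awk '{print $1, $2}' file2 | sort)) \
         <(awk '{print $1, $2}' file3 | sort)

The inner comm -12 prints the keys common to file1 and file2; the outer one intersects that result with the keys of file3, all without temporary files.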

纸伞微斜 2024-12-20 07:46:42

Just read your last comment - You want the files joined, but duplicates removed?

 sort file1 file2 file3 | uniq > newfile
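
If "duplicate" should be decided by the first two columns only, as in the question, rather than by whole lines, a sketch assuming GNU sort (sort -u with key options keeps one line per distinct key):

 # keep one line per distinct (column1, column2) pair across all three files
 sort -k1,1 -k2,2n -u file1 file2 file3 > newfile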