Finding the set complement in Unix

Posted on 2024-08-19 05:23:55

Given these two files:

 $ cat A.txt     $ cat B.txt
    3           11
    5           1
    1           12
    2           3
    4           2

I want to find the numbers (lines) that are in A but not in B.
What is the Unix command for that?

I tried this, but it seems to fail:

comm -3 <(sort -n A.txt) <(sort -n B.txt) | sed 's/\t//g'

猥琐帝 2024-08-26 05:23:55

comm -2 -3 <(sort A.txt) <(sort B.txt)

should do what you want, if I understood you correctly.

Edit: Actually, comm needs the files to be sorted in lexicographical order, so you don't want -n in your sort command:

$ cat A.txt
1
4
112
$ cat B.txt
1
112
# Bad:
$ comm -2 -3 <(sort -n A.txt) <(sort -n B.txt)
4
comm: file 1 is not in sorted order
112
# OK:
$ comm -2 -3 <(sort A.txt) <(sort B.txt)
4
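
The ordering requirement can also be pinned down explicitly. The sketch below assumes a POSIX `sort`, `comm`, and `mktemp`; it forces byte-wise collation on both tools with `LC_ALL=C` so they always agree, and it avoids `<(...)` so it also runs in plain `sh`. The function name `a_not_b` is made up for illustration, and the final `sort -n` is only for display and assumes the lines are numbers.

```shell
# Print lines that occur in file $1 but not in file $2 (plain POSIX sh).
a_not_b() {
    tmp=$(mktemp) || return 1
    # LC_ALL=C makes sort and comm agree on byte-wise ordering.
    LC_ALL=C sort "$2" > "$tmp"
    # "-" tells comm to read its first input from stdin.
    LC_ALL=C sort "$1" | LC_ALL=C comm -23 - "$tmp" | sort -n
    rm -f "$tmp"
}
```

With the files from the question, `a_not_b A.txt B.txt` should print 4 and 5.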

墨落成白 2024-08-26 05:23:55

You can try this:

$ awk 'FNR==NR{a[$0];next} (!($0 in a))' B.txt A.txt
5
4
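
For readers unfamiliar with the idiom, here is the same one-liner spelled out with comments. This is only a sketch wrapped in a hypothetical helper named `not_in`; like the one-liner, it takes B first and A second. One caveat: it assumes B is not empty, because with an empty first file `FNR == NR` stays true while A is being read and nothing is printed.

```shell
# Print lines of file $2 that do not appear in file $1, keeping the order of $2.
not_in() {
    awk '
        # FNR == NR holds only while reading the first file:
        # remember each of its lines as an array key.
        FNR == NR { seen[$0]; next }
        # Second file: print lines that were never stored.
        !($0 in seen)
    ' "$1" "$2"
}
```

Usage: `not_in B.txt A.txt`.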

花开柳相依 2024-08-26 05:23:55

Note that the awk solution works, but it retains duplicates in A (for lines that aren't in B); the Python solution de-dupes the result.

Also note that comm doesn't compute a true set difference: if a line is repeated in A and repeated fewer times in B, comm will leave the "extra" line(s) in the result:

$ cat A.txt 
120
121
122
122
$ cat B.txt 
121
122
121
$ comm -23 <(sort A.txt) <(sort B.txt)
120
122

If this behavior is undesired, use sort -u to remove duplicates (only the dupes in A matter):

$ comm -23 <(sort -u A.txt) <(sort B.txt)
120
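
If the goal is a true set difference that also keeps A's original order, one option (a sketch, not taken from any of the answers here) is to extend the awk approach with a second array that suppresses repeats. The helper name `set_diff` and the array names `in_b` and `printed` are made up for illustration; as with the plain awk version, it assumes B is not empty.

```shell
# Print each line of $1 that is absent from $2, once, in the order of $1.
set_diff() {
    awk '
        # While reading the first file given to awk (B), record its lines.
        FNR == NR { in_b[$0]; next }
        # For A: print a line only if it is not in B and not printed before.
        !($0 in in_b) && !printed[$0]++
    ' "$2" "$1"
}
```

Usage: `set_diff A.txt B.txt`. Unlike the comm pipeline, no sorting is needed, so the output follows the order of A rather than sort order.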

你列表最软的妹 2024-08-26 05:23:55

I recently wrote a program called Setdown that does set operations from the command line.

You perform set operations by writing definitions similar to what you would write in a Makefile:

someUnion: "file-1.txt" \/ "file-2.txt"
someIntersection: "file-1.txt" /\ "file-2.txt"
someDifference: someUnion - someIntersection

It's pretty cool and you should check it out. I personally don't recommend performing set operations with ad-hoc commands that were not built for the job. They don't work well when you need to do many set operations, or when some of your set operations depend on each other. Setdown, by contrast, lets you write set operations that depend on other set operations!

At any rate, I think that it's pretty cool and you should totally check it out.

Note: I think that Setdown is much better than comm simply because Setdown does not require you to sort your inputs correctly. Instead, Setdown sorts your inputs for you, and it uses an external sort, so it can handle massive files. I consider this a major benefit, because I have lost count of the times I forgot to sort the files I passed to comm.
