Finding the set complement in Unix
Given these two files:
$ cat A.txt    $ cat B.txt
3              11
5              1
1              12
2              3
4              2
I want to find the lines that are in A but not in B.
What's the Unix command for it?
I tried this, but it seems to fail:
comm -3 <(sort -n A.txt) <(sort -n B.txt) | sed 's/\t//g'
Comments (5)
This should do what you want, if I understood you correctly.
Edit: Actually, comm needs the files to be sorted in lexicographical order, so you don't want -n in your sort command:
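The command itself is not shown in this answer. A minimal sketch of what it describes, assuming the `-23` flags (suppress the "only in B" and "in both" columns) and a plain `sort`; the scratch directory and the `.sorted` file names are only for illustration:

```shell
# Recreate the sample files from the question in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 3 5 1 2 4 > A.txt
printf '%s\n' 11 1 12 3 2 > B.txt

# comm expects lexicographically sorted input, so use plain sort (no -n).
sort A.txt > A.sorted
sort B.txt > B.sorted

# -2 drops lines found only in B, -3 drops lines common to both,
# which leaves column 1: lines found only in A.
result=$(comm -23 A.sorted B.sorted)
printf '%s\n' "$result"   # prints 4 and 5, one per line

# In bash the same thing fits on one line with process substitution:
#   comm -23 <(sort A.txt) <(sort B.txt)
```

The attempt in the question used `-3` alone, which keeps both the "only in A" and the "only in B" columns, and `sort -n`, which gives comm an order it does not expect.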
You can try this:
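The command for this answer is not shown either. Another answer in this thread refers to "the awk solution", so one plausible form (a sketch, not necessarily what the poster wrote; the array name `seen` is made up) would be:

```shell
# Recreate the sample files from the question in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 3 5 1 2 4 > A.txt
printf '%s\n' 11 1 12 3 2 > B.txt

# While reading the first file (NR==FNR, i.e. B.txt), remember every line.
# While reading the second file (A.txt), print lines never seen in B.
result=$(awk 'NR==FNR { seen[$0]; next } !($0 in seen)' B.txt A.txt)
printf '%s\n' "$result"   # prints 5 and 4, in A's original order
```

No sorting is needed here, and the output keeps the order of A.txt. Note that B.txt is passed first.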
Note that the awk solution works but retains duplicates in A (which aren't in B); the Python solution de-dupes the result.
Also note that comm doesn't compute a true set difference: if a line is repeated in A and repeated fewer times in B, comm will leave the "extra" line(s) in the result.
If this behavior is undesired, use sort -u to remove duplicates (only the dupes in A matter):
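A small sketch of both behaviors, using made-up inputs where "x" appears twice in A and once in B (file names are illustrative):

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
# Hypothetical inputs: "x" is repeated in A but appears only once in B.
printf '%s\n' x x y > A.txt
printf '%s\n' x > B.txt

sort A.txt > A.sorted
sort B.txt > B.sorted

# comm pairs lines one-to-one, so the second "x" in A stays unmatched.
with_dupes=$(comm -23 A.sorted B.sorted)
printf '%s\n' "$with_dupes"   # prints x and y

# Deduplicating A first yields a true set difference.
sort -u A.txt > A.unique
set_diff=$(comm -23 A.unique B.sorted)
printf '%s\n' "$set_diff"     # prints y
```

Duplicates in B never reach the output of `comm -23`, which is why only A needs `sort -u`.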
I recently wrote a program called Setdown that does set operations from the CLI.
It can perform set operations by writing a definition similar to what you would write in a Makefile:
It's pretty cool and you should check it out. I personally don't recommend using ad-hoc commands that were not built for the job to perform set operations. They won't work well when you need to do many set operations, or when some of your set operations depend on each other. Setdown, by contrast, lets you write set operations that depend on other set operations.
At any rate, I think it's pretty cool and you should totally check it out.
Note: I think Setdown is much better than comm simply because Setdown does not require you to sort your inputs correctly. Instead, Setdown sorts your inputs for you, and it uses an external sort, so it can handle massive files. I consider this a major benefit because I have lost count of the times I forgot to sort the files I passed to comm.
Here is another way to do it with join:
From the documentation on join:
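Neither the command nor the quoted documentation is shown here. A minimal sketch, assuming the answer relies on the `-v 1` option, which POSIX describes as producing a line only for each unpairable line in file 1:

```shell
# Recreate the sample files from the question in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 3 5 1 2 4 > A.txt
printf '%s\n' 11 1 12 3 2 > B.txt

# Like comm, join expects input sorted lexicographically on the join field.
sort A.txt > A.sorted
sort B.txt > B.sorted

# -v 1 prints only the lines of the first file with no match in the second.
result=$(join -v 1 A.sorted B.sorted)
printf '%s\n' "$result"   # prints 4 and 5, one per line
```

join matches on the first whitespace-separated field by default, so this behaves like a whole-line comparison only when each line is a single field, as in the sample files.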