Fast intersection, complement, and union of tab-delimited text files?

Posted on 2024-12-19 17:25:33

Can someone recommend a fast Unix-based utility (ideally written in C) for computing efficient, streaming intersections, unions, and complements of tab-delimited text files? For example, it should allow queries such as "give me all the entries in file A whose value in column K does not appear in column K of file B".

e.g., if file A is:

bob sally sue
bob mary john

and file B is:

john sally sue
foo bar quux

then the complement of file A relative to B on column 2 would return "bob mary john", since that is the only entry in file A whose column 2 value does not appear in column 2 of file B.
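To pin down the semantics I have in mind, here is a minimal sketch in Python rather than the C or awk tool I am asking about. The function name `complement_on_column` and its parameters are hypothetical, chosen only for illustration. It assumes that column K of file B fits in memory as a set, so only file A is truly streamed, and that columns are numbered from 1.

```python
import io


def complement_on_column(a_lines, b_lines, column, sep="\t"):
    """Yield lines of A whose value in `column` (1-based) is absent
    from the same column of B. Hypothetical helper for illustration."""
    idx = column - 1

    # Pass 1: collect every value seen in the chosen column of B.
    seen = set()
    for line in b_lines:
        fields = line.rstrip("\n").split(sep)
        if idx < len(fields):
            seen.add(fields[idx])

    # Pass 2: stream A and keep rows whose key is not in B.
    # Rows of A that are too short to have the column are skipped.
    for line in a_lines:
        fields = line.rstrip("\n").split(sep)
        if idx < len(fields) and fields[idx] not in seen:
            yield line.rstrip("\n")


# The example files from the question, with real tab separators.
file_a = io.StringIO("bob\tsally\tsue\nbob\tmary\tjohn\n")
file_b = io.StringIO("john\tsally\tsue\nfoo\tbar\tquux\n")

result = list(complement_on_column(file_a, file_b, column=2))
print(result)
```

An intersection on the same column would be the same two passes with the membership test inverted (`in seen` instead of `not in seen`).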

I'd prefer not to use a database and would like a command-line utility instead. Is awk the answer, or is there something simpler?
Thanks.
