Fast intersection, complement, and union of tab-delimited text files?
Can someone recommend a fast Unix-based utility (ideally written in C) for computing efficient, streaming intersections/unions of tab-delimited text files? For example, it should allow queries such as "give me all the entries in file A whose value in column K does not appear in column K of file B".
e.g., if file A is:
bob sally sue
bob mary john
and file B is:
john sally sue
foo bar quux
then the complement of file A relative to B on column 2 would return "bob mary john", since that is the only entry in file A whose column 2 value does not appear in column 2 of file B.
I'd prefer not to use a database, but would like a command line based utility. Is awk the answer or is there something simpler?
thanks.
Comments (1)
If it were only for that particular query, I'd probably go with awk: hash column 2 of B, then filter A against the hash.
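Something along these lines might do it. This is only a rough sketch: the file names `A.tsv` and `B.tsv` and the temp directory are made up for illustration, and it assumes the sample data from the question is tab-separated. It reads B first and stores each column 2 value as an array key, then streams A and prints only the rows whose column 2 value was never stored.

```shell
# Recreate the two sample files from the question (assumed tab-delimited;
# the names A.tsv and B.tsv are hypothetical).
tmpdir=$(mktemp -d)
printf 'bob\tsally\tsue\nbob\tmary\tjohn\n' > "$tmpdir/A.tsv"
printf 'john\tsally\tsue\nfoo\tbar\tquux\n' > "$tmpdir/B.tsv"

# While reading the first file (B), NR == FNR holds: record each column-2
# value as a key of the array "seen" and skip to the next line.
# While reading the second file (A): print rows whose column-2 value
# is not a key of "seen".
result=$(awk -F'\t' 'NR == FNR { seen[$2]; next } !($2 in seen)' \
    "$tmpdir/B.tsv" "$tmpdir/A.tsv")
printf '%s\n' "$result"

# Clean up the scratch files.
rm -rf "$tmpdir"
```

On the sample data this prints the `bob mary john` row. Dropping the `!` gives the intersection on that column instead. Only B's keys are held in memory while A is streamed, so it helps to pass the smaller file first when the query allows it. One caveat: the `NR == FNR` trick misfires if the first file is empty, because the second file's lines then satisfy the condition as well.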