Intersection of two large word lists
I have two word lists (180k and 260k), and I would like to generate a third file which is the set of words that appear in BOTH lists.
What is the best (most efficient) way of doing this? I've read forums talking about using grep, however I think the word lists are too big for this method.

5 Answers
If the two files are sorted (or you can sort them), you can use

comm -1 -2 file1 file2

to print out the intersection.
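The comm approach above can be sketched end to end; the sample lists here are hypothetical stand-ins for the real 180k/260k files:

```shell
# Two small sample word lists; comm requires both inputs to be sorted.
printf 'apple\nbanana\ncherry\n' > file1
printf 'banana\ncherry\ndate\n'  > file2

# -1 suppresses lines unique to file1, -2 suppresses lines unique
# to file2, leaving only the lines common to both.
comm -12 file1 file2 > both.txt

cat both.txt    # banana, cherry
```

`comm -12` is shorthand for `comm -1 -2`; both forms are POSIX.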
You are correct, grep would be a bad idea. Type "man join" and follow the instructions.
If your files are just lists of words in a single column, or at least, if the important word is the first on each line, then all you need to do is:
Otherwise, you may need to give the join(1) command some additional instructions:
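A minimal sketch of the join(1) suggestion, using hypothetical file names; like comm, join requires its inputs to be sorted:

```shell
# Sample one-word-per-line lists (stand-ins for the real files).
printf 'cherry\napple\nbanana\n' > list1.txt
printf 'date\nbanana\ncherry\n'  > list2.txt

# join(1) needs sorted input, so sort both lists first.
sort list1.txt -o list1.sorted
sort list2.txt -o list2.sorted

# With single-column input, join pairs identical lines, i.e. it
# prints exactly the words present in both files.
join list1.sorted list2.sorted > common.txt

cat common.txt    # banana, cherry
```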
Presuming one word per line, I would use grep:

grep -xF -f seta setb

-x matches whole lines only (no partial matches)
-F interprets the given patterns literally (no regular expressions)
-f seta specifies the file containing the patterns to search for
setb is the file to search for the contents of seta

comm will do the same thing, but requires your sets to be pre-sorted.
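The grep flags described above can be sketched as follows; the file names and contents are hypothetical stand-ins:

```shell
# Sample word lists (stand-ins for the real 180k/260k files).
printf 'apple\nbanana\ncherry\n' > seta
printf 'banana\ncherry\ndate\n'  > setb

# -x whole-line match, -F literal strings (no regex),
# -f seta reads one pattern per line from seta.
grep -xF -f seta setb > intersection.txt

cat intersection.txt    # banana, cherry
```

Unlike comm or join, this needs no pre-sorting, and output order follows setb.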
grep -P '[ A-Za-z0-9]*' file1 | xargs -0 -I {} grep {} file2 > file3
I believe this looks for anything in file1, then checks if what was in file1 is in file2, and puts anything that matches into file3.
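One caveat with the pipeline above: grep emits newline-separated words, while xargs -0 expects NUL-delimited input, and plain `grep {}` allows partial matches. A variant without -0 and with whole-line literal matching might look like this (sample files are hypothetical):

```shell
printf 'apple\nbanana\ncherry\n' > file1
printf 'banana\ncherry\ndate\n'  > file2

# xargs splits on newlines/whitespace by default, so -0 is not
# needed; -x and -F make each lookup an exact whole-line match.
# "|| true" because grep exits nonzero for words with no match.
xargs -I {} grep -xF {} file2 < file1 > file3 || true

cat file3    # banana, cherry
```

Note this spawns one grep per word, so for 180k words `grep -xF -f file1 file2` would be far faster.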
Back in the days I managed to find a Perl script that does something similar:
http://www.perlmonks.org/?node_id=160735
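The linked script is not reproduced here, but the usual hash-lookup approach such scripts take can be sketched as an awk one-liner (sample files are hypothetical, and this is in the same spirit as, not identical to, the linked Perl):

```shell
printf 'apple\nbanana\ncherry\n' > file1
printf 'banana\ncherry\ndate\n'  > file2

# First pass (NR==FNR) stores every line of file1 as a hash key;
# second pass prints file2 lines found in that hash.
awk 'NR==FNR { seen[$0]; next } $0 in seen' file1 file2 > common.txt

cat common.txt    # banana, cherry
```

Both files are read only once, so this also avoids any sorting.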