Intersection of two large word lists

Posted 2024-10-13 07:19:16

I have two word lists (180k and 260k), and I would like to generate a third file which is the set of words that appear in BOTH lists.

What is the best (most efficient) way of doing this? I've read forums talking about using grep, however I think the word lists are too big for this method.

Comments (5)

甜心小果奶 2024-10-20 07:19:16

If the two files are sorted (or you can sort them), you can use comm -1 -2 file1 file2 to print out the intersection.
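
A minimal sketch of that approach, assuming one word per line. The file names list1.txt, list2.txt and common.txt are placeholders, not names from the question, and the sample data stands in for the real lists. Both inputs are sorted under the same locale because comm expects a consistent collation order.

```shell
#!/bin/sh
# Sketch only: list1.txt, list2.txt and common.txt are placeholder names.
set -eu

workdir=$(mktemp -d)
cd "$workdir"

# Tiny sample data standing in for the real 180k / 260k word lists.
printf '%s\n' pear apple fig apple > list1.txt
printf '%s\n' fig kiwi apple > list2.txt

# comm needs both inputs sorted in the same collation order.
# LC_ALL=C forces plain byte order; -u drops duplicate words.
LC_ALL=C sort -u list1.txt > list1.sorted
LC_ALL=C sort -u list2.txt > list2.sorted

# -1 suppresses lines found only in the first file, -2 lines found only
# in the second, which leaves the lines present in both.
LC_ALL=C comm -1 -2 list1.sorted list2.sorted > common.txt

cat common.txt
```

Sorting a few hundred thousand short lines is quick on ordinary hardware, so the sort step should not be a bottleneck at this size.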

紙鸢 2024-10-20 07:19:16

You are correct, grep would be a bad idea. Type "man join" and follow the instructions.

If your files are just lists of words in a single column, or at least, if the important word is the first on each line, then all you need to do is:

$ sort -b -o f1 file1
$ sort -b -o f2 file2
$ join f1 f2

Otherwise, you may need to give the join(1) command some additional instructions:

JOIN(1)                   BSD General Commands Manual                  JOIN(1)

NAME
     join -- relational database operator

SYNOPSIS
     join [-a file_number | -v file_number] [-e string] [-o list] [-t char] [-1 field] [-2 field] file1 file2

DESCRIPTION
     The join utility performs an ``equality join'' on the specified files and writes the result to the standard output.  The ``join field'' is the field in each file by which the files are compared.  The
     first field in each line is used by default.  There is one line in the output for each pair of lines in file1 and file2 which have identical join fields.  Each output line consists of the join field,
     the remaining fields from file1 and then the remaining fields from file2.
     . . .
     . . .
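
As a rough illustration of the sort-then-join steps, here is a sketch that assumes the word is the first whitespace-separated field on each line. The sample data and the output names joined.txt and words.txt are made up for the example; `-o 1.1` is one way to keep only the shared word when the lines carry extra columns.

```shell
#!/bin/sh
# Sketch only: sample data and output file names are placeholders.
set -eu

workdir=$(mktemp -d)
cd "$workdir"

# The word is the first field; extra columns follow it.
printf '%s\n' 'pear 3' 'apple 7' 'fig 1' > file1
printf '%s\n' 'fig x' 'kiwi y' 'apple z' > file2

# join expects both files sorted on the join field, in the same locale.
LC_ALL=C sort -b -o f1 file1
LC_ALL=C sort -b -o f2 file2

# Default output: the join field, then the remaining fields of each file.
LC_ALL=C join f1 f2 > joined.txt

# -o 1.1 prints only field 1 of file 1, i.e. just the shared words.
LC_ALL=C join -o 1.1 f1 f2 > words.txt

cat joined.txt
cat words.txt
```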

心是晴朗的。 2024-10-20 07:19:16

Presuming one word per line, I would use grep:

grep -xFf seta setb  
  • -x matches the whole lines (no partial matches)
  • -F interprets the given patterns literally (no regular expressions)
  • -f seta specifies the patterns to search
  • setb is the file to search for the contents of seta

comm will do the same thing, but requires your sets to be pre-sorted:

comm -12 <(sort seta) <(sort setb)
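
To make the effect of the flags concrete, here is a small sketch with made-up sample data (the output names via_grep.txt and via_comm.txt are placeholders). It uses temporary sorted copies instead of `<(...)`, on the assumption that the shell may not support process substitution. Note that grep keeps the order of setb, while comm emits sorted output.

```shell
#!/bin/sh
# Sketch only: sample data and output file names are placeholders.
set -eu

workdir=$(mktemp -d)
cd "$workdir"

printf '%s\n' pear apple fig a.c > seta
printf '%s\n' fig kiwi apple abc pineapple > setb

# -f seta: read the patterns from seta, one per line
# -F: fixed strings, so "a.c" does not match "abc"
# -x: whole-line matches only, so "apple" does not match "pineapple"
grep -xFf seta setb > via_grep.txt

# The same set via comm, using sorted temporary copies.
LC_ALL=C sort seta > seta.sorted
LC_ALL=C sort setb > setb.sorted
LC_ALL=C comm -12 seta.sorted setb.sorted > via_comm.txt

cat via_grep.txt
cat via_comm.txt
```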

む无字情书 2024-10-20 07:19:16

grep -P '[ A-Za-z0-9]*' file1 | xargs -I {} grep -xF -- {} file2 > file3

I believe this takes each line of file1, checks whether that line appears in file2, and writes anything that matches to file3. It starts one grep process per word, so it will be slow on lists of this size.

萌化 2024-10-20 07:19:16

Back in the day I managed to find a Perl script that does something similar:

http://www.perlmonks.org/?node_id=160735
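
The linked script is not reproduced here. Independently of it, a hash-based pass that needs no sorting can be sketched with awk. This is only an illustration under the assumption of one word per line; file1, file2 and file3 are placeholder names and the sample data is made up.

```shell
#!/bin/sh
# Sketch only: not the linked Perl script; file names are placeholders.
set -eu

workdir=$(mktemp -d)
cd "$workdir"

printf '%s\n' pear apple fig > file1
printf '%s\n' fig kiwi apple fig > file2

# NR == FNR holds only while awk reads the first file: remember each word.
# For the second file, print a word the first time it is found in both.
awk 'NR == FNR { seen[$0] = 1; next }
     ($0 in seen) && !printed[$0]++' file1 file2 > file3

cat file3
```

The output follows the order of the second file, and the first list is held in memory, which should be unproblematic for a few hundred thousand words.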
