Intersection of two large word lists
I have two word lists (180k and 260k), and I would like to generate a third file which is the set of words that appear in BOTH lists.
What is the best (most efficient) way of doing this? I've read forums talking about using grep, however I think the word lists are too big for this method.

5 Answers
If the two files are sorted (or you can sort them), you can use

comm -1 -2 file1 file2

to print out the intersection.
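The comm approach above can be sketched end to end; the sample lists here are hypothetical stand-ins for the real 180k/260k files:

```shell
# Two small sample word lists; comm requires both inputs to be sorted.
printf 'apple\nbanana\ncherry\n' > file1
printf 'banana\ncherry\ndate\n'  > file2

# -1 suppresses lines unique to file1, -2 suppresses lines unique
# to file2, leaving only the lines common to both.
comm -12 file1 file2 > both.txt

cat both.txt    # banana, cherry
```

`comm -12` is shorthand for `comm -1 -2`; both forms are POSIX.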
You are correct, grep would be a bad idea. Type "man join" and follow the instructions.
If your files are just lists of words in a single column, or at least, if the important word is the first on each line, then all you need to do is:
Otherwise, you may need to give the join(1) command some additional instructions:
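A minimal sketch of the join(1) suggestion, using hypothetical file names; like comm, join requires its inputs to be sorted:

```shell
# Sample one-word-per-line lists (stand-ins for the real files).
printf 'cherry\napple\nbanana\n' > list1.txt
printf 'date\nbanana\ncherry\n'  > list2.txt

# join(1) needs sorted input, so sort both lists first.
sort list1.txt -o list1.sorted
sort list2.txt -o list2.sorted

# With single-column input, join pairs identical lines, i.e. it
# prints exactly the words present in both files.
join list1.sorted list2.sorted > common.txt

cat common.txt    # banana, cherry
```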
Presuming one word per line, I would use grep:

grep -xF -f seta setb

-x matches whole lines only (no partial matches)
-F interprets the given patterns literally (no regular expressions)
-f seta specifies the file containing the patterns to search for
setb is the file to search for the contents of seta

comm will do the same thing, but requires your sets to be pre-sorted.
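The grep flags described above can be sketched as follows; the file names and contents are hypothetical stand-ins:

```shell
# Sample word lists (stand-ins for the real 180k/260k files).
printf 'apple\nbanana\ncherry\n' > seta
printf 'banana\ncherry\ndate\n'  > setb

# -x whole-line match, -F literal strings (no regex),
# -f seta reads one pattern per line from seta.
grep -xF -f seta setb > intersection.txt

cat intersection.txt    # banana, cherry
```

Unlike comm or join, this needs no pre-sorting, and output order follows setb.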
grep -P '[ A-Za-z0-9]*' file1 | xargs -0 -I {} grep {} file2 > file3
I believe this looks for anything in file1, then checks if what was in file1 is in file2, and puts anything that matches into file3.
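One caveat with the pipeline above: grep emits newline-separated words, while xargs -0 expects NUL-delimited input, and plain `grep {}` allows partial matches. A variant without -0 and with whole-line literal matching might look like this (sample files are hypothetical):

```shell
printf 'apple\nbanana\ncherry\n' > file1
printf 'banana\ncherry\ndate\n'  > file2

# xargs splits on newlines/whitespace by default, so -0 is not
# needed; -x and -F make each lookup an exact whole-line match.
# "|| true" because grep exits nonzero for words with no match.
xargs -I {} grep -xF {} file2 < file1 > file3 || true

cat file3    # banana, cherry
```

Note this spawns one grep per word, so for 180k words `grep -xF -f file1 file2` would be far faster.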
Back in the days I managed to find a Perl script that does something similar:
http://www.perlmonks.org/?node_id=160735
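The linked script is not reproduced here, but the usual hash-lookup approach such scripts take can be sketched as an awk one-liner (sample files are hypothetical, and this is in the same spirit as, not identical to, the linked Perl):

```shell
printf 'apple\nbanana\ncherry\n' > file1
printf 'banana\ncherry\ndate\n'  > file2

# First pass (NR==FNR) stores every line of file1 as a hash key;
# second pass prints file2 lines found in that hash.
awk 'NR==FNR { seen[$0]; next } $0 in seen' file1 file2 > common.txt

cat common.txt    # banana, cherry
```

Both files are read only once, so this also avoids any sorting.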