Searching for a single word in a 25 GB corpus
I have to search for a single word in a 25 GB Wikipedia corpus. I used grep, but it takes a lot of time. Is there an efficient and simple representation that can be searched quickly? Also, I want to find exact matches only.
Thanks.
You would probably want to build an index mapping each word to a list of locations (byte offsets). The list of words would be sorted alphabetically. You could then have a secondary index recording where each starting letter begins in this large list of words.
This is the approach advocated by the natural language professor at my department (we did this exercise as a lab in an algorithms course).
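A minimal sketch of that layout in Java (the tiny corpus and all names here are just for illustration; TreeMap stands in for the sorted word list):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    public class WordIndexSketch {
        public static void main(String[] args) {
            // word -> list of byte offsets; TreeMap keeps the word list alphabetical
            TreeMap<String, List<Long>> index = new TreeMap<>();
            String corpus = "the bear ate the best bad apple";
            // Record each word's starting offset (ASCII, so char offset == byte offset).
            for (int i = 0; i < corpus.length(); ) {
                int end = corpus.indexOf(' ', i);
                if (end < 0) end = corpus.length();
                index.computeIfAbsent(corpus.substring(i, end), k -> new ArrayList<>())
                     .add((long) i);
                i = end + 1;
            }
            // Secondary index: for each starting letter, the first word in the sorted list.
            Map<Character, String> letterIndex = new TreeMap<>();
            for (String w : index.keySet()) letterIndex.putIfAbsent(w.charAt(0), w);

            System.out.println(index.get("the")); // [0, 13] - an exact-match lookup
            System.out.println(letterIndex);      // {a=apple, b=bad, t=the}
        }
    }

For a 25 GB corpus the map would be written to disk rather than held in memory, but the shape of the index is the point.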
Have you tried using an indexing engine... say, Lucene with Nutch? Lucene is an indexing engine; Nutch is a web crawler. Combine their powers!
I forgot to mention... CouchDB (http://couchdb.apache.org/)
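In case it helps, the core of the Lucene route is small. A rough sketch, assuming a recent Lucene release on the classpath; the paths and field name are made up. TermQuery gives exact-token matches, which is what the OP asked for:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.store.FSDirectory;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.stream.Stream;

    public class CorpusSearch {
        public static void main(String[] args) throws Exception {
            Path indexDir = Paths.get("wiki-index"); // hypothetical location
            // 1. Index once (slow, but done only one time); one document per line here.
            try (IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir),
                    new IndexWriterConfig(new StandardAnalyzer()))) {
                try (Stream<String> lines = Files.lines(Paths.get("corpus.txt"))) {
                    lines.forEach(line -> {
                        Document doc = new Document();
                        doc.add(new TextField("body", line, Field.Store.YES));
                        try { writer.addDocument(doc); }
                        catch (Exception e) { throw new RuntimeException(e); }
                    });
                }
            }
            // 2. Search many times (fast). The term is lowercase because
            // StandardAnalyzer lowercases tokens at index time.
            try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(indexDir))) {
                IndexSearcher searcher = new IndexSearcher(reader);
                ScoreDoc[] hits = searcher
                        .search(new TermQuery(new Term("body", "example")), 10).scoreDocs;
                System.out.println(hits.length + " hit(s)");
            }
        }
    }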
I've had success with the Boyer-Moore algorithm and its simplified version, Boyer-Moore-Horspool (http://en.wikipedia.org/wiki/Boyer-Moore-Horspool_algorithm). There are implementations for various languages floating around the web.
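For reference, a compact Horspool sketch in Java (the class and method names are mine). Note it finds byte substrings, so for exact word matches you'd include the surrounding delimiters in the pattern:

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class Horspool {
        // Returns the first index of needle in haystack, or -1 (Boyer-Moore-Horspool).
        static int indexOf(byte[] haystack, byte[] needle) {
            if (needle.length == 0) return 0;
            int[] skip = new int[256];              // bad-character shift table
            Arrays.fill(skip, needle.length);
            for (int i = 0; i < needle.length - 1; i++)
                skip[needle[i] & 0xFF] = needle.length - 1 - i;
            for (int pos = 0; pos <= haystack.length - needle.length;
                     pos += skip[haystack[pos + needle.length - 1] & 0xFF]) {
                int j = needle.length - 1;          // compare right to left
                while (j >= 0 && haystack[pos + j] == needle[j]) j--;
                if (j < 0) return pos;
            }
            return -1;
        }

        public static void main(String[] args) {
            byte[] text = "the best bear is bad".getBytes(StandardCharsets.UTF_8);
            byte[] pat  = " best ".getBytes(StandardCharsets.UTF_8); // pad for exact words
            System.out.println(indexOf(text, pat)); // prints 3
        }
    }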
@aloobe had the answer of using an index file that mapped words to locations. I just want to expound upon this, though I think the answer the OP is looking for may just be Boyer-Moore.
The index file is a list of byte offsets, one for each word (or letter) you've deemed important enough to index, simplified here to human-readable 2-digit offsets. In practice, you won't use letter indices, as the index file would then be larger than the corpus!
The trick is that if you were to substitute the words at those locations for the locations themselves, your index file would read as an alphabetically-sorted version of the corpus, as in the sketch below.
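For illustration, a 12-word index could look like the following; the words and most offsets here are invented, keeping only the values cited in the walkthrough below (79, 88, 01, 05, 15), and the file itself stores only the offset column:

    entry  offset  word found at that offset
      1      44    a
      2      07    an
      3      62    ant
      4      30    at
      5      24    ax
      6      79    bad
      7      35    bag
      8      52    ball
      9      88    bear
     10      01    best
     11      05    big
     12      15    boy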
This enables you to do a binary search on the corpus through the index file. If you are searching for the word "best" above, you would grab the middle entry in the index file, 79. Then you would go to position/byte 79 in the corpus and see what word is there. It is "bad". We know that alphabetically "best" > "bad", so the match must be in the 2nd half of the index file. So we grab the middle entry between the 6th (79) and the 12th (15), i.e. the 9th, which is 88 in my example. We look at position/byte 88 in the corpus and find "bear". Again "best" > "bear", so we try once more; the middle index is now either the 10th entry (01) or the 11th (05), depending on how you round, but clearly we'll find "best" within 1 or 2 more searches. If we have 12 words, as in the example, it takes at most 4 searches in the worst case. A 25 GB file with an average word length of, say, 5 letters plus a separating space holds roughly 4 billion words, yet even the worst case is only about 32 searches (log2 of 4 billion). At that point, more of your program's time is spent spinning up the disk and buffering input than actually searching!
This method works with duplicate words as well. If you want to find all of the locations of the word "the", you would binary search on "the" until you found an index. Then you would repeatedly subtract 1 from the position in the index file, each time using the value there to look into the corpus. If the word at that location is still "the", continue. When you finally stop, you have the earliest entry in the index file that maps to "the".
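Here is a rough sketch of the lookup half in Java, assuming the index file is a flat, sorted array of 8-byte offsets as described; the file names are made up, and words are assumed to be ASCII and delimiter-terminated:

    import java.io.RandomAccessFile;

    public class CorpusLookup {
        // Reads the word starting at byte `offset` in the corpus.
        static String wordAt(RandomAccessFile corpus, long offset) throws Exception {
            corpus.seek(offset);
            StringBuilder sb = new StringBuilder();
            for (int c; (c = corpus.read()) > ' '; ) sb.append((char) c);
            return sb.toString();
        }

        // Binary search over the index file: each entry is an 8-byte offset,
        // and the entries are sorted by the word found at each offset.
        static long find(RandomAccessFile index, RandomAccessFile corpus, String target)
                throws Exception {
            long lo = 0, hi = index.length() / 8 - 1;
            while (lo <= hi) {
                long mid = (lo + hi) >>> 1;
                index.seek(mid * 8);
                long offset = index.readLong();
                int cmp = target.compareTo(wordAt(corpus, offset));
                if (cmp == 0) return offset;
                if (cmp < 0) hi = mid - 1; else lo = mid + 1;
            }
            return -1; // not found
        }

        public static void main(String[] args) throws Exception {
            try (RandomAccessFile index = new RandomAccessFile("corpus.idx", "r");
                 RandomAccessFile corpus = new RandomAccessFile("corpus.txt", "r")) {
                System.out.println(find(index, corpus, "best"));
            }
        }
    }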
The creation of the index file is the only tough part. You need to go through each word in the corpus, building up a data structure of the words and their indices. Along the way, skip words that are too common or too short to be worth listing, like "a", "I", "the", "and", "is", etc. When you are finished, you can turn that data structure into the index file. For a 25 GB file, your indices unfortunately need more than 32 bits, so use a long (in Java) or a long long (in C) to hold them. There's no reason the index file has to be human-readable, so write the indices out as 64-bit values, not strings.
The structure I would recommend is a self-balancing binary search tree. Each node holds a string value (the word) and an index; the tree compares nodes based only on the string, however. If you do this, then an in-order traversal (left, node, right) will give you exactly the index file.
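And a rough sketch of the building half under the same assumptions; TreeMap is Java's built-in red-black tree (a self-balancing BST), so iterating it is exactly the in-order traversal described:

    import java.io.BufferedInputStream;
    import java.io.BufferedOutputStream;
    import java.io.DataOutputStream;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Set;
    import java.util.TreeMap;

    public class IndexBuilder {
        public static void main(String[] args) throws Exception {
            Set<String> stopWords = Set.of("a", "i", "the", "and", "is");
            // word -> every offset where it occurs; TreeMap keeps words sorted
            TreeMap<String, List<Long>> index = new TreeMap<>();
            try (BufferedInputStream in =
                    new BufferedInputStream(new FileInputStream("corpus.txt"))) {
                StringBuilder word = new StringBuilder();
                long pos = 0, start = 0;
                for (int c; (c = in.read()) != -1; pos++) {
                    if (Character.isLetter(c)) {
                        if (word.length() == 0) start = pos; // remember word start
                        word.append((char) Character.toLowerCase(c));
                    } else if (word.length() > 0) {
                        String w = word.toString();
                        if (!stopWords.contains(w))
                            index.computeIfAbsent(w, k -> new ArrayList<>()).add(start);
                        word.setLength(0);
                    }
                }
                if (word.length() > 0 && !stopWords.contains(word.toString()))
                    index.computeIfAbsent(word.toString(), k -> new ArrayList<>()).add(start);
            }
            // In-order traversal of the tree = alphabetical order = the index file.
            // Offsets are written as raw 64-bit values, not strings.
            try (DataOutputStream out = new DataOutputStream(
                    new BufferedOutputStream(new FileOutputStream("corpus.idx")))) {
                for (List<Long> offsets : index.values())
                    for (long offset : offsets) out.writeLong(offset);
            }
        }
    }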
Hope this helps! An example I used years ago developing a mobile phone dictionary is Jim Breen's EDICT. It may be difficult to pick up because of the EUC-encoding and Japanese characters, but the intent is the same.