A way to store a large dictionary with a low memory footprint + fast lookups (on Android)
I'm developing an Android word game app that needs a large dictionary (~250,000 words) available. I need:
- reasonably fast lookups, preferably constant time. Occasionally I need to do maybe 200 lookups a second to solve a word puzzle, and more often maybe 20 lookups within 0.2 seconds to check words the user just spelled.
EDIT: Lookups typically ask "Is this word in the dictionary?". I'd also like to support up to two wildcards in a word, but that is easy enough: generate every letter each wildcard could stand for and check the generated words (i.e. 26 * 26 lookups for a word with two wildcards).
- as it's a mobile app, using as little memory as possible and requiring only a small initial download for the dictionary data is top priority.
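The wildcard expansion mentioned in the EDIT could look roughly like the sketch below. It assumes lowercase a-z words and `?` as the wildcard character, and it uses a plain `Set<String>` as a stand-in for whatever dictionary structure ends up being used; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Illustrative sketch: expand '?' wildcards and keep the candidates found in the dictionary. */
final class WildcardExpander {
    static final char WILDCARD = '?'; // assumed wildcard marker

    /** Returns every a-z expansion of the pattern that the dictionary contains. */
    static List<String> matches(String pattern, Set<String> dictionary) {
        List<String> found = new ArrayList<>();
        expand(pattern.toCharArray(), 0, dictionary, found);
        return found;
    }

    private static void expand(char[] chars, int from, Set<String> dictionary, List<String> found) {
        // Find the next wildcard at or after position 'from'.
        int i = from;
        while (i < chars.length && chars[i] != WILDCARD) {
            i++;
        }
        if (i == chars.length) {
            // No wildcard left: this is a concrete word, so do one lookup.
            String candidate = new String(chars);
            if (dictionary.contains(candidate)) {
                found.add(candidate);
            }
            return;
        }
        // Try all 26 letters in place of this wildcard.
        for (char c = 'a'; c <= 'z'; c++) {
            chars[i] = c;
            expand(chars, i + 1, dictionary, found);
        }
        chars[i] = WILDCARD; // restore the wildcard for the caller's loop
    }
}
```

With two wildcards this performs 26 * 26 = 676 lookups, which stays well within the lookup budget described above as long as a single lookup is cheap.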
My first naive attempt used Java's HashMap class, which caused an out-of-memory exception. I've looked into using the SQLite databases available on Android, but this seems like overkill.
What's a good way to do what I need?
Comments (7)
You can also achieve your goals with lower-level approaches. If it's a word game, then I suspect you are handling a 27-letter alphabet. So suppose an alphabet of no more than 32 letters, i.e. 5 bits per letter. You can then cram 12 letters (12 x 5 = 60 bits) into a single Java long using a trivial 5 bits/letter encoding.
This means that if you don't have words longer than 12 letters, you can represent your dictionary as a set of Java longs. With 250,000 words, a trivial representation of this set as a single sorted array of longs takes 250,000 words x 8 bytes/word = 2,000,000 bytes, i.e. about 2 MB of memory. Lookup is then by binary search, which should be very fast given the small size of the data set (fewer than 20 comparisons, since 2^20 is already above one million).
If you have words longer than 12 letters, then I would store those in another array, where each word is represented by 2 concatenated Java longs in the obvious manner.
NOTE: the reason this works, is likely more space-efficient than a trie, and is at least very simple to implement is that the dictionary is constant. Search trees are good if you need to modify the data set, but if the data set is constant, you can often get away with a simple binary search.
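A minimal sketch of this idea, assuming lowercase a-z words of at most 12 letters (longer words are simply skipped here rather than stored in the second array described above). The class and method names are hypothetical; only `Arrays.sort` and `Arrays.binarySearch` are standard library calls.

```java
import java.util.Arrays;

/** Illustrative sketch: a fixed dictionary stored as a sorted long[] with 5 bits per letter. */
final class PackedWordSet {
    private static final int MAX_LETTERS = 12; // 12 letters x 5 bits = 60 bits
    private final long[] codes;                // packed words, sorted ascending

    PackedWordSet(String[] words) {
        long[] tmp = new long[words.length];
        int n = 0;
        for (String w : words) {
            long code = encode(w);
            if (code != -1L) {       // skip words this sketch cannot represent
                tmp[n++] = code;
            }
        }
        codes = Arrays.copyOf(tmp, n);
        Arrays.sort(codes);
    }

    /** Packs a lowercase a-z word into a long, or returns -1 if it does not fit. */
    static long encode(String word) {
        if (word.isEmpty() || word.length() > MAX_LETTERS) {
            return -1L;
        }
        long code = 0L;
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            if (c < 'a' || c > 'z') {
                return -1L;
            }
            // Letters map to 1..26; 0 is never used, so words of different
            // lengths can never produce the same code.
            code = (code << 5) | (c - 'a' + 1);
        }
        return code;
    }

    /** Binary search over the sorted codes. */
    boolean contains(String word) {
        long code = encode(word);
        return code != -1L && Arrays.binarySearch(codes, code) >= 0;
    }
}
```

For 250,000 words the `long[]` is the roughly 2 MB computed above, and because the array is a flat block of primitives there is no per-entry object overhead of the kind a `HashMap` adds.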
I am assuming that you want to check whether a given word belongs to the dictionary.
Have a look at the Bloom filter.
A Bloom filter can answer "does X belong to a predefined set" queries with very small storage requirements. If the answer to a query is yes, it has a small (and adjustable) probability of being wrong; if the answer is no, it is guaranteed to be correct.
According to the Wikipedia article, you would need less than 4 MB of space for your dictionary of 250 000 words with a 1% error probability.
The Bloom filter will correctly answer "is in dictionary" if the word actually is in the dictionary. If the dictionary does not have the word, the Bloom filter may falsely answer "is in dictionary" with some small probability.
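As a rough illustration, here is a small Bloom filter sketch built on `java.util.BitSet`. It uses the standard sizing formulas m = -n ln p / (ln 2)^2 bits and k = (m / n) ln 2 hash functions, which for 250,000 words at a 1% false-positive rate come out to about 2.4 million bits (roughly 300 KB) and 7 hashes, comfortably under the 4 MB figure. The class name, the hash function (an FNV-1a-style hash over chars followed by a 64-bit mixing step) and the double-hashing scheme are my own choices for the sketch, not something prescribed above.

```java
import java.util.BitSet;

/** Illustrative Bloom filter sketch for dictionary words. */
final class WordBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    WordBloomFilter(int expectedWords, double falsePositiveRate) {
        double ln2 = Math.log(2);
        // m = -n ln p / (ln 2)^2 bits, k = (m / n) ln 2 hash functions
        numBits = (int) Math.ceil(-expectedWords * Math.log(falsePositiveRate) / (ln2 * ln2));
        numHashes = Math.max(1, (int) Math.round((double) numBits / expectedWords * ln2));
        bits = new BitSet(numBits);
    }

    int sizeInBits() { return numBits; }

    int hashCount() { return numHashes; }

    void add(String word) {
        long h = hash64(word);
        int h1 = (int) h;
        int h2 = (int) (h >>> 32) | 1; // odd step for double hashing
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(h1, h2, i));
        }
    }

    /** False means definitely absent; true means present or a false positive. */
    boolean mightContain(String word) {
        long h = hash64(word);
        int h1 = (int) h;
        int h2 = (int) (h >>> 32) | 1;
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(h1, h2, i))) {
                return false;
            }
        }
        return true;
    }

    // i-th probe position: (h1 + i * h2) mod m, forced non-negative first.
    private int index(int h1, int h2, int i) {
        return ((h1 + i * h2) & 0x7fffffff) % numBits;
    }

    // FNV-1a-style hash over the chars, then the MurmurHash3 64-bit finalizer.
    private static long hash64(String s) {
        long h = 0xcbf29ce484222325L;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x100000001b3L;
        }
        h ^= h >>> 33;
        h *= 0xff51afd7ed558ccdL;
        h ^= h >>> 33;
        h *= 0xc4ceb9fe1a85ec53L;
        h ^= h >>> 33;
        return h;
    }
}
```

Keep in mind that for a word game the false positives mean a small fraction of non-words would be accepted as valid, so whether this trade-off is acceptable depends on the game.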
A very efficient way to store a dictionary is a Directed Acyclic Word Graph (DAWG).
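To give a feel for the structure, the sketch below builds an ordinary trie and then merges identical subtrees, which is the essential idea of a DAWG: shared suffixes are stored only once. This is an assumption-laden illustration (hypothetical class name, object-per-node layout, simple signature-based merging), not a memory-optimized implementation; a real one would typically pack the nodes into flat arrays.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative DAWG sketch: a trie whose identical subtrees are merged. */
final class Dawg {
    private static final class Node {
        final TreeMap<Character, Node> next = new TreeMap<>();
        boolean terminal;
    }

    private final Node root;
    private final int nodeCount;

    Dawg(Iterable<String> words) {
        // Step 1: build a plain trie.
        Node trie = new Node();
        for (String w : words) {
            Node n = trie;
            for (int i = 0; i < w.length(); i++) {
                Node child = n.next.get(w.charAt(i));
                if (child == null) {
                    child = new Node();
                    n.next.put(w.charAt(i), child);
                }
                n = child;
            }
            n.terminal = true;
        }
        // Step 2: merge equivalent nodes bottom-up.
        Map<String, Node> registry = new HashMap<>();
        Map<Node, Integer> ids = new HashMap<>();
        root = minimize(trie, registry, ids);
        nodeCount = registry.size();
    }

    // Two nodes are equivalent if they agree on the terminal flag and on
    // (letter -> canonical child) for every outgoing edge.
    private static Node minimize(Node node, Map<String, Node> registry, Map<Node, Integer> ids) {
        StringBuilder sig = new StringBuilder(node.terminal ? "1" : "0");
        for (Character c : new ArrayList<>(node.next.keySet())) {
            Node canonical = minimize(node.next.get(c), registry, ids);
            node.next.put(c, canonical);
            sig.append(c).append(ids.get(canonical)).append(';');
        }
        String key = sig.toString();
        Node existing = registry.get(key);
        if (existing != null) {
            return existing; // reuse the already registered equivalent node
        }
        ids.put(node, ids.size());
        registry.put(key, node);
        return node;
    }

    int nodeCount() { return nodeCount; }

    boolean contains(String word) {
        Node n = root;
        for (int i = 0; i < word.length() && n != null; i++) {
            n = n.next.get(word.charAt(i));
        }
        return n != null && n.terminal;
    }
}
```

For the words tap, taps, top and tops, a plain trie needs 8 nodes while the merged graph needs 5, because the "p", "ps" endings are shared between the "ta" and "to" branches.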
Here are some links:
You'll want some sort of trie. A ternary search trie (TST) might be a good fit, I think: it gives very fast lookups and low memory usage. This paper gives some more info about TSTs. It also talks about sorting, so not all of it will apply. This article might be a little more applicable. As the article says, TSTs
As this table shows, the look-up times are very comparable to using a hash table.
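For reference, a ternary search trie can be sketched in a few dozen lines. Each node holds one character plus three links: lower, equal (next character of the word) and higher. The class below is a minimal illustration with hypothetical names, not code taken from the paper or article mentioned above.

```java
/** Illustrative ternary search trie supporting insert and exact-match lookup. */
final class TernarySearchTrie {
    private static final class Node {
        final char ch;
        Node lo, eq, hi;   // smaller char, next char of the word, larger char
        boolean terminal;  // true if a word ends at this node

        Node(char ch) { this.ch = ch; }
    }

    private Node root;

    void insert(String word) {
        if (word.isEmpty()) {
            return; // this sketch does not store the empty string
        }
        root = insert(root, word, 0);
    }

    private Node insert(Node node, String word, int i) {
        char c = word.charAt(i);
        if (node == null) {
            node = new Node(c);
        }
        if (c < node.ch) {
            node.lo = insert(node.lo, word, i);
        } else if (c > node.ch) {
            node.hi = insert(node.hi, word, i);
        } else if (i + 1 < word.length()) {
            node.eq = insert(node.eq, word, i + 1);
        } else {
            node.terminal = true;
        }
        return node;
    }

    boolean contains(String word) {
        if (word.isEmpty()) {
            return false;
        }
        Node node = root;
        int i = 0;
        while (node != null) {
            char c = word.charAt(i);
            if (c < node.ch) {
                node = node.lo;
            } else if (c > node.ch) {
                node = node.hi;
            } else if (++i == word.length()) {
                return node.terminal; // matched every character
            } else {
                node = node.eq;
            }
        }
        return false;
    }
}
```

Note that on Android each node here is a separate object, so for 250,000 words the per-object overhead is worth measuring before committing to this layout.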
You could also use the Android NDK and implement the structure in C or C++.
The devices I worked on basically worked from a binary compressed file whose topology resembled a binary tree. At the leaves, you would have the Huffman-compressed text. Finding a node involved skipping to various locations in the file and then loading only the portion of the data that was really needed.
Very cool idea, as suggested by "Antti Huima": store the dictionary words as longs and then search them with binary search.