How to implement a dictionary (Trie vs HashTable and important issues)?
I've run across several questions and articles saying that a dictionary in Java is best implemented using tries. But as far as I can see, most of them don't address the important issues. So here is a real-world task:
Let's assume that I need to implement a dictionary (something like Lingvo, but simpler) in Java. For my particular task I need to store word definitions and perform fast dictionary lookups.
Please address the following questions:
- What data structure should I use (Trie or HashTable)?
- How should it (search, data structure) be organised if I need the dictionary to be case-insensitive?
- What if I want it (search, dictionary) to be case-sensitive?
P.S.: Code examples are highly appreciated. :)
Thanks in advance for your answers.
UPDATE: If we're talking about standard data structure implementations in Java, is it true that HashTable will be the best one for this particular task? Why not HashMap, TreeMap or LinkedHashMap?
I want to address just one point in your question:
A trie is not a general-purpose dictionary data structure. The reason is that a trie is a specialized search tree for (sub)string search. Generally, you will be more interested in general search trees, e.g. binary search trees or B-trees.
All these implementations rely on an ordering of the dictionary elements, and, as long as the tree is kept balanced, all of them have logarithmic average-case and worst-case runtime for common operations.
A hash table, by contrast, does not require a relative ordering of the elements. Instead, it requires that elements are hashable and equality comparable. The worst-case behaviour of common hash table implementations is much worse than for trees, namely linear in the number of elements.
However, with a bit of care the average case for hash table operations can be made constant (i.e. independent of the container size). What's more, it can be proven that slower operations are exceedingly rare.
In practice, this means that except for very specialized use-cases, hash tables beat tree-based dictionaries hands down.
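As a rough sketch of what this could look like for the question's word/definition lookup, here is a small wrapper around `HashMap`. The class name `WordDictionary` and its methods are made up for illustration. Case-insensitivity is handled by normalizing every key with `toLowerCase(Locale.ROOT)` on both insert and lookup, which is one simple option among several.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical class for illustration: a word -> definition map on top of HashMap.
class WordDictionary {
    private final Map<String, String> definitions = new HashMap<>();
    private final boolean caseSensitive;

    WordDictionary(boolean caseSensitive) {
        this.caseSensitive = caseSensitive;
    }

    // Normalize the key in one place so that define() and lookup() always agree.
    private String key(String word) {
        return caseSensitive ? word : word.toLowerCase(Locale.ROOT);
    }

    void define(String word, String definition) {
        definitions.put(key(word), definition);
    }

    // Returns null when the word is not in the dictionary.
    String lookup(String word) {
        return definitions.get(key(word));
    }

    public static void main(String[] args) {
        WordDictionary dict = new WordDictionary(false);
        dict.define("Trie", "a prefix tree");
        System.out.println(dict.lookup("TRIE")); // prints "a prefix tree"
    }
}
```

With `caseSensitive` set to `true` the keys are stored untouched, so "Trie" and "trie" become two separate entries.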
The downside to this is that hash tables impose an arbitrary-seeming order on their elements. If you are interested in getting the items from your dictionary in sorted order, hash tables are not for you.
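If sorted iteration matters, a tree-based map is the standard alternative in Java. The sketch below (class and method names are invented for the example) uses `TreeMap` with `String.CASE_INSENSITIVE_ORDER`, so the keys are both sorted and looked up without regard to case.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical demo class: a sorted, case-insensitive dictionary using TreeMap.
class SortedDictionaryDemo {
    static TreeMap<String, String> build() {
        // The comparator decides both the iteration order and which keys count as equal.
        TreeMap<String, String> dict = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        dict.put("trie", "a prefix tree");
        dict.put("Hash", "a value derived from a key");
        dict.put("btree", "a balanced multiway search tree");
        return dict;
    }

    public static void main(String[] args) {
        // Iterates in the order btree, Hash, trie.
        for (Map.Entry<String, String> e : build().entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue());
        }
    }
}
```

Lookups cost O(log n) comparisons here instead of the expected constant time of a hash table, which is the trade-off described above.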
(There are other interesting implementations of dictionaries, e.g. skip lists, which rival search trees, and probabilistic implementations like the Bloom filter.)
A trie-based implementation can only be used if you are dealing with a dictionary whose keys are strings, in which case it is actually often a good choice, in particular if many strings in the dictionary share common prefixes and are rather short.
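For comparison, a minimal trie that stores a definition at the node where each word ends might look like the following. This is only a sketch: `TrieDictionary` and its method names are invented here, and the children are kept in a `TreeMap` per node for simplicity rather than for memory efficiency. The prefix query is the kind of operation a hash table cannot offer directly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical class for illustration: a trie mapping words to definitions.
class TrieDictionary {
    private static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
        String definition; // non-null only if a word ends at this node
    }

    private final Node root = new Node();

    void define(String word, String definition) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        node.definition = definition;
    }

    // Returns null when the word is not in the dictionary.
    String lookup(String word) {
        Node node = find(word);
        return node == null ? null : node.definition;
    }

    // All stored words starting with the prefix, in sorted order.
    List<String> wordsWithPrefix(String prefix) {
        List<String> result = new ArrayList<>();
        Node node = find(prefix);
        if (node != null) {
            collect(node, new StringBuilder(prefix), result);
        }
        return result;
    }

    // Walks down from the root; returns null if the path does not exist.
    private Node find(String s) {
        Node node = root;
        for (int i = 0; i < s.length() && node != null; i++) {
            node = node.children.get(s.charAt(i));
        }
        return node;
    }

    private void collect(Node node, StringBuilder path, List<String> out) {
        if (node.definition != null) {
            out.add(path.toString());
        }
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            path.append(e.getKey().charValue());
            collect(e.getValue(), path, out);
            path.setLength(path.length() - 1); // backtrack
        }
    }

    public static void main(String[] args) {
        TrieDictionary dict = new TrieDictionary();
        dict.define("card", "a piece of stiff paper");
        dict.define("car", "a road vehicle");
        dict.define("care", "serious attention");
        System.out.println(dict.wordsWithPrefix("car")); // prints [car, card, care]
    }
}
```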
EDIT: stop upvoting this. I misread the question. The OP is not after a dictionary to verify word spellings/suggestions/type-ahead-lookup/auto-completion/whatever (which I thought was what he was after). The OP is after a key/value mapping where for each word there's a definition.
Having worked on dictionaries, I can tell you that you're taking the wrong approach.
It's not as simple as a choice between a hashtable and a trie.
You mention Lingvo: it's much more than just a table.
Do you want close matches to be offered as suggestions? You may then need to do things like generating permutations of what the user entered and, for each permutation, checking whether it exists in the dictionary: if it does, you'd then need to compute its Levenshtein edit distance (LED) and suggest first the words that have the shortest LED.
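To make the edit-distance part concrete, here is a sketch of the classic dynamic-programming Levenshtein computation, plus a naive `suggest` helper that ranks candidates by distance. Note the simplification: instead of generating permutations of the input, the helper just scans a given word list, which is only reasonable for small lists. All names are made up for the example.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper class for illustration.
class EditDistance {
    // Levenshtein distance: minimum number of single-character insertions,
    // deletions and substitutions needed to turn a into b.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Naive suggestion: keep words within maxDistance, closest first.
    static List<String> suggest(String input, Collection<String> words, int maxDistance) {
        List<String> hits = new ArrayList<>();
        for (String w : words) {
            if (levenshtein(input, w) <= maxDistance) {
                hits.add(w);
            }
        }
        hits.sort(Comparator.comparingInt((String w) -> levenshtein(input, w)));
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("kitten", "sitting")); // prints 3
    }
}
```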
Do you want most likely matches to be auto-completed/suggested (like what Google does)? You'd then need a very advanced data structure like a BK-tree (basically a tree of LED if I understand it correctly).
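For what it's worth, a BK-tree can be sketched in a few dozen lines. Each node holds a word, and each child edge is labelled with the edit distance between the child's word and the parent's word. A query then uses the triangle inequality to skip subtrees that cannot contain a match. The class below is a minimal illustration with invented names, not a tuned implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical class for illustration: a BK-tree over Levenshtein distance.
class BkTree {
    private static final class Node {
        final String word;
        // Children keyed by their edit distance to this node's word.
        final Map<Integer, Node> children = new HashMap<>();

        Node(String word) {
            this.word = word;
        }
    }

    private Node root;

    void add(String word) {
        if (root == null) {
            root = new Node(word);
            return;
        }
        Node node = root;
        while (true) {
            int d = distance(word, node.word);
            if (d == 0) {
                return; // word already present
            }
            Node child = node.children.get(d);
            if (child == null) {
                node.children.put(d, new Node(word));
                return;
            }
            node = child;
        }
    }

    // All stored words within maxDistance of the query (in no particular order).
    List<String> search(String query, int maxDistance) {
        List<String> result = new ArrayList<>();
        if (root != null) {
            search(root, query, maxDistance, result);
        }
        return result;
    }

    private void search(Node node, String query, int max, List<String> out) {
        int d = distance(query, node.word);
        if (d <= max) {
            out.add(node.word);
        }
        // Triangle inequality: matches can only sit under edges labelled d-max .. d+max.
        for (int k = Math.max(1, d - max); k <= d + max; k++) {
            Node child = node.children.get(k);
            if (child != null) {
                search(child, query, max, out);
            }
        }
    }

    // Standard two-row Levenshtein distance.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        BkTree tree = new BkTree();
        for (String w : new String[] {"book", "books", "cake", "boo", "cook"}) {
            tree.add(w);
        }
        System.out.println(tree.search("bok", 1)); // contains "book" and "boo"
    }
}
```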
How many words will you have in your dictionary? You won't be able to use a dictionary made of 400 000 words using Strings and other heavyweight Java objects / data structures without a serious performance hit (once again: a dictionary is more than just one hashtable, a dictionary typically involves several data structures). This won't easily fit in your users' computer memory. There are known, searchable ways to store words where every single word can be packed into fewer than 15 bits (fewer than 15 bits per word, you read correctly).
In addition to that, you may want to make suggestions based on phonetics, for example by using a double-metaphone mapping.
A dictionary, as in a "word dictionary", is so much more than just a key/value table. It is really a complicated beast, due to the features the user will expect and the amount of data involved. Just plain English plus the terminology of a few specialized domains (medical, comp-sci, whatever) will give you hundreds of thousands of entries: try to put that in a Java HashMap and... Kaboom!
For a dictionary implementation in Java, hash collections are definitely the best bet.
Regarding the standard map classes:
- HashMap or Hashtable: mainly, if your class is used in a multithreaded manner, use Hashtable, which is synchronized (in newer code ConcurrentHashMap is the usual choice); otherwise HashMap is the best option.
- HashMap vs TreeMap: if you need the keys kept in sorted order, use TreeMap. Note that TreeMap does not preserve insertion order.
- HashMap vs LinkedHashMap: the LinkedHashMap implementation differs from HashMap in that it maintains a doubly-linked list running through all of its entries. This linked list defines the iteration ordering, which is normally the order in which keys were inserted into the map (insertion-order). Note that insertion order is not affected if a key is re-inserted into the map. (A key k is reinserted into a map m if m.put(k, v) is invoked when m.containsKey(k) would return true immediately prior to the invocation.)
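A quick way to see how the iteration orders differ is to run the same inserts against each map type. The sketch below (class and method names are invented) compares `LinkedHashMap`, which keeps insertion order, with `TreeMap`, which keeps keys sorted. `HashMap` is left out because its iteration order is unspecified.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical demo class for illustration.
class MapOrderDemo {
    // Performs the same puts on any map and returns the keys in iteration order.
    static List<String> keysAfterInserts(Map<String, String> map) {
        map.put("trie", "a prefix tree");
        map.put("hash", "a value derived from a key");
        map.put("btree", "a balanced multiway search tree");
        map.put("trie", "a prefix tree (updated)"); // re-insertion: value changes, position does not
        return new ArrayList<>(map.keySet());
    }

    public static void main(String[] args) {
        System.out.println(keysAfterInserts(new LinkedHashMap<>())); // prints [trie, hash, btree]
        System.out.println(keysAfterInserts(new TreeMap<>()));       // prints [btree, hash, trie]
    }
}
```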