设计牛津英语词典
在一次采访中有人问我将如何设计《牛津英语词典》。
我告诉他我会使用 TREE 数据结构,但他回答说这会占用大量内存。那么应该使用哪种其他数据结构呢?
I was asked in an interview how I would design the Oxford English Dictionary.
I told him that I'd use a TREE data structure, but he replied that it would take a lot of memory. So which other data structure should be used?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
我听说过去在手机中用于存储 T9 词典的一种数据结构如下(好吧,这仅解决了关键问题,但没有解决定义存储):
条目已排序,每个条目应以偏移量开始上一个条目应该从哪里继续,也是延续。例如:
将解码为 apple、applicable、application。然而,这可能与合并链的尝试没有什么不同,请参阅
Wikipedia 发现的有向非循环词图 ,它与树的不同之处在于它不仅有分支,而且分支可以合并,其中单词具有相同的后缀。这确实可以成为一个优越的存储空间。
One data structure I heard was used in the past in mobile phones for storing T9 dictionaries is the following (well, this addresses only the key issue, but not the definition storage):
Entries are sorted, and each entry should start with an offset into the previous entry from where it should be continued, and also the continuation. For example:
would decode to apple, applicable, application. However this might not be that different from tries with merged chains, see
Wikipedia uncovered the Directed acyclic word graph, which differs from trees that it not only branches, but branches can merge, where words have the same suffix. This indeed could be a superior storage.
它不会使用很多内存。你的回答很好。也许是 1995 年。认为自己很幸运。
It wouldn't use a lot of memory. Your answer was fine. Maybe in 1995. Consider yourself lucky.
正如其他人提到的,如果没有足够的屋顶来容纳精心设计的特里树,那么可能也没有空间容纳任何其他类型的索引。由于这是一个面试问题,听起来他试图引导您使用 B 树等经典的外核数据结构。
或者,一个好的回应可能是询问更多信息,例如“您想对此数据结构执行什么类型的操作,以及您需要什么类型的性能?”如果您只想要拼写检查,那么布隆过滤器可能是最有效的“数据结构”......
As others have mentioned, if there isn't enough roof for a well-designed trie, there probably isn't room for any other kind of index, either. Since this is about an interview question, it sounds like he was trying to steer you towards classic out-of-core datastructures like B-trees.
Alternately, a good response might have been to ask for more information, like "what kinds of operations will you want to do on this datastructure, and what kind of performance do you need?" If you just want a spellcheck, then a Bloom filter might be the most efficient "datastructure"...