Persisting a trie to disk
This sounds like a simple question, but I don't know how to search for its answer.
I have a trie implementation in C# that will store about 80K words from a dictionary file. It takes quite a while to load all these words (more than 5 minutes). What is the best way to "persist" that data so I don't have to reload all the words every time I start the application?
Thanks.
Comments (3)
I would just serialize it in the old MFC binary fashion. Basically the reading/writing should be about as fast as possible, and the only thing you're left with is allocating and initializing the structure on input, which you need to do anyway.
That is, to serialize a node of the trie, you do this:
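A minimal sketch of this kind of node-by-node binary serialization, assuming a hypothetical TrieNode class (the question does not show its trie type) with an end-of-word flag and a map of children. It illustrates the idea under those assumptions and is not the answerer's original code.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical node type: the question does not show its own trie class.
public sealed class TrieNode
{
    public bool IsWord;
    public SortedDictionary<char, TrieNode> Children = new SortedDictionary<char, TrieNode>();

    public void Insert(string word)
    {
        TrieNode node = this;
        foreach (char c in word)
        {
            if (!node.Children.TryGetValue(c, out TrieNode next))
            {
                next = new TrieNode();
                node.Children.Add(c, next);
            }
            node = next;
        }
        node.IsWord = true;
    }

    public bool Contains(string word)
    {
        TrieNode node = this;
        foreach (char c in word)
        {
            if (!node.Children.TryGetValue(c, out node)) return false;
        }
        return node.IsWord;
    }

    // Pre-order write: end-of-word flag, child count, then (label, subtree) per child.
    public void Write(BinaryWriter writer)
    {
        writer.Write(IsWord);
        writer.Write(Children.Count);
        foreach (KeyValuePair<char, TrieNode> child in Children)
        {
            writer.Write(child.Key);
            child.Value.Write(writer);
        }
    }

    // Mirrors Write: allocates each node and reads its children recursively.
    public static TrieNode Read(BinaryReader reader)
    {
        var node = new TrieNode { IsWord = reader.ReadBoolean() };
        int count = reader.ReadInt32();
        for (int i = 0; i < count; i++)
        {
            char label = reader.ReadChar();
            node.Children.Add(label, Read(reader));
        }
        return node;
    }
}
```

To persist to disk, wrap a FileStream (for example from File.Create with a path of your choosing) in a BinaryWriter and call Write on the root; on start-up, open the same file with a BinaryReader and call TrieNode.Read. The recursion depth equals the longest word, which is small for dictionary words.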
Edit: Just re-read your question, and you want to build the trie from scratch from the wordlist? As others said, profile, but not just with any old profiler. They don't all find your problem. Here's what I do. The time it takes should not be much more than the time it takes to read the file plus the time it takes to create the structure.
I recently refactored a similar data structure, due to slow performance and slow serialization / deserialization times.
My solution was to scrap the trie altogether and go with native .NET collections - Dictionaries and Lookups.
I'm working with about 400k words. As far as I remember, it takes about 5 seconds to build the data structure, which is a list of objects indexed by a number of dictionaries and lookups.
The top level is a Dictionary<int, var>, where the key is n, the number of letters in the search term.
Each value in that dictionary is a Lookup<string, string>, where the key is a string with n letters and the value is all strings that start with that string. E.g. for the key 'st' the values might be 'start', 'stop' and 'string'.
To create the data structure I simply iterate over the entire list of words for i = 1 to maxlength to create a Lookup of all distinct 'starts with' strings for each i. Plug those into the top level dictionary and you're done.
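The construction described above might look roughly like the following. This is a sketch under assumptions: the PrefixIndex class and its method names are hypothetical, and the answer's Dictionary<int, var> is written here as Dictionary<int, ILookup<string, string>>, since var is not valid as a type argument.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper; names are illustrative and not from the original answer.
public static class PrefixIndex
{
    // Builds one lookup per prefix length n, keyed by the first n letters of each word.
    public static Dictionary<int, ILookup<string, string>> Build(IReadOnlyCollection<string> words)
    {
        var index = new Dictionary<int, ILookup<string, string>>();
        if (words.Count == 0) return index;

        int maxLength = words.Max(w => w.Length);
        for (int n = 1; n <= maxLength; n++)
        {
            // ToLookup runs eagerly, so capturing n in the lambdas is safe here.
            index[n] = words
                .Where(w => w.Length >= n)
                .ToLookup(w => w.Substring(0, n), StringComparer.Ordinal);
        }
        return index;
    }

    // Returns every word that starts with the given prefix (empty if none).
    public static IEnumerable<string> StartingWith(
        Dictionary<int, ILookup<string, string>> index, string prefix)
    {
        return index.TryGetValue(prefix.Length, out ILookup<string, string> lookup)
            ? lookup[prefix]
            : Enumerable.Empty<string>();
    }
}
```

For example, PrefixIndex.StartingWith(index, "st") looks up the n = 2 entry and returns the words filed under "st". The trade-off is memory: every word is referenced once per prefix length, and each distinct prefix becomes its own key string.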
This removes the need for a custom-built trie. I found the difference in search time to be negligible, but the loading speed hugely favours my design (not to mention the simplicity and maintainability of using plain .NET types).
Like all other performance issues, the ideal solution will follow from profiling your current solution and any other candidate solutions you come up with. Where's the bottleneck? The I/O? Lexing the text? Forming the links in the trie? It will be hard to make a concrete suggestion without knowing your performance goals, how the trie is used, and which bottlenecks are currently present.
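Before reaching for a full profiler, a rough first measurement can already separate file I/O from trie construction. The sketch below uses Stopwatch for that; LoadTimer and its build callback are hypothetical names, and the callback stands in for whatever code inserts the words into your trie.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;

// Hypothetical helper for a coarse two-phase timing; not a substitute for a profiler.
public static class LoadTimer
{
    // Returns the time spent reading the word file and the time spent in the build callback.
    public static (TimeSpan Read, TimeSpan Build) Measure(
        string path, Action<IReadOnlyList<string>> build)
    {
        var clock = Stopwatch.StartNew();
        string[] words = File.ReadAllLines(path);   // phase 1: I/O and line splitting
        TimeSpan read = clock.Elapsed;

        clock.Restart();
        build(words);                               // phase 2: building the structure
        return (read, clock.Elapsed);
    }
}
```

If the build phase dominates for a list of about 80K words, the insertion code is the place to look, and persisting a serialized trie would only hide the underlying cost.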
One possible strategy: Create and persist a 'most common words' dictionary with the 1,000 (or so) most frequently used words. Load these words into the trie on start-up, and spawn the loading of the full dictionary on another thread, incrementally adding to the trie as new words are read.
Issues to consider: synchronization, and the user will see an incomplete trie until loading is fully complete. This may or may not be a showstopper depending on what the trie is being used for.
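A minimal sketch of that strategy, assuming a hypothetical BackgroundLoadedTrie class: the common words are inserted synchronously, the full list is inserted on a thread-pool task, and a single lock handles the synchronization concern. A coarse lock is the simplest option, not necessarily the fastest.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical class; illustrates the two-stage load, not the asker's actual trie.
public sealed class BackgroundLoadedTrie
{
    private sealed class Node
    {
        public bool IsWord;
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
    }

    private readonly Node root = new Node();
    private readonly object gate = new object();

    // Completes once every word from the full dictionary has been inserted.
    public Task FullLoad { get; }

    public BackgroundLoadedTrie(IEnumerable<string> commonWords,
                                Func<IEnumerable<string>> readAllWords)
    {
        // Stage 1: small list, loaded before the constructor returns.
        foreach (string word in commonWords) Insert(word);

        // Stage 2: full list, loaded incrementally on another thread.
        FullLoad = Task.Run(() =>
        {
            foreach (string word in readAllWords()) Insert(word);
        });
    }

    public void Insert(string word)
    {
        lock (gate)
        {
            Node node = root;
            foreach (char c in word)
            {
                if (!node.Children.TryGetValue(c, out Node next))
                {
                    next = new Node();
                    node.Children.Add(c, next);
                }
                node = next;
            }
            node.IsWord = true;
        }
    }

    // May return false for a valid word until FullLoad has completed.
    public bool Contains(string word)
    {
        lock (gate)
        {
            Node node = root;
            foreach (char c in word)
            {
                if (!node.Children.TryGetValue(c, out node)) return false;
            }
            return node.IsWord;
        }
    }
}
```

In an application, readAllWords could be something like () => File.ReadLines(path) so that words are streamed from disk as they are inserted. Callers that need complete results can wait on FullLoad first.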