Saving a trie to disk

Posted 2024-09-20 00:18:17 · 166 words · 1 view · 0 comments

This sounds like a simple question, but I don't know how to search for its answer.

I have a trie implementation in C# that will store about 80K words from a dictionary file. It takes quite a while to load all these words (more than 5 mins). I was wondering, what is the best way to "persist" those data so I don't have to reload all words every time I start the application?

Thanks.

Comments (3)

虫児飞 2024-09-27 00:18:31

I would just serialize it in the old MFC binary fashion. Basically the reading/writing should be about as fast as possible, and the only thing you're left with is allocating and initializing the structure on input, which you need to do anyway.

That is, to serialize a node of the trie, you do this:

Read/Write number N of subnodes
For each subnode
  If reading, allocate a subnode in this node
  Read/Write the character for the subnode
  Serialize the subnode
End
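
The pseudocode above might look roughly like this in C#, using BinaryWriter and BinaryReader. This is only a sketch: the TrieNode class, its members, and the IsWord flag are assumptions (the question does not show its trie, and the pseudocode omits an end-of-word marker, which a word trie needs in order to round-trip). Characters are written as 16-bit values to sidestep encoding concerns.

```csharp
using System.Collections.Generic;
using System.IO;

// Hypothetical node type; the asker's actual trie class is not shown.
public sealed class TrieNode
{
    public bool IsWord; // assumption: marks the end of a stored word
    public readonly Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();

    public void Add(string word)
    {
        TrieNode node = this;
        foreach (char c in word)
        {
            if (!node.Children.TryGetValue(c, out TrieNode next))
            {
                next = new TrieNode();
                node.Children.Add(c, next);
            }
            node = next;
        }
        node.IsWord = true;
    }

    public bool Contains(string word)
    {
        TrieNode node = this;
        foreach (char c in word)
        {
            if (!node.Children.TryGetValue(c, out node)) return false;
        }
        return node.IsWord;
    }

    // Write this node, then recurse: flag, number of subnodes, (char, subnode) pairs.
    public void Write(BinaryWriter writer)
    {
        writer.Write(IsWord);
        writer.Write(Children.Count);
        foreach (KeyValuePair<char, TrieNode> pair in Children)
        {
            writer.Write((ushort)pair.Key);
            pair.Value.Write(writer);
        }
    }

    // Mirror of Write: allocate each subnode while reading it back.
    public static TrieNode Read(BinaryReader reader)
    {
        var node = new TrieNode();
        node.IsWord = reader.ReadBoolean();
        int count = reader.ReadInt32();
        for (int i = 0; i < count; i++)
        {
            char c = (char)reader.ReadUInt16();
            node.Children.Add(c, Read(reader));
        }
        return node;
    }
}
```

To persist, wrap a FileStream in a BinaryWriter and call Write on the root; at start-up, call TrieNode.Read on a BinaryReader over the same file.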

Edit: Just re-read your question, and you want to build the trie from scratch from the wordlist? As others said, profile, but not just with any old profiler. They don't all find your problem. Here's what I do. The time it takes should not be much more than the time it takes to read the file plus the time it takes to create the structure.

心如狂蝶 2024-09-27 00:18:29

I recently refactored a similar data structure, due to slow performance and slow serialization / deserialization times.

My solution was to scrap the trie altogether and go with native .NET collections - Dictionaries and Lookups.

I'm working with about 400k words. As I recall, it takes about 5 seconds to build the data structure, which is a list of objects indexed by a number of dictionaries and lookups.

  • The top level of the structure is a Dictionary<int, Lookup<string, string>>, where the key is n, the number of letters in the search term.
  • Each value in the dictionary is a Lookup<string, string>, where the key is a string with n letters and the values are all strings that start with that string. E.g. for the key 'st', the values might be 'start', 'stop' and 'string'.

To create the data structure I simply iterate over the entire list of words for i = 1 to maxlength to create a Lookup of all distinct 'starts with' strings for each i. Plug those into the top level dictionary and you're done.
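
A minimal sketch of that construction, assuming the words are already in memory. The PrefixIndex class and its method names are made up for illustration, and the value type is declared as ILookup<string, string>, the interface that Enumerable.ToLookup returns.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper; not the answerer's actual code.
public static class PrefixIndex
{
    // For each prefix length n, build a lookup from n-letter prefix to the words starting with it.
    public static Dictionary<int, ILookup<string, string>> Build(IReadOnlyCollection<string> words)
    {
        var index = new Dictionary<int, ILookup<string, string>>();
        int maxLength = words.Count == 0 ? 0 : words.Max(w => w.Length);
        for (int n = 1; n <= maxLength; n++)
        {
            int length = n;
            index[length] = words
                .Where(w => w.Length >= length)
                .ToLookup(w => w.Substring(0, length), StringComparer.Ordinal);
        }
        return index;
    }

    // Returns all words that start with the given prefix (empty if there are none).
    public static IEnumerable<string> StartingWith(
        Dictionary<int, ILookup<string, string>> index, string prefix)
    {
        return index.TryGetValue(prefix.Length, out ILookup<string, string> byPrefix)
            ? byPrefix[prefix]
            : Enumerable.Empty<string>();
    }
}
```

Each word is referenced once per prefix length, so this trades memory for simplicity; whether that is acceptable depends on the word list.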

This removes the need for a custom-built trie. I found the difference in search time to be negligible, while the loading speed hugely favours my design (not to mention the simplicity and maintainability of using plain .NET types).

吲‖鸣 2024-09-27 00:18:27

Like all other performance issues, the ideal solution will follow from profiling your current solution and any other candidate solutions that you come up with. Where's the bottleneck? The I/O? Lexing the text? Forming the links in the trie? It will be hard to make a concrete suggestion without knowing your performance goals, the nature of the trie usage and the bottlenecks currently present.
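
Before reaching for a profiler, a rough first cut is to time each phase separately with Stopwatch. The sketch below is one way to do that; PhaseTimer and Measure are hypothetical names, not part of any library.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper for timing one phase (file read, parsing, trie insertion, ...).
public static class PhaseTimer
{
    // Runs the phase, reports its label and elapsed time, and returns the phase's result.
    public static T Measure<T>(string label, Func<T> phase, Action<string, TimeSpan> report)
    {
        Stopwatch stopwatch = Stopwatch.StartNew();
        T result = phase();
        stopwatch.Stop();
        report(label, stopwatch.Elapsed);
        return result;
    }
}
```

For example, wrapping File.ReadAllLines in one Measure call and the insertion loop in another shows whether the time goes to I/O or to building the links.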

Issues to consider:

  1. Storage format: Text? Binary?
  2. Persisted data: The entire structure of the trie (e.g. as XML) or just a list of words, relying on run-time code to push them into the right location in the data-structure? What's the markup to data ratio? How heavy is it to parse?
  3. Storage location: DB / flat-file / ...?
  4. Incremental loading: Possible?

One possible strategy: Create and persist a 'most common words' dictionary with the 1,000 (or so) most frequently used words. Load these words into the trie on start-up, then load the full dictionary on another thread, incrementally adding to the existing trie as new words are read.

  • Pros: User will see faster start-up time.
  • Cons: Might require cross-thread synchronization, and the user will see an incomplete trie until loading is fully complete. This may or may not be a showstopper depending on what the trie is being used for.
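
A minimal sketch of that strategy, assuming a simple lock is acceptable for synchronization. IncrementalWordSet is a hypothetical class, and a HashSet<string> stands in for the trie to keep the example short.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical wrapper: common words are available at once, the rest arrive in the background.
public sealed class IncrementalWordSet
{
    private readonly object _gate = new object();
    private readonly HashSet<string> _words = new HashSet<string>(); // stand-in for the trie

    // Completes when the full dictionary has been loaded.
    public Task FullLoad { get; }

    public IncrementalWordSet(IEnumerable<string> commonWords, IEnumerable<string> allWords)
    {
        // Load the small 'most common words' list synchronously.
        lock (_gate)
        {
            foreach (string word in commonWords) _words.Add(word);
        }

        // Load the full list on a thread-pool thread, one word at a time.
        FullLoad = Task.Run(() =>
        {
            foreach (string word in allWords)
            {
                lock (_gate) { _words.Add(word); }
            }
        });
    }

    public bool Contains(string word)
    {
        lock (_gate) { return _words.Contains(word); }
    }
}
```

Until FullLoad completes, Contains may return false for words that have not been added yet, which is the incomplete-trie drawback noted above.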