Asymptotically fast associative arrays with low memory requirements

Posted 2024-09-11 19:31:45

Ok, tries have been around for a while. A typical implementation gives you O(m) lookup, insert and delete operations independently of the size n of the data set, where m is the message length. However, in the worst case this same implementation takes up 256 words per input byte, because every node carries a full 256-entry array of child pointers.
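To make that space cost concrete, here is a minimal sketch (my illustration, not from the original post) of the naive byte-indexed trie being described; the 256-pointer child array in each node is where the 256-words-per-input-byte figure comes from:

```c
/* A minimal sketch, assuming byte-string keys: the "typical
 * implementation" above, where every node holds a full 256-entry
 * child-pointer array (256 words) whether or not it is used. */
#include <stdlib.h>

typedef struct trie_node {
    struct trie_node *child[256]; /* one word per possible byte value */
    int is_terminal;              /* marks the end of a stored key */
} trie_node;

/* O(m) insert: one array index per input byte, independent of n. */
static void trie_insert(trie_node *root, const unsigned char *key, size_t m) {
    for (size_t i = 0; i < m; i++) {
        if (!root->child[key[i]])
            root->child[key[i]] = calloc(1, sizeof(trie_node));
        root = root->child[key[i]];
    }
    root->is_terminal = 1;
}

/* O(m) lookup: fails as soon as a byte has no outgoing edge. */
static int trie_lookup(const trie_node *root, const unsigned char *key, size_t m) {
    for (size_t i = 0; i < m; i++) {
        root = root->child[key[i]];
        if (!root)
            return 0;
    }
    return root->is_terminal;
}
```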

Other data structures, notably hashing, give you expected O(m) lookup, insertion and deletion, with some implementations even providing constant time lookup. Nevertheless, in the worst case the routines either do not halt or take O(nm) time.
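To spell out where the O(nm) worst case comes from (a sketch of my own, not from the post): with separate chaining, every probe of a bucket costs an O(m) string comparison, and an adversarial input can put all n keys into one chain:

```c
/* Sketch: lookup in one bucket of a chained hash table. Expected
 * chains are short, so lookup is O(m); but if all n keys collide
 * into the same bucket, this loop runs n times at O(m) per strcmp,
 * giving the O(nm) worst case mentioned above. */
#include <string.h>

typedef struct entry {
    const char *key;
    struct entry *next;
} entry;

static int bucket_contains(const entry *head, const char *key) {
    for (const entry *e = head; e; e = e->next) /* up to n entries */
        if (strcmp(e->key, key) == 0)           /* O(m) per comparison */
            return 1;
    return 0;
}
```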

The question is, is there a data structure that provides O(m) lookup, insertion and deletion time while keeping a memory footprint comparable to hashing or search trees?

It might be appropriate to say that I am only interested in worst-case behaviour, both time- and space-wise.

Comments (4)

弥枳 2024-09-18 19:31:45

Did you try Patricia tries (also known as crit-bit or radix tries)? I think they solve the worst-case space issue.
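For illustration (my sketch, not the commenter's): a crit-bit/Patricia trie fixes the space blow-up because an internal node stores only the index of the distinguishing bit plus two children, and n stored keys need exactly n - 1 internal nodes, so the overhead is O(n) words regardless of alphabet size:

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of a crit-bit node: three words instead of 256. A trie over
 * n keys has exactly n - 1 of these internal nodes, so space is O(n)
 * independent of key length m and of the 256-way alphabet. */
typedef struct cb_node {
    uint32_t crit_bit;   /* position of the first bit where the two subtrees differ */
    uintptr_t child[2];  /* tagged pointers: internal cb_node or stored key */
} cb_node;

/* Bit extraction used during descent; bits past the end of the key
 * read as 0, so a key sorts before its own extensions. Descent visits
 * O(m) nodes, then does one final O(m) comparison at the leaf. */
static int key_bit(const unsigned char *key, size_t len, uint32_t pos) {
    size_t byte = pos >> 3;
    if (byte >= len)
        return 0;
    return (key[byte] >> (7 - (pos & 7))) & 1;
}
```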

已下线请稍等 2024-09-18 19:31:45

There is a structure known as a suffix array. I can't remember the research in this area, but I think they've gotten pretty darn close to O(m) lookup time with this structure, and it is much more compact than your typical tree-based indexing methods.
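As a concrete sketch (mine, not the answerer's) of what suffix-array lookup looks like: the array holds the n suffix start positions in sorted order, and a plain binary search with an O(m) comparison per probe gives O(m log n) lookup; the classic refinement with an LCP table brings this down to O(m + log n), which is the "pretty darn close to O(m)" bound:

```c
#include <stddef.h>
#include <string.h>

/* Sketch: does `pat` (length m) occur in `text`? `sa` holds the
 * starting offsets of text's n suffixes in lexicographic order.
 * Plain binary search: O(log n) probes at O(m) each. Space is just
 * n integers plus the text itself. */
static int sa_contains(const char *text, const size_t *sa, size_t n,
                       const char *pat, size_t m) {
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        int cmp = strncmp(pat, text + sa[mid], m); /* O(m) comparison */
        if (cmp == 0)
            return 1; /* pat is a prefix of this suffix */
        if (cmp < 0)
            hi = mid;
        else
            lo = mid + 1;
    }
    return 0;
}
```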

Dan Gusfield's book is the Bible of string algorithms.

烟若柳尘 2024-09-18 19:31:45

I don't think there is a reason to be worried about the worst case, for two reasons:

  1. You'll never have more total active branches in the sum of all trie nodes than the total size of the stored data.
  2. The only time the node size becomes an issue is if there is huge fan-out in the data you're sorting/storing. Mnemonics would be an example of that. If you're relying on the trie as a compression mechanism, then a hash table would do no better for you.

If you need to compress and you have few/no common subsequences, then you need to design a compression algorithm based on the specific shape of the data rather than based on generic assumptions about strings. For example, in the case of a fully/highly populated mnemonic data set, a data structure that tracked the "holes" in the data rather than the populated data might be more efficient.

That said, it might pay for you to avoid a fixed trie node size if you have moderate fan-out. You could make each node of the trie a hash table: start with a small size and grow it as elements are inserted. Worst-case insertion would then be O(cm), hit when every hash table along the path has to be reorganized as it grows, where c is the number of possible characters / unique atomic elements.
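A minimal sketch of that suggestion (my own, assuming byte-string keys): each node carries a small open-addressing table of (byte, child) slots that doubles as it fills, so memory tracks the node's actual fan-out, and a growth step rehashes at most c children:

```c
#include <stdlib.h>

typedef struct node node;
typedef struct slot { int byte; node *child; } slot; /* byte < 0 means empty */
struct node {
    slot *slots;   /* small open-addressing table of children */
    int cap, used; /* grows from 4 up to at most 256 live slots */
};

static void node_put(node *n, int byte, node *child);

/* Double the child table and rehash; this reorganization is what
 * contributes the c factor to the O(cm) worst-case insert. */
static void node_grow(node *n) {
    slot *old = n->slots;
    int oldcap = n->cap;
    n->cap = n->cap ? 2 * n->cap : 4; /* start tiny, not at 256 */
    n->slots = malloc(n->cap * sizeof(slot));
    for (int i = 0; i < n->cap; i++)
        n->slots[i].byte = -1;
    n->used = 0;
    for (int i = 0; i < oldcap; i++)
        if (old[i].byte >= 0)
            node_put(n, old[i].byte, old[i].child);
    free(old);
}

/* Insert/overwrite one child edge, keeping the table at most half full. */
static void node_put(node *n, int byte, node *child) {
    if (2 * (n->used + 1) > n->cap)
        node_grow(n);
    int i = byte % n->cap;                    /* trivial hash: the byte itself */
    while (n->slots[i].byte >= 0 && n->slots[i].byte != byte)
        i = (i + 1) % n->cap;                 /* linear probing */
    if (n->slots[i].byte < 0)
        n->used++;
    n->slots[i].byte = byte;
    n->slots[i].child = child;
}
```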

牵你手 2024-09-18 19:31:45

In my experience there are three implementations that I think could meet your requirement:

You can see the benchmark here. They are as fast as hash tables, but with a lower memory requirement and better worst-case behaviour.
