Simple space-efficient implementation of an associative collection in C?
I am looking for an associative collection that supports both retrieval and insertion of values by key (deletion is not important) in O(log N) time or better, and that has very low overhead both in terms of code size and run-time memory consumption.
I am doing this for a small embedded application written in C, so I am trying to minimize the amount of code required, and the amount of memory consumed.
The Google sparse hash data structure would be a possibility if it wasn't written in C++, and was simpler.
Most hash table implementations that I am aware of use a fair amount of extra space, requiring at least twice as much space as the key-value data itself, or else requiring extra pointers per entry (e.g. bucket-chaining hash algorithms). In my structure, key-value pairs are just two pointers.
Currently I am using an array of key/value pairs which is sorted, but the insertion is O(N). I can't help but think there must be a clever way to improve the amortized running time of insertion, for example by doing the insertions in groups, but I am not having any success.
I think that this must be a relatively well-known problem in certain circles, so to make this not too subjective, I'm wondering what the most common solution to the problem stated above is?
[EDIT:]
Some additional information that could be relevant:
- Keys are integers
- Number of values could be anywhere from 1 to 2^32.
- Usage patterns are unpredictable.
- I am hoping to keep memory consumption as low as possible (e.g. doubling the memory required would not be ideal)
3 Answers
Look at binary search trees, and to overcome the worst case (which has O(n) complexity for both search and insert) use a balanced tree.
You could use a hash table that doesn't use chaining, such as a linear probing or cuckoo hashing scheme. The backing implementation is just an array; with a load factor of around 0.5 the overhead won't be too bad, and the implementation complexity (at least for linear or quadratic probing) isn't too high.
If you want a good implementation of a binary search tree that has excellent performance guarantees and isn't too hard to code up, consider looking into splay trees. They guarantee amortized O(lg n) lookups, and require just two pointers per node. The balance step is also substantially easier than in most balanced BSTs.
I'd probably use a hash table with double hashing to resolve collisions. The general idea is to hash your original value, and if that collides do a second hash that gives a step value you'll use in walking through the array to find a place to put the value. This makes quite good use of memory as it has no overhead for pointers, and retains reasonable efficiency at much higher load factors than linear probing.
Edit: If you want a variation of what you're doing right now, one possibility is to handle insertions in clusters: keep a sorted array, and a separate collection of new insertions. When the collection of new insertions gets too large, merge those items into the main collection.
For the secondary collection you have a couple of choices. You can just use an unsorted array and do a linear search -- and limit its size to (say) log(M), where M is the size of the main array. In this case, an overall search remains O(log N), imposes no memory overhead, and keeps most insertions quite fast. When you do merge the collections together, you (normally) want to sort the secondary collection, then merge with the primary. This lets you amortize the linear merge over the number of items in the secondary collection.
Alternatively, you can use a tree for your secondary collection. This means newly inserted items use extra storage for pointers, but (again) keeping that size small limits the overhead.