What aspects should be considered when designing a hash_table?
I have some candidate aspects:
- The hash function is important; hash codes should be as unique as possible so that collisions stay rare (see the sketch after this list).
- The backend data structure is important; the search, insert, and delete operations should all have O(1) expected time complexity.
- Memory management is important; the per-entry overhead of the hash_table should be as small as possible. When the hash_table expands, memory should be acquired efficiently, and when it shrinks, memory should be garbage-collected efficiently. These memory operations must also preserve aspect 2 (the O(1) bounds).
- If the hash_table will be used from multiple threads, it should be thread-safe while remaining efficient.
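To make the first aspect concrete, here is a minimal sketch of what I mean, using the well-known FNV-1a byte-wise hash as a common example (not a recommendation). Note that even a perfectly unique 64-bit hash code still collides once it is reduced modulo the bucket count, so the backend must handle collisions regardless:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// FNV-1a: a simple, well-known byte-wise hash. The two constants are the
// standard 64-bit FNV offset basis and FNV prime.
std::uint64_t fnv1a(const std::string& key) {
    std::uint64_t h = 14695981039346656037ULL; // offset basis
    for (unsigned char c : key) {
        h ^= c;                // mix in one byte
        h *= 1099511628211ULL; // multiply by the FNV prime
    }
    return h;
}

// The table only uses the code modulo the bucket count, so distinct hash
// codes can still land in the same bucket.
std::size_t bucket_index(std::uint64_t hash, std::size_t bucket_count) {
    return hash % bucket_count;
}
```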
My questions are:
- Are there any more aspects worth paying attention to?
- How should the hash_table be designed to fulfill these aspects?
- Are there any resources I can refer to?
Many thanks!
After reading some material, I'm updating my question. :)
In a book explaining the source code of the SGI STL, I found some useful information:
- The backend data structure is an array of buckets of linked lists. When searching for, inserting, or deleting an element in the hash_table:
- Use the hash function to compute the corresponding position in the bucket array; the elements are stored in the linked list hanging off that position.
- When the number of elements exceeds the number of buckets, the bucket array needs to be resized: expand it to roughly 2 times the old size, keeping the bucket count prime, then rehash the old buckets and elements into the new ones (see the sketch after this list).
- I didn't find any logic for garbage-collecting buckets when the number of elements becomes much smaller than the number of buckets. But I think this logic should be considered for workloads with many inserts at first and many deletes later.
- Other data structures, such as arrays with linear probing or quadratic probing, are not as good as linked lists.
- A good hash function can avoid clusters, and double hashing can help resolve them.
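Here is a minimal sketch of the separate-chaining scheme as I understand it from the book: an array of buckets holding singly linked lists, with a prime bucket count that roughly doubles on resize. All the names are mine, and the real SGI STL hashtable (which keeps a static table of pre-computed primes) is more elaborate:

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Minimal separate-chaining hash set (keys only). Illustrative only;
// this is not the SGI STL code.
template <typename Key, typename Hash = std::hash<Key>>
class ChainedSet {
    struct Node { Key key; Node* next; };
    std::vector<Node*> buckets_;
    std::size_t size_ = 0;
    Hash hash_;

    // The first few primes from the SGI STL's table, each roughly
    // double the previous one.
    static std::size_t next_prime(std::size_t n) {
        static const std::size_t primes[] = {53, 97, 193, 389, 769, 1543, 3079};
        for (std::size_t p : primes) if (p > n) return p;
        return n * 2 + 1; // fallback for this sketch; not guaranteed prime
    }

    std::size_t index(const Key& k, std::size_t n) const { return hash_(k) % n; }

    void rehash(std::size_t new_count) {
        std::vector<Node*> fresh(new_count, nullptr);
        for (Node* head : buckets_) {
            while (head) {                   // relink every node into its
                Node* next = head->next;     // new bucket; the nodes are
                std::size_t i = index(head->key, new_count); // reused, not copied
                head->next = fresh[i];
                fresh[i] = head;
                head = next;
            }
        }
        buckets_.swap(fresh);
    }

public:
    ChainedSet() : buckets_(next_prime(0), nullptr) {}
    ChainedSet(const ChainedSet&) = delete;            // copying omitted
    ChainedSet& operator=(const ChainedSet&) = delete; // in this sketch

    bool contains(const Key& k) const {
        for (Node* p = buckets_[index(k, buckets_.size())]; p; p = p->next)
            if (p->key == k) return true;
        return false;
    }

    void insert(const Key& k) {
        if (contains(k)) return;
        // Resize when elements would outnumber buckets, as described above.
        if (size_ + 1 > buckets_.size()) rehash(next_prime(buckets_.size()));
        std::size_t i = index(k, buckets_.size());
        buckets_[i] = new Node{k, buckets_[i]};
        ++size_;
    }

    ~ChainedSet() {
        for (Node* head : buckets_)
            while (head) { Node* n = head->next; delete head; head = n; }
    }
};
```

And since the book contrasts this with open addressing: double hashing derives the probe step from a second hash, so keys that collide in their first slot still follow different paths through the array. A tiny sketch (illustrative names, assumes a prime table size):

```cpp
#include <cstddef>
#include <cstdint>

// Double hashing: compute slot i of the probe sequence for a key with
// hashes h1 and h2. The step is in [1, table_size - 1], so when the
// table size is prime the sequence visits every slot.
std::size_t probe(std::uint64_t h1, std::uint64_t h2,
                  std::size_t i, std::size_t table_size) {
    std::uint64_t step = 1 + h2 % (table_size - 1); // never 0
    return (h1 + i * step) % table_size;
}
```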
The question about multi_threads is still open. :D
Comments (2)
There are two (slightly) orthogonal concerns.
While the hash function is obviously important, in general you separate the design of the backend from the design of the hash function:
For hash functions, I would suggest reading about CityHash or MurmurHash (with an explanation on SO).
For the back-end, there are various concerns, as you noted. Some remarks:
You never talked about the amount of data you're about to handle. Writers can update different buckets without interfering with one another, so if you have a lot of data, you can try to spread it around to avoid contention.
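One standard way to exploit that is lock striping: split the table into shards, each guarded by its own mutex, so writers that hash to different shards never block each other. A minimal sketch (my own names, not any particular library's API):

```cpp
#include <array>
#include <cstddef>
#include <functional>
#include <mutex>
#include <string>
#include <unordered_map>

// Lock-striping sketch: the table is split into independent shards, each
// with its own mutex, so writers touching different shards never contend.
class StripedMap {
    static constexpr std::size_t kStripes = 16; // tune to core count / contention
    mutable std::array<std::mutex, kStripes> locks_;
    std::array<std::unordered_map<std::string, int>, kStripes> shards_;

    static std::size_t stripe(const std::string& key) {
        return std::hash<std::string>{}(key) % kStripes;
    }

public:
    void put(const std::string& key, int value) {
        const std::size_t s = stripe(key);
        std::lock_guard<std::mutex> guard(locks_[s]); // lock one stripe only
        shards_[s][key] = value;
    }

    bool get(const std::string& key, int& out) const {
        const std::size_t s = stripe(key);
        std::lock_guard<std::mutex> guard(locks_[s]);
        auto it = shards_[s].find(key);
        if (it == shards_[s].end()) return false;
        out = it->second;
        return true;
    }
};
```

This trades a little memory (one mutex per stripe) for write concurrency; fully non-blocking designs, like the one linked in the other answer, avoid locks entirely but are much harder to get right.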
I suggest you read http://www.azulsystems.com/blog/cliff/2007-03-26-non-blocking-hashtable
The link points to a blog by Cliff Click which has an entry on hash functions. Some of his conclusions are: