What aspects should be considered when designing a hash_table?

Published 2024-11-05 20:31:09



I have some candidate aspects:

  1. The hash function is important: the hash codes should be as unique as possible.
  2. The backend data structure is important: the search, insert and delete operations should all have O(1) time complexity.
  3. Memory management is important: the per-entry memory overhead of the hash_table should be as small as possible. When the hash_table expands, memory should grow efficiently, and when it shrinks, memory should be reclaimed efficiently. These memory operations should also preserve aspect 2.
  4. If the hash_table will be used from multiple threads, it should be thread safe and still efficient.

My questions are:

  1. Are there any more aspects worth attention?
  2. How should the hash_table be designed to fulfill these aspects?
  3. Are there any resources I can refer to?

Many thanks!

After reading some material, I'm updating my question. :)


In a book explaining the source code of SGI STL, I found some useful information:

  1. The backend data structure is an array of buckets, each holding a linked list. When searching for, inserting or deleting an element in the hash_table:
    1. Use the hash function to compute the element's bucket position; the element is stored in the linked list hanging off that bucket.
    2. When the number of elements grows larger than the number of buckets, the buckets are resized: the bucket count is roughly doubled (and should be a prime), and the old buckets and elements are copied into the new ones.
    3. I didn't find any garbage-collection logic for when the number of elements is much smaller than the number of buckets. But I think this logic should be considered for workloads with many inserts first and many deletes later.
  2. Other data structures, such as arrays with linear or quadratic probing, are not as good as linked lists.
  3. A good hash function can avoid clustering, and double hashing can help resolve it.

The question about multi-threading is still open. :D
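The chaining scheme described above can be sketched as follows. This is a minimal illustration, not SGI STL's actual code: the identifiers are invented, the table allows duplicate keys (insert_equal-style), and growth uses 2n+1 rather than SGI's precomputed prime list.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <list>
#include <utility>
#include <vector>

// Minimal separate-chaining hash table: a vector of buckets, each a
// linked list, resized when the element count exceeds the bucket count.
template <typename K, typename V>
class ChainedHashTable {
    std::vector<std::list<std::pair<K, V>>> buckets_;
    std::size_t size_ = 0;

    std::size_t index(const K& key, std::size_t nbuckets) const {
        return std::hash<K>{}(key) % nbuckets;   // modulo the bucket count
    }

    void rehash(std::size_t new_count) {
        std::vector<std::list<std::pair<K, V>>> fresh(new_count);
        for (auto& chain : buckets_)
            for (auto& kv : chain)
                fresh[index(kv.first, new_count)].push_back(std::move(kv));
        buckets_.swap(fresh);
    }

public:
    ChainedHashTable() : buckets_(11) {}   // small prime to start

    void insert(const K& key, const V& value) {
        if (++size_ > buckets_.size())       // load factor > 1: grow
            rehash(buckets_.size() * 2 + 1); // roughly doubling; SGI uses a prime table
        buckets_[index(key, buckets_.size())].emplace_back(key, value);
    }

    V* find(const K& key) {
        for (auto& kv : buckets_[index(key, buckets_.size())])
            if (kv.first == key) return &kv.second;
        return nullptr;
    }

    bool erase(const K& key) {
        auto& chain = buckets_[index(key, buckets_.size())];
        for (auto it = chain.begin(); it != chain.end(); ++it)
            if (it->first == key) { chain.erase(it); --size_; return true; }
        return false;
    }

    std::size_t size() const { return size_; }
};
```

Note that this sketch only grows; the shrink-and-reclaim logic the question asks about would mirror rehash() with a smaller bucket count when the load factor drops below some threshold.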



掩于岁月 2024-11-12 20:31:09


There are two (slightly) orthogonal concerns.

While the hash function is obviously important, in general you separate the design of the backend from the design of the hash function:

  • the hash function depends on the data to be stored
  • the backend depends on the requirements of the storage

For hash functions, I would suggest reading about CityHash or MurmurHash (with an explanation on SO).
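For a flavor of what such hash functions do, here is FNV-1a, a far simpler function than CityHash or MurmurHash, shown only to illustrate the core idea of folding every input byte into the running state. It is not a substitute for the functions named above; the constants are the standard 64-bit FNV values.

```cpp
#include <cstdint>
#include <string>

// FNV-1a (64-bit): XOR each byte into the state, then multiply by the
// FNV prime to scatter the bits across the word.
inline std::uint64_t fnv1a(const std::string& data) {
    std::uint64_t h = 14695981039346656037ull;   // FNV offset basis
    for (unsigned char c : data) {
        h ^= c;                  // fold the byte in
        h *= 1099511628211ull;   // FNV prime spreads its influence
    }
    return h;
}
```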

For the backend, there are various concerns, as you noted. Some remarks:

  • Are we talking about average or worst-case complexity? Without perfect hashing, achieving O(1) is nigh-impossible as far as I know, though the worst case's frequency and cost can be considerably dampened.
  • Are we talking about amortized complexity? Amortized schemes generally offer better throughput at the cost of "spikes". Linear rehashing, at the cost of slightly lower throughput, will give you a smoother curve.
  • With regard to multi-threading, note that the read/write pattern may impact the solution. Considering extreme cases, 1 producer and 99 readers is very different from 99 producers and 1 reader. In general, writes are harder to parallelize because they may require modifying the structure; at worst, they might require serialization.
  • Garbage collection is pretty trivial in the amortized case; with linear rehashing it is a bit more complicated, but probably the least challenging portion.

You never talked about the amount of data you're about to use. Writers can update different buckets without interfering with one another, so if you have a lot of data, you can try to spread them around to avoid contention.
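One common way to spread writers around, as suggested above, is lock striping: one mutex per group of buckets rather than a single global lock. The sketch below is illustrative only; the stripe count of 16 and the wrapped std::unordered_map per stripe are assumptions, not anything from the answer.

```cpp
#include <array>
#include <cstddef>
#include <functional>
#include <mutex>
#include <unordered_map>

// Lock-striped map: writers touching different stripes never contend.
template <typename K, typename V, std::size_t Stripes = 16>
class StripedMap {
    struct Stripe {
        std::mutex m;
        std::unordered_map<K, V> map;
    };
    std::array<Stripe, Stripes> stripes_;

    Stripe& stripe_for(const K& key) {
        return stripes_[std::hash<K>{}(key) % Stripes];
    }

public:
    void insert(const K& key, const V& value) {
        Stripe& s = stripe_for(key);
        std::lock_guard<std::mutex> lock(s.m);  // contention limited to one stripe
        s.map[key] = value;
    }

    bool find(const K& key, V& out) {
        Stripe& s = stripe_for(key);
        std::lock_guard<std::mutex> lock(s.m);
        auto it = s.map.find(key);
        if (it == s.map.end()) return false;
        out = it->second;
        return true;
    }
};
```

A caveat: striping helps point operations, but a global resize still needs to coordinate across all stripes, which is exactly the "writes may require modifying the structure" problem mentioned above.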

References:

  • The article on Wikipedia exposes lots of different implementations; it is always good to peek at the variety.
  • This GoogleTalk by Dr. Cliff Click (Azul Systems) shows a hash table designed for heavily multi-threaded systems, in Java.
哎呦我呸! 2024-11-12 20:31:09


I suggest you read http://www.azulsystems.com/blog/cliff/2007-03-26-non-blocking-hashtable

The link points to a blog by Cliff Click which has an entry on hash functions. Some of his conclusions are:

  • To go from hash to index, use binary AND instead of modulo a prime. This is many times faster. Your table size must be a power of two.
  • For hash collisions don't use a linked list, store the values in the table to improve cache performance.
  • By using a state machine you can get a very fast multi-thread implementation. In his blog entry he lists the states in the state machine, but due to license problems he does not provide source code.
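The first two points combine into the open-addressing sketch below: a power-of-two table indexed with a binary AND, values stored inline for cache locality. This is only an illustration of those two points, not Cliff Click's design: it uses plain linear probing, a fixed capacity, no deletion, and assumes the table is never filled (a full table would make insert loop forever).

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Open addressing: entries live inline in the slot array, and the
// power-of-two capacity lets us mask instead of taking a modulo.
template <typename K, typename V>
class OpenTable {
    struct Slot { bool used = false; K key{}; V value{}; };
    std::vector<Slot> slots_;
    std::size_t mask_;

public:
    explicit OpenTable(std::size_t pow2_capacity)
        : slots_(pow2_capacity), mask_(pow2_capacity - 1) {}

    void insert(const K& key, const V& value) {
        std::size_t i = std::hash<K>{}(key) & mask_;   // AND, not modulo a prime
        while (slots_[i].used && !(slots_[i].key == key))
            i = (i + 1) & mask_;                       // linear probe, wraps around
        slots_[i].used = true;
        slots_[i].key = key;
        slots_[i].value = value;
    }

    const V* find(const K& key) const {
        std::size_t i = std::hash<K>{}(key) & mask_;
        while (slots_[i].used) {
            if (slots_[i].key == key) return &slots_[i].value;
            i = (i + 1) & mask_;
        }
        return nullptr;
    }
};
```

Probing stays within the same contiguous array, which is where the cache-performance advantage over chained linked lists comes from.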