If consistent hashing is so effective, why don't people use it everywhere?
I was asked about the shortcomings of consistent hashing, but I think it only costs a little more than a traditional hash%N scheme. As the title says, if consistent hashing is so good, why don't we just use it everywhere?
Do you know of more drawbacks? Can anyone point some out?
Comments (5)
Implementing consistent hashing is not trivial, and in many cases your hash table rarely or never needs remapping, or can remap fast enough that it doesn't matter.
The only substantial shortcoming of consistent hashing I'm aware of is that implementing it is more complicated than simple hashing. More code means more places to introduce a bug, but there are freely available options out there now.
Technically, consistent hashing consumes a bit more CPU: consulting a sorted list to determine which server an object maps to is an O(log n) operation, where n is the number of servers times the number of slots per server, while simple hashing is O(1).
In practice, though, O(log n) is so fast it doesn't matter. (E.g., 8 servers × 1024 slots per server = 8192 entries, so log2(8192) = 13 comparisons at most in the worst case.) The original authors tested it and found that computing the cache server with consistent hashing took only 20 microseconds in their setup. Likewise, consistent hashing consumes space to store the sorted list of server slots, while simple hashing takes no extra space, but the amount required is minuscule, on the order of kilobytes.
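To make the sorted-list lookup concrete, here's a minimal Python sketch of the approach described above. The class, node names, and slot count are purely illustrative (not taken from the original paper); the point is just that placement is a binary search over one combined list of virtual slots.

```python
import bisect
import hashlib

def _point(key: str) -> int:
    # Map a string to a position on the ring using MD5 (any stable hash works).
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes, slots_per_node=1024):
        # Every physical node contributes many virtual slots, all kept in one sorted list.
        self._ring = sorted(
            (_point(f"{node}#{i}"), node)
            for node in nodes
            for i in range(slots_per_node)
        )
        self._points = [p for p, _ in self._ring]

    def get_node(self, key: str) -> str:
        # Binary search for the first slot clockwise of the key's position:
        # O(log(servers * slots_per_server)) comparisons, roughly 13 for 8 * 1024 slots.
        idx = bisect.bisect(self._points, _point(key)) % len(self._ring)
        return self._ring[idx][1]

# Hypothetical cluster of 8 cache servers.
ring = ConsistentHashRing([f"cache{i}" for i in range(8)])
print(ring.get_node("user:42"))
```

Adding or removing one server only remaps the keys whose nearest clockwise slot belonged to that server, which is exactly the property hash%N lacks.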
Why is it not better known? If I had to guess, I would say it's only because it can take time for academic ideas to propagate out into industry. (The original paper was written in 1997.)
I assume you're talking about hash tables specifically, since you mention mod N. Please correct me if I'm wrong in that assumption, as hashes are used for all sorts of different things.
The reason is that consistent hashing doesn't really solve a problem that hash tables pressingly need to solve. On a rehash, a hash table probably needs to reassign a very large fraction of its elements no matter what, possibly a majority of them. This is because we're usually rehashing to increase the size of the table, which is done geometrically; it's very typical, for instance, to double the number of nodes once the table starts to get too full.
So in consistent hashing terms, we're not just adding a node; we're doubling the number of nodes. That means that, one way or another, even in the best case we're moving half of the elements. Sure, a consistent hashing technique could cut down on the moves and try to approach this ideal, but the best-case improvement is only a constant factor of 2x, which doesn't change our overall complexity.
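As a quick sanity check of that "half the elements move anyway" point, here is a small illustration of my own (not from the answer): it counts how many random 64-bit hashes change bucket when a hash % N table doubles to 2N buckets.

```python
import random

def moved_fraction(num_keys=100_000, n=1024, seed=0):
    # Fraction of keys whose bucket changes when a hash % N table doubles to 2N buckets.
    rng = random.Random(seed)
    hashes = [rng.getrandbits(64) for _ in range(num_keys)]
    moved = sum(1 for h in hashes if h % n != h % (2 * n))
    return moved / num_keys

# Prints roughly 0.5: plain modulo already hits the ~50% floor for a doubling
# resize, since the new half of the table has to be populated from somewhere.
print(f"{moved_fraction():.3f}")
```

Plain modulo already achieves roughly the 50% floor for a doubling resize, so there is little left for consistent hashing to save in this scenario.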
Approaching from the other end, hash tables are all about cache performance, in most applications. All interest in making them go fast is on computing stuff as quickly as possible, touching as little memory as possible. Adding consistent hashing is probably going to be more than a 2x slowdown, no matter how you look at this; ultimately, consistent hashing is going to be worse.
Finally, this entire issue is sort of unimportant from another angle. We want rehashing to be fast, but it's much more important that we don't rehash at all. In any normal practical scenario, when a programmer sees he's having a problem due to rehashing, the correct answer is nearly always to find a way to avoid (or at least limit) the rehashing, by choosing an appropriate size to begin with. Given that this is the typical scenario, maintaining a fairly substantial side-structure for something that shouldn't even be happening is obviously not a win, and again, makes us overall slower.
Nearly all of the optimization effort on hash tables is either in how to calculate the hash faster, or how to perform collision resolution faster. These are things that happen on a much smaller time scale than we're talking about for consistent hashing, which is usually used where we're talking about time scales measured in microseconds or even milliseconds because we have to do I/O operations.
I will also add my 5 cents.
The reason is that consistent hashing tends to cause more work on the read side for range-scan queries.
For example, if you want to search for entries sorted by a particular column, you'd need to send the query to EVERY node, because consistent hashing will place even "adjacent" items on separate nodes.
It's often preferable to instead use a partitioning scheme that matches the usage patterns. Better yet, replicate the same data across a number of different partitions/formats.
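To illustrate the scatter effect described above, here is a toy comparison (node names and key format are hypothetical); plain hashed placement stands in for the consistent-hash ring, since the effect on key ordering is the same.

```python
import bisect
import hashlib

# Hypothetical 4-node cluster; names are made up for illustration.
NODES = ["node-a", "node-b", "node-c", "node-d"]

def hash_placement(key: str) -> str:
    # Hash-based placement: key order is deliberately destroyed.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return NODES[h % len(NODES)]

def range_placement(key: str) -> str:
    # Range partitioning on the sort key: each node owns a contiguous slice.
    bounds = ["order:0025", "order:0050", "order:0075"]  # illustrative split points
    return NODES[bisect.bisect(bounds, key)]

keys = [f"order:{i:04d}" for i in range(10)]  # ten adjacent keys
print({k: hash_placement(k) for k in keys})   # scattered across all nodes
print({k: range_placement(k) for k in keys})  # all land on node-a
```

With range partitioning, the ten adjacent keys all land on a single node, so a range scan touches one node; with hashed placement, the same scan has to fan out to every node.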