If consistent hashing is so effective, why don't people use it everywhere?

Posted on 2024-11-17 09:34:16


I was asked about some shortcomings of consistent hashing. But as far as I can tell, it just costs a little more than a traditional hash % N scheme. As the title says: if consistent hashing is so good, why don't we just use it everywhere?

Does anyone know more? Can someone point out its drawbacks?
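
To make the premise concrete, here is a minimal sketch (plain Python, made-up key names) of what hash % N costs you on a resize: going from 4 to 5 servers remaps about 80% of the keys, whereas a consistent-hash ring would move only the ~20% that the new server takes over.

```python
import hashlib

def h(key: str) -> int:
    # Stable hash so results are reproducible (Python's built-in hash() is salted per run).
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

keys = [f"key-{i}" for i in range(10_000)]

# Going from 4 to 5 servers with hash % N: a key stays put only if
# hash % 4 == hash % 5, which holds for roughly 1 key in 5.
before = {k: h(k) % 4 for k in keys}
after = {k: h(k) % 5 for k in keys}
moved = sum(before[k] != after[k] for k in keys)
print(f"hash % N remapped {moved / len(keys):.0%} of keys")  # ~80%
```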


Comments (5)

山有枢 2024-11-24 09:34:16

Implementing consistent hashing is not trivial, and in many cases you have a hash table that rarely or never needs remapping, or that can remap rather quickly.

所谓喜欢 2024-11-24 09:34:16

The only substantial shortcoming of consistent hashing I'm aware of is that implementing it is more complicated than simple hashing. More code means more places to introduce a bug, but there are freely available options out there now.

Technically, consistent hashing consumes a bit more CPU: consulting a sorted list to determine which server an object maps to is an O(log n) operation, where n is the number of servers × the number of slots per server, while simple hashing is O(1).

In practice, though, O(log n) is so fast it doesn't matter. (E.g., 8 servers × 1024 slots per server = 8192 entries, and log2(8192) = 13 comparisons in the worst case.) The original authors tested it and found that computing the cache server with consistent hashing took only 20 microseconds in their setup. Likewise, consistent hashing consumes space to store the sorted list of server slots, while simple hashing needs none, but the amount required is minuscule, on the order of kilobytes.

Why is it not better known? If I had to guess, I would say it's only because it can take time for academic ideas to propagate out into industry. (The original paper was written in 1997.)
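
A minimal sketch of the structure this answer describes, assuming made-up server names: a sorted list of (slot hash, server) pairs with one binary search per lookup.

```python
import bisect
import hashlib

def h(key: str) -> int:
    # Stable, well-distributed hash (Python's built-in hash() is salted per process).
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, servers, slots_per_server=1024):
        # Each server owns many pseudo-random "slots" (virtual nodes) on the ring.
        self._ring = sorted(
            (h(f"{server}#{i}"), server)
            for server in servers
            for i in range(slots_per_server)
        )
        self._hashes = [slot_hash for slot_hash, _ in self._ring]

    def server_for(self, key: str) -> str:
        # O(log n) binary search for the first slot clockwise of the key's hash.
        i = bisect.bisect(self._hashes, h(key)) % len(self._ring)
        return self._ring[i][1]

ring = ConsistentHashRing([f"server-{i}" for i in range(8)])
print(ring.server_for("user:42"))
```

With 8 servers × 1024 slots this is exactly the 8192-entry sorted list from the example above; each lookup costs one hash plus at most 13 comparisons.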

﹏雨一样淡蓝的深情 2024-11-24 09:34:16

I assume you're talking about hash tables specifically, since you mention mod N. Please correct me if I'm wrong in that assumption, as hashes are used for all sorts of different things.

The reason is that consistent hashing doesn't really solve a problem that hash tables pressingly need to solve. On a rehash, a hash table probably needs to reassign a very large fraction of its elements no matter what, possibly a majority of them. This is because we're probably rehashing to increase the size of our table, which is usually done geometrically; it's very typical, for instance, to double the number of nodes once the table starts to get too full.

So in consistent hashing terms, we're not just adding a node; we're doubling the number of nodes. That means, one way or another, in the best case we're moving half of the elements. Sure, a consistent hashing technique could cut down on the moves and approach this ideal, but the best-case improvement is only a constant factor of 2x, which doesn't change our overall complexity.
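
A quick empirical check of that doubling argument (plain Python, arbitrary bucket counts): a key keeps its bucket after doubling only if the next bit of its hash is 0, so about half of all keys move no matter how cleverly they are assigned.

```python
import hashlib

def h(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

keys = [f"key-{i}" for i in range(100_000)]
# After doubling, a key stays put only when hash % 32 == hash % 16,
# i.e. when bit 4 of its hash is 0 -- true for about half of the keys.
moved = sum(h(k) % 16 != h(k) % 32 for k in keys)
print(f"doubling 16 -> 32 buckets moved {moved / len(keys):.0%} of keys")  # ~50%
```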

Approaching from the other end, hash tables are all about cache performance, in most applications. All interest in making them go fast is on computing stuff as quickly as possible, touching as little memory as possible. Adding consistent hashing is probably going to be more than a 2x slowdown, no matter how you look at this; ultimately, consistent hashing is going to be worse.

Finally, this entire issue is sort of unimportant from another angle. We want rehashing to be fast, but it's much more important that we don't rehash at all. In any normal practical scenario, when a programmer sees he's having a problem due to rehashing, the correct answer is nearly always to find a way to avoid (or at least limit) the rehashing, by choosing an appropriate size to begin with. Given that this is the typical scenario, maintaining a fairly substantial side-structure for something that shouldn't even be happening is obviously not a win, and again, makes us overall slower.

Nearly all of the optimization effort on hash tables is either in how to calculate the hash faster, or how to perform collision resolution faster. These are things that happen on a much smaller time scale than we're talking about for consistent hashing, which is usually used where we're talking about time scales measured in microseconds or even milliseconds because we have to do I/O operations.

π浅易 2024-11-24 09:34:16

I'll also add my five cents.

  1. More complex implementation.
  2. How will it handle the noisy-neighbor problem? All requests for a hot key still map to the same server.
  3. Snowball crash effect (see the sketch below): if one server fails, say under too many heavy requests, they all go to the next server on the ring, which already has its own load and may now crash too. Then the requests from s1 and s2 are all sent to s3.
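
A small simulation of item 3 under the simplifying assumption of one slot per server (no virtual nodes; s1..s4 are hypothetical names): everything s1 owned lands on a single successor, which then carries its own share plus all of s1's.

```python
import bisect
import hashlib

def h(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def owner(key: str, servers: list) -> str:
    # One slot per server: a key belongs to the first server clockwise of its hash.
    ring = sorted((h(s), s) for s in servers)
    hashes = [slot for slot, _ in ring]
    return ring[bisect.bisect(hashes, h(key)) % len(ring)][1]

servers = ["s1", "s2", "s3", "s4"]
keys = [f"key-{i}" for i in range(10_000)]

before = {k: owner(k, servers) for k in keys}
after = {k: owner(k, [s for s in servers if s != "s1"]) for k in keys}

# All of s1's keys move to exactly one surviving server, piling load onto it.
print({after[k] for k in keys if before[k] == "s1"})  # a single name, e.g. {'s3'}
```

Virtual nodes (many slots per server) are the usual mitigation: the failed server's keys then spread across all survivors instead of a single one.
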
梦罢 2024-11-24 09:34:16

The reason is that consistent hashing tends to cause more work on the read side for range-scan queries.

For example, if you want to search for entries sorted by a particular column, then you'd need to send the query to EVERY node, because consistent hashing will place even "adjacent" items on separate nodes.

It's often preferable to instead use a partitioning scheme that matches the usage patterns. Better yet, replicate the same data across a number of different partitions/formats.
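
A quick illustration of the scatter (hypothetical ids; plain hash partitioning shown for brevity, a consistent-hash ring scatters the same way): consecutive ids land on unrelated nodes, so a scan over ids 100..104 has to fan out to every node.

```python
import hashlib

NODES = 4
for i in range(100, 105):
    # Adjacent ids hash to unrelated positions, hence unrelated nodes.
    node = int(hashlib.md5(str(i).encode()).hexdigest(), 16) % NODES
    print(f"id={i} -> node {node}")
```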
