了解一致性哈希

发布于 2024-11-15 05:30:54 字数 2310 浏览 2 评论 0原文

在过去的几天里，我一直在研究 PHP 的一致哈希算法。我希望更好地了解一致性哈希的实际工作原理，以便我可以在一些即将进行的项目中使用它。在我看来，Flexihash确实是唯一易于理解的纯PHP实现，所以我从中做了一些笔记。

我创建了自己的一个小算法，试图了解它是如何工作的，以及如何让它尽可能快地工作。我对我的算法与 Flexihash 相比的速度感到惊讶，这让我想知道我的实现是否在某种程度上存在缺陷，或者也许我没有掌握整个概念的关键部分。

下面列出了 100 万个顺序键（0 到 1,000,000）的迭代中的速度差异。显示每个节点以显示实际散列到该特定节点的键数。

Flexihash:
 Time: 269 seconds
Speed: 3714 hashes/sec

 node1: 80729
 node2: 88390
 node3: 103623
 node4: 112338
 node5: 79893
 node6: 85377
 node7: 80966
 node8: 134462
 node9: 117046
node10: 117176

My implementation:
 Time: 3 seconds
Speed: 265589 hashes/sec

 node1: 100152
 node2: 100018
 node3: 99884
 node4: 99889
 node5: 100057
 node6: 100030
 node7: 99948
 node8: 99918
 node9: 100011
node10: 100093

这是我当前的哈希算法实现。

class Base_Hasher_Consistent
{
    protected $_nodes;

    public function __construct($nodes=NULL)
    {
        $this->_nodes = array();

        $node_count = count($nodes);
        $hashes_per_node = (int)(PHP_INT_MAX / $node_count);

        $old_hash_count = 0;

        foreach($nodes as $node){
            if(!($node == $nodes[$node_count-1])){
                $this->_nodes[] = array(
                    'name' => $node,
                    'begin' => $old_hash_count,
                    'end' => $old_hash_count + $hashes_per_node - 1
                );

                $old_hash_count += $hashes_per_node;
            }else{
                $this->_nodes[] = array(
                    'name' => $node,
                    'begin' => $old_hash_count,
                    'end' => PHP_INT_MAX
                );
            }
        }
    }

    public function hashToNode($data_key,$count=0)
    {
        $hash_code = (int)abs(crc32($data_key));

        foreach($this->_nodes as $node){
            if($hash_code >= $node['begin']){
                if($hash_code <= $node['end']){
                    return($node['name']);
                }
            }
        }
    }
}

我是否遗漏了什么，或者该算法真的只是比 Flexihash 快吗？另外，我确实了解 Flexihash 支持查找多个节点，所以我不确定这是否与此有关。

我只是想要一些保证，让我了解一致性哈希是如何工作的，或者可能链接到一些真正解释它的文章。

谢谢！

原文

I've been looking at consistent hashing algorithms for PHP over the past couple of days. I wish to gain a better understanding of how consistent hashing actually works, so I can utilize it on some upcoming projects. It appears to me that Flexihash is really the only pure-PHP implementation out there that was easy to understand, so I took some notes from it.

I have created a small algorithm of my own to attempt to understand how it works, and how to make it work as fast as possible. I was surprised at how fast my algorithm was compared to Flexihash, which makes me wonder if my implementation is flawed in some way, or perhaps I'm not grasping a crucial piece of the entire concept.

The speed differences are listed below, over an iteration of 1 million sequential keys (0 to 1,000,000). Each node is displayed to show how many keys actually hashed to that particular node.

Flexihash:
 Time: 269 seconds
Speed: 3714 hashes/sec

 node1: 80729
 node2: 88390
 node3: 103623
 node4: 112338
 node5: 79893
 node6: 85377
 node7: 80966
 node8: 134462
 node9: 117046
node10: 117176

My implementation:
 Time: 3 seconds
Speed: 265589 hashes/sec

 node1: 100152
 node2: 100018
 node3: 99884
 node4: 99889
 node5: 100057
 node6: 100030
 node7: 99948
 node8: 99918
 node9: 100011
node10: 100093

Here is my current implementation of the hashing algorithm.

class Base_Hasher_Consistent
{
    protected $_nodes;

    public function __construct($nodes=NULL)
    {
        $this->_nodes = array();

        $node_count = count($nodes);
        $hashes_per_node = (int)(PHP_INT_MAX / $node_count);

        $old_hash_count = 0;

        foreach($nodes as $node){
            if(!($node == $nodes[$node_count-1])){
                $this->_nodes[] = array(
                    'name' => $node,
                    'begin' => $old_hash_count,
                    'end' => $old_hash_count + $hashes_per_node - 1
                );

                $old_hash_count += $hashes_per_node;
            }else{
                $this->_nodes[] = array(
                    'name' => $node,
                    'begin' => $old_hash_count,
                    'end' => PHP_INT_MAX
                );
            }
        }
    }

    public function hashToNode($data_key,$count=0)
    {
        $hash_code = (int)abs(crc32($data_key));

        foreach($this->_nodes as $node){
            if($hash_code >= $node['begin']){
                if($hash_code <= $node['end']){
                    return($node['name']);
                }
            }
        }
    }
}

Am I missing something, or is the algorithm really just faster than Flexihash? Also, I do understand that Flexihash supports looking up multiple nodes, so I'm not sure if that has anything to do with it.

I would just like some reassurance that I understand how consistent hashing works, or maybe links to some articles that really explain it well.

Thanks!

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

浅沫记忆 2024-11-22 05:30:54

那么您想知道 crc32 是如何工作的，还是只是想知道如何编写一个好的“bucket”实现？

您的存储桶实现看起来不错。如果您这样做，您可能会做得更快：

$bucket_index = floor($hash_code / $hashes_per_node);
return $this->_nodes[$bucket_index]['name'];

您编写的“算法”只是创建 $nodes 数量的存储桶，并计算出 $data_key 应该在这些存储桶中的哪些存储桶中去。您使用的实际哈希算法是 crc32，如果您正在处理存储桶，这可能不是理想的算法。

如果你想知道 crc32 是如何工作的以及为什么散列是一致的......请在维基百科或其他东西上查找。据我所知，不存在不一致的哈希这样的事情，因此所有哈希算法在定义上都是一致的。

哈希算法的一个特点是，它可以为相似的 data_key 生成非常不同的哈希值。对于 crc32 来说也是如此，因为 crc32 旨在检测微小的变化。但它不能保证生成的哈希值“良好传播”。由于您正在处理存储桶，因此您需要一种散列算法，该算法具有特定的属性，可以在整个范围内生成散列。对于 crc32 来说，它可能只是巧合地工作得很好。我就用md5。

So do you want to know how crc32 works or are you just wondering how to write a good 'bucket' implementation?

Your bucket implementation looks fine. You could probably do it faster if you just did:

$bucket_index = floor($hash_code / $hashes_per_node);
return $this->_nodes[$bucket_index]['name'];

The 'algorithm' your wrote just makes $nodes number of buckets and figures out in which of those buckets a $data_key should go. The actual hashing algorithm you're using is crc32, which may not be an ideal algorithm if you're doing buckets.

If you want to know how crc32 works and why the hash is consistent .. look it up on wikipedia or something. To my knowledge there is no such thing as an inconsistent hash, so all hashing algorithms are by definition consistent.

One characteristic of a hashing algorithm is that it can generate hashes that are very dissimilar for data_keys that are similar. This is true for crc32 since crc32 is intended to detect minor changes. What it does not guarantee though is that the resulting hashes are 'well spread'. Since you're doing buckets you want a hashing algorithm that has the specific property that it generates hashes all over the spectrum. For crc32 it may just coincidentally work really well. I'd just use md5.

回复收藏 0 原文