使用 unordered_set 防止不同哈希值的键落在同一个存储桶中

发布于 2024-09-29 15:31:46 字数 1889 浏览 10 评论 0原文

这可能是一个愚蠢的问题,但这里是:

我将字典哈希到基于 unordered_set 的哈希表中。我的哈希函数被故意设置为“坏”,因为包含同一组字母的所有字符串都会哈希为相同的值。我最初尝试覆盖正常的哈希函数行为,并使用每个单词中字母的“频率直方图”作为哈希值(我了解到这是不可能的:)),但其中一个线程建议使用 26-位位掩码实现相同。到目前为止,哈希函数运行得很好。

例如,在我的方案中,CITIED 和 CITED 散列到相同的值 1049144。我的想法是,给定一组字母,我想找到包含该组字母的所有单词。

我猜我还没有完全理解哈希的概念(或者我的代码完全错误),因为我无法完全解释我遇到的行为:
我决定查找由字符串“LIVEN”中的字母组成的所有单词。 我的输出(带有哈希键)如下:

VENVILLE,4215328  
LEVIN,4215328  
ENLIVEN,4215328  
CURTSEYED,37486648  

CURTSEYED 到底是如何降落在那里的?可以看出,它的哈希值与其余三个单词不同。我对哈希表的理解/实现的问题出在哪里?

产生上述输出的代码:


    typedef std::unordered_set< std::string, my_string_hash_function, my_string_equality> Dict    
    DictHash dict;       
    DictHash::const_local_iterator c_l_itr;

    DictHash::size_type bs = dict.bucket (std::string ("LIVEN"));
    for (c_l_itr = dict.begin(bs); c_l_itr != dict.end(bs); c_l_itr++)
         std::cout 

我的哈希函数:

struct my_string_hash_function  
{  
    std::size_t operator()(const std::string& s) const  
    {  
        unsigned long hash = 0;  
        std::string::const_iterator itr;

        for (itr = s.begin(); itr != s.end(); itr++)
     hash |= 2 << (*itr - int('A'));

      return hash;
    } 
};

比较函数:

struct my_string_equality
{
    bool operator()(const std::string& s1, const std::string& s2) const
    {
        if (s1.length() != s2.length())
     return false; 

        unsigned int hash1 = 0, hash2 = 0;
        const char *str1, *str2;
        int i,len;

        len = s1.length();
        str1 = s1.c_str();
        str2 = s2.c_str();

        for (i = 0; i < len; i++)
        {
            hash1 |= 2 << (str1[i] - (int)'A');
            hash2 |= 2 << (str2[i] - (int)'A');
        }

        return hash1 == hash2;
   }
};

This might be a silly question, but here goes :

I hashed a dictionary of words into an unordered_set based hash-table. My hash function was made intentionally "bad", in that all strings that contained the same set of letters would hash to the same value. I initially tried to over-ride the normal hash function behaviour, and use a "frequency histogram" of the letters in each word as a hash value (which I learnt was impossible :) ), but one of the threads suggested using a 26-bit bitmask to achieve the same. The hash function works fine and dandy thus far.

For example, in my scheme, CITIED and CITED hash to the same value, 1049144. My idea was that given a set of letters, I wanted to find all the words containing letters from that set.

I am guessing that I haven't quite understood the concept of hashing (or my code is plain wrong), as I can't quite explain the behaviour I encountered :
I decided to look for all words which consisted of letters from the string "LIVEN".
My output (with hash key) was as follows :

VENVILLE,4215328  
LEVIN,4215328  
ENLIVEN,4215328  
CURTSEYED,37486648  

How on earth did CURTSEYED land up there? As can be seen, it has a different hash value from the remaining three words. Where does the fault lie with my understanding/implementation of the hash table?

Code that produced above output:


    typedef std::unordered_set< std::string, my_string_hash_function, my_string_equality> Dict    
    DictHash dict;       
    DictHash::const_local_iterator c_l_itr;

    DictHash::size_type bs = dict.bucket (std::string ("LIVEN"));
    for (c_l_itr = dict.begin(bs); c_l_itr != dict.end(bs); c_l_itr++)
         std::cout 

My hash function :

struct my_string_hash_function  
{  
    std::size_t operator()(const std::string& s) const  
    {  
        unsigned long hash = 0;  
        std::string::const_iterator itr;

        for (itr = s.begin(); itr != s.end(); itr++)
     hash |= 2 << (*itr - int('A'));

      return hash;
    } 
};

Comparison function :

struct my_string_equality
{
    bool operator()(const std::string& s1, const std::string& s2) const
    {
        if (s1.length() != s2.length())
     return false; 

        unsigned int hash1 = 0, hash2 = 0;
        const char *str1, *str2;
        int i,len;

        len = s1.length();
        str1 = s1.c_str();
        str2 = s2.c_str();

        for (i = 0; i < len; i++)
        {
            hash1 |= 2 << (str1[i] - (int)'A');
            hash2 |= 2 << (str2[i] - (int)'A');
        }

        return hash1 == hash2;
   }
};

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

じ违心 2024-10-06 15:31:46

不同的哈希值不一定会出现在不同的桶中。通常,哈希表会根据hash_value % number_of_buckets来选择存储桶,因此与存储桶数量模数相等的哈希值将最终出现在同一个存储桶中。

本质上,您无法保证哪个哈希值出现在哪个存储桶中。

Different hash values will not necessarily end up in different buckets. Generally a hash table will choose a bucket based on hash_value % number_of_buckets, so hashes that are equal modulo the number of buckets will wind up in the same bucket.

Essentially, you cannot guarantee anything about which hash value appears in which bucket.

苏辞 2024-10-06 15:31:46

我认为您在 my_string_equality 中也存在潜在的错误...难道您不想使用常规的 std::string::operator==()? AFAIK 你应该比较实际的对象值,而不是比较它们的哈希值(容器已经知道哈希值,它可以只调用 my_string_hash_function 并比较结果,如果这是它需要的)做)。

I think you've also got a potential bug in the my_string_equality... don't you just want to use the regular std::string::operator==()? AFAIK you should be doing a comparison of the actual object values, not a comparison of their hash (the container already knows the hash value, it could just call my_string_hash_function and compare the results if that was what it needed to do).

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文