与 std::hash 发生意外冲突

发布于 2024-12-13 06:32:42 字数 397 浏览 0 评论 0原文

我知道将无限数量的字符串散列到 32b int 中一定会产生冲突，但我期望散列函数能得到一些不错的分布。

这两个字符串具有相同的哈希值，这不是很奇怪吗？

size_t hash0 = std::hash<std::string>()("generated_id_0");
size_t hash1 = std::hash<std::string>()("generated_id_1");
//hash0 == hash1

我知道我可以使用 boost::hash 或其他，但我想知道 std::hash 有什么问题。难道是我用错了？我不应该以某种方式“播种”它吗？

原文

I know hashing infinite number of string into 32b int must generate collision, but I expect from hashing function some nice distribution.

Isn't it weird that these 2 strings have the same hash?

size_t hash0 = std::hash<std::string>()("generated_id_0");
size_t hash1 = std::hash<std::string>()("generated_id_1");
//hash0 == hash1

I know I can use boost::hash<std::string> or others, but I want to know what is wrong with std::hash. Am I using it wrong? Shouldn't I somehow "seed" it?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

夜还是长夜 2024-12-20 06:32:42

您使用 std::hash 没有任何问题。问题在于，与 Visual Studio 2010 捆绑在一起的标准库实现提供的专门化 std::hash 仅采用字符串字符的子集来确定哈希值（大概是为了性能原因）。巧合的是，14 个字符的字符串的最后一个字符不属于该集合，这就是两个字符串产生相同哈希值的原因。

据我所知，这种行为符合标准，该标准仅要求使用相同参数多次调用哈希函数必须始终返回相同的值。然而，哈希冲突的概率应该很小。 VS2010 实现满足了强制部分，但未能考虑可选部分。

有关详细信息，请参阅头文件 xfunction 中的实现（从我的副本中的第 869 行开始）和 C++ 标准的 §17.6.3.4 (最新公开草案）。

如果您绝对需要一个更好的字符串哈希函数，您应该自己实现它。实际上没那么难。

回复收藏 0 原文

灯角 2024-12-20 06:32:42

标准没有指定确切的哈希算法，因此结果
会有所不同。 VC10使用的算法似乎并没有采取所有的
如果字符串长度超过 10 个字符，则考虑字符；它
以 1 + s.size() / 10 的增量前进。这是合法的，
尽管从 QoI 的角度来看，相当令人失望；这样的哈希码
众所周知，对于某些典型的数据集（例如
网址）。我强烈建议您将其替换为 FNV 哈希或
一个基于梅森素数：

FNV 哈希：

struct hash
{
    size_t operator()( std::string const& s ) const
    {
        size_t result = 2166136261U ;
        std::string::const_iterator end = s.end() ;
        for ( std::string::const_iterator iter = s.begin() ;
              iter != end ;
              ++ iter ) {
            result = (16777619 * result)
                    ^ static_cast< unsigned char >( *iter ) ;
        }
        return result ;
    }
};

梅森素数哈希：

struct hash
{
    size_t operator()( std::string const& s ) const
    {
        size_t result = 2166136261U ;
        std::string::const_iterator end = s.end() ;
        for ( std::string::const_iterator iter = s.begin() ;
              iter != end ;
              ++ iter ) {
            result = 127 * result
                   + static_cast< unsigned char >( *iter ) ;
        }
        return result ;
    }
};

（FNV 哈希据说更好，但梅森素数哈希将是
在很多机器上更快，因为乘以 127 通常是
明显快于乘以 16777619。）

The exact hash algorithm isn't specified by the standard, so the results
will vary. The algorithm used by VC10 doesn't seem to take all of the
characters into account if the string is longer than 10 characters; it
advances with an increment of 1 + s.size() / 10. This is legal,
albeit from a QoI point of view, rather disappointing; such hash codes
are known to perform very poorly for some typical sets of data (like
URLs). I'd strongly suggest you replace it with either a FNV hash or
one based on a Mersenne prime:

FNV hash:

struct hash
{
    size_t operator()( std::string const& s ) const
    {
        size_t result = 2166136261U ;
        std::string::const_iterator end = s.end() ;
        for ( std::string::const_iterator iter = s.begin() ;
              iter != end ;
              ++ iter ) {
            result = (16777619 * result)
                    ^ static_cast< unsigned char >( *iter ) ;
        }
        return result ;
    }
};

Mersenne prime hash:

struct hash
{
    size_t operator()( std::string const& s ) const
    {
        size_t result = 2166136261U ;
        std::string::const_iterator end = s.end() ;
        for ( std::string::const_iterator iter = s.begin() ;
              iter != end ;
              ++ iter ) {
            result = 127 * result
                   + static_cast< unsigned char >( *iter ) ;
        }
        return result ;
    }
};

(The FNV hash is supposedly better, but the Mersenne prime hash will be
faster on a lot of machines, because multiplying by 127 is often
significantly faster than multiplying by 16777619.)

回复收藏 0 原文

司马昭之心 2024-12-20 06:32:42

您可能会得到不同的哈希值。我得到不同的哈希值（GCC 4.5）：

hashtest.cpp

#include <string>
#include <iostream>
#include <functional>
int main(int argc, char** argv)
{
size_t hash0 = std::hash<std::string>()("generated_id_0");
size_t hash1 = std::hash<std::string>()("generated_id_1");
std::cout << hash0 << (hash0 == hash1 ? " == " : " != ") << hash1 << "\n";
return 0;
}

输出

# g++ hashtest.cpp -o hashtest -std=gnu++0x
# ./hashtest
16797002355621538189 != 16797001256109909978

You should likely get different hash values. I get different hash values (GCC 4.5):

hashtest.cpp

#include <string>
#include <iostream>
#include <functional>
int main(int argc, char** argv)
{
size_t hash0 = std::hash<std::string>()("generated_id_0");
size_t hash1 = std::hash<std::string>()("generated_id_1");
std::cout << hash0 << (hash0 == hash1 ? " == " : " != ") << hash1 << "\n";
return 0;
}

Output

# g++ hashtest.cpp -o hashtest -std=gnu++0x
# ./hashtest
16797002355621538189 != 16797001256109909978

回复收藏 0 原文

给妤﹃绝世温柔 2024-12-20 06:32:42

你不播种哈希函数，你最多只能给“它们”加盐。

如果以正确的方式使用该功能，这种碰撞可能只是偶然的。

除非您使用随机密钥执行大量测试，否则您无法判断哈希函数是否分布不均匀。

回复收藏 0 原文

蓝礼 2024-12-20 06:32:42

TR1 哈希函数和最新标准为字符串等内容定义了适当的重载。当我使用 std::tr1::hash (g++ 4.1.2) 运行此代码时，我得到这两个字符串的不同哈希值。

回复收藏 0 原文

~没有更多了~

关于作者

不喜欢何必死缠烂打

暂无简介

0 文章

0 评论

24 人气

关注发私信

友情链接

文江博客

与 std::hash 发生意外冲突

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（5）

hashtest.cpp

输出

hashtest.cpp

Output

关于作者

相关话题

热门标签

推荐作者

小瓶盖

wxsp_Ukbq8xGR

1638627670

仅一夜美梦

夜访吸血鬼

近卫軍团

友情链接

与 std::hash 发生意外冲突

如果你对这篇内容有疑问，欢迎到本站社区发帖提问 参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（5）

hashtest.cpp

输出

hashtest.cpp

Output

关于作者

相关话题

热门标签

推荐作者

小瓶盖

wxsp_Ukbq8xGR

1638627670

仅一夜美梦

夜访吸血鬼

近卫軍团

友情链接

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。