与 std::hash 发生意外冲突
我知道将无限数量的字符串散列到 32b int 中一定会产生冲突,但我期望散列函数能得到一些不错的分布。
这两个字符串具有相同的哈希值,这不是很奇怪吗?
size_t hash0 = std::hash<std::string>()("generated_id_0");
size_t hash1 = std::hash<std::string>()("generated_id_1");
//hash0 == hash1
我知道我可以使用 boost::hash
或其他,但我想知道 std::hash
有什么问题。难道是我用错了?我不应该以某种方式“播种”它吗?
I know hashing infinite number of string into 32b int must generate collision, but I expect from hashing function some nice distribution.
Isn't it weird that these 2 strings have the same hash?
size_t hash0 = std::hash<std::string>()("generated_id_0");
size_t hash1 = std::hash<std::string>()("generated_id_1");
//hash0 == hash1
I know I can use boost::hash<std::string>
or others, but I want to know what is wrong with std::hash
. Am I using it wrong? Shouldn't I somehow "seed" it?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(5)
您使用
std::hash
没有任何问题。问题在于,与 Visual Studio 2010 捆绑在一起的标准库实现提供的专门化std::hash
仅采用字符串字符的子集来确定哈希值(大概是为了性能原因)。巧合的是,14 个字符的字符串的最后一个字符不属于该集合,这就是两个字符串产生相同哈希值的原因。据我所知,这种行为符合标准,该标准仅要求使用相同参数多次调用哈希函数必须始终返回相同的值。然而,哈希冲突的概率应该很小。 VS2010 实现满足了强制部分,但未能考虑可选部分。
有关详细信息,请参阅头文件
xfunction
中的实现(从我的副本中的第 869 行开始)和 C++ 标准的 §17.6.3.4 (最新公开草案)。如果您绝对需要一个更好的字符串哈希函数,您应该自己实现它。实际上没那么难。
There's nothing wrong with your usage of
std::hash
. The problem is that the specializationstd::hash<std::string>
provided by the standard library implementation bundled with Visual Studio 2010 only takes a subset of the string's characters to determine the hash value (presumably for performance reasons). Coincidentally the last character of a string with 14 characters is not part of this set, which is why both strings yield the same hash value.As far as I know this behaviour is in conformance with the standard, which demands only that multiple calls to the hash function with the same argument must always return the same value. However, the probability of a hash collision should be minimal. The VS2010 implementation fulfills the mandatory part, yet fails to account for the optional one.
For details, see the implementation in the header file
xfunctional
(starting at line 869 in my copy) and §17.6.3.4 of the C++ standard (latest public draft).If you absolutely need a better hash function for strings, you should implement it yourself. It's actually not that hard.
标准没有指定确切的哈希算法,因此结果
会有所不同。 VC10使用的算法似乎并没有采取所有的
如果字符串长度超过 10 个字符,则考虑字符;它
以
1 + s.size() / 10
的增量前进。这是合法的,尽管从 QoI 的角度来看,相当令人失望;这样的哈希码
众所周知,对于某些典型的数据集(例如
网址)。我强烈建议您将其替换为 FNV 哈希或
一个基于梅森素数:
FNV 哈希:
梅森素数哈希:
(FNV 哈希据说更好,但梅森素数哈希将是
在很多机器上更快,因为乘以 127 通常是
明显快于乘以 16777619。)
The exact hash algorithm isn't specified by the standard, so the results
will vary. The algorithm used by VC10 doesn't seem to take all of the
characters into account if the string is longer than 10 characters; it
advances with an increment of
1 + s.size() / 10
. This is legal,albeit from a QoI point of view, rather disappointing; such hash codes
are known to perform very poorly for some typical sets of data (like
URLs). I'd strongly suggest you replace it with either a FNV hash or
one based on a Mersenne prime:
FNV hash:
Mersenne prime hash:
(The FNV hash is supposedly better, but the Mersenne prime hash will be
faster on a lot of machines, because multiplying by 127 is often
significantly faster than multiplying by 16777619.)
您可能会得到不同的哈希值。我得到不同的哈希值(GCC 4.5):
hashtest.cpp
输出
You should likely get different hash values. I get different hash values (GCC 4.5):
hashtest.cpp
Output
你不播种哈希函数,你最多只能给“它们”加盐。
如果以正确的方式使用该功能,这种碰撞可能只是偶然的。
除非您使用随机密钥执行大量测试,否则您无法判断哈希函数是否分布不均匀。
You do not seed hashing function, you can just salt "them" at most.
The function is used in the right way and this collision could be just fortuitous.
You cannot tell whether the hashing function is not evenly distributed unless you perform a massive test with random keys.
TR1 哈希函数和最新标准为字符串等内容定义了适当的重载。当我使用 std::tr1::hash (g++ 4.1.2) 运行此代码时,我得到这两个字符串的不同哈希值。
The TR1 hash function and the newest standard define proper overloads for things like strings. When I run this code using std::tr1::hash (g++ 4.1.2), I get different hash values for these two strings.