I want to know the probability of getting duplicate values when calling the GetHashCode() method on string instances. For instance, according to this blog post, blair and brainlessness have the same hashcode (1758039503) on an x86 machine.
Large.
(Sorry Jon!)
The probability of getting a hash collision among short strings is extremely large. Given a set of only ten thousand distinct short strings drawn from common words, the probability of there being at least one collision in the set is approximately 1%. If you have eighty thousand strings, the probability of there being at least one collision is over 50%.
For a graph showing the relationship between set size and probability of collision, see my article on the subject:
https://learn.microsoft.com/en-us/archive/blogs/ericlippert/socks-birthdays-and-hash-collisions
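As a rough check on those figures, here is a minimal sketch of the standard birthday approximation. It assumes an idealized hash that spreads strings uniformly over 2^32 codes, which a real string hash only approximates; the function name collision_probability is made up for illustration.

```python
import math

def collision_probability(m: int, bits: int = 32) -> float:
    """Approximate chance of at least one collision among m distinct items,
    assuming (hypothetically) a perfectly uniform hash with 2**bits codes."""
    n = 2 ** bits
    # Birthday approximation: 1 - exp(-m(m-1) / (2n)).
    # expm1 keeps precision when the exponent is very small.
    return -math.expm1(-m * (m - 1) / (2 * n))

for m in (10_000, 80_000):
    print(m, round(collision_probability(m), 3))
```

Under that assumption the two set sizes come out at about 1% and a little over 50%, in line with the numbers quoted above.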
Small - if you're talking about the chance of any two arbitrary unequal strings having a collision. (It will depend on just how "arbitrary" the strings are, of course - different contexts will be using different strings.)
Large - if you're talking about the chance of there being at least one collision in a large pool of arbitrary strings. The small individual probabilities are no match for the birthday problem.
That's about all you need to know. There are definitely cases where there will be collisions, and there have to be, given that there are only 2^32 possible hash codes and more than that many strings - so the pigeonhole principle proves that at least one hash code must have more than one string which generates it. However, you should trust that the hash has been designed to be pretty reasonable.
You can rely on it as a pretty good way of narrowing down the possible matches for a particular string. It would be an unusual set of naturally-occurring strings which generated a lot of collisions - and even when there are some collisions, obviously if you can narrow a candidate search set down from 50K to fewer than 10 strings, that's a pretty big win. But you must not rely on it as a unique value for any string.
Note that the algorithm used in .NET 4 differs between x86 and x64, so that example probably isn't valid on both platforms.
I think all that's possible to say is "small, but finite and definitely not zero" -- in other words you must not rely on GetHashCode() ever returning unique values for two different instances. To my mind, hashcodes are best used when you want to tell quickly if two instances are different -- not if they're the same.
In other words, if two objects have different hash codes, you know they are different and need not do a (possibly expensive) deeper comparison.
However, if the hash codes for two objects are the same, you must go on to compare the objects themselves to see if they're actually the same.
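The two-step check described above might look like the following sketch. Python's built-in hash() stands in for GetHashCode() here, and both definitely_equal and the deliberately weak hash are hypothetical names used only for illustration.

```python
from typing import Callable

def definitely_equal(a: str, b: str,
                     hash_fn: Callable[[str], int] = hash) -> bool:
    # Different hash codes prove the values differ, so the
    # (possibly expensive) full comparison can be skipped.
    if hash_fn(a) != hash_fn(b):
        return False
    # Equal hash codes prove nothing: confirm with a real comparison.
    return a == b

# A deliberately weak hash for demonstration: every string of the
# same length collides.
weak_hash = len

# Same weak hash code, different strings: the full comparison decides.
print(definitely_equal("blair", "blare", weak_hash))
```

The point of the design is that the hash comparison can only ever answer "different"; the answer "same" always comes from comparing the values themselves.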
I ran a test on a database of 466k English words and got 48 collisions with string.GetHashCode(). MurmurHash gives slightly better results. More results are here: https://github.com/jitbit/MurmurHash.net
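A test of this kind could be sketched as follows. This is not the code behind the figures above: the counting rule (a collision is a distinct word whose code was already produced by another word) is an assumption, and zlib.crc32 is only a stand-in 32-bit hash, not the algorithm used by string.GetHashCode() or MurmurHash.

```python
import zlib
from typing import Callable, Iterable

def count_collisions(words: Iterable[str],
                     hash_fn: Callable[[str], int]) -> int:
    """Count distinct words whose hash code was already produced
    by another distinct word."""
    seen_codes = set()
    collisions = 0
    for word in set(words):  # repeated words are not collisions
        code = hash_fn(word)
        if code in seen_codes:
            collisions += 1
        else:
            seen_codes.add(code)
    return collisions

def crc32_hash(word: str) -> int:
    # Stand-in 32-bit hash, chosen only because it ships with Python.
    return zlib.crc32(word.encode("utf-8"))

# Toy run with a deliberately bad hash (string length):
# five distinct words, two distinct codes, so three collisions.
print(count_collisions(["a", "b", "cc", "dd", "ee"], len))
```

To try it on a real word list, pass the words and crc32_hash (or any other function from str to int) to count_collisions.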
Just in case your question is really about the probability of a collision within a group of strings:
For n available slots and m occupying items:
Prob. of no collision on first insertion is 1.
Prob. of no collision on 2nd insertion is ( n - 1 ) / n
Prob. of no collision on 3rd insertion is ( n - 2 ) / n
Prob. of no collision on mth insertion is ( n - ( m - 1 ) ) / n
The probability of no collision after m insertions is the product of the above values: (n - 1)!/((n - m)! * n^(m - 1)), which is the same as n!/((n - m)! * n^m), or m! * ( n choose m ) / ( n^m ).
And everybody is right, you can't assume 0 collisions, so saying the probability is "low" may be true but doesn't allow you to assume that there will be no collisions. If you're looking at a hash table, I think the standard is that you begin to have trouble with significant collisions when your hash table is about two-thirds full.
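The insertion-by-insertion product above can be checked numerically. The sketch below uses exact fractions and compares the product with the closed form n!/((n - m)! * n^m), written here as m! * C(n, m) / n^m; the function names are made up for illustration, and the model assumes every slot is equally likely.

```python
from fractions import Fraction
from math import comb, factorial

def no_collision_product(n: int, m: int) -> Fraction:
    """Multiply the per-insertion probabilities (n - i) / n for i = 0..m-1."""
    p = Fraction(1)
    for i in range(m):
        p *= Fraction(n - i, n)
    return p

def no_collision_closed_form(n: int, m: int) -> Fraction:
    """Closed form: m! * C(n, m) / n^m, equal to n! / ((n - m)! * n^m)."""
    return Fraction(factorial(m) * comb(n, m), n ** m)

# Classic birthday setting: 365 slots, 23 items.
n, m = 365, 23
assert no_collision_product(n, m) == no_collision_closed_form(n, m)
print(float(1 - no_collision_product(n, m)))  # chance of at least one collision
```

With 365 slots and 23 items the chance of at least one collision is just over one half, which is the familiar birthday-problem result.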
The probability of a collision between two randomly chosen strings is 1 / 2^(bits in hash code), if the hash is perfect (that is, spreads strings evenly over all possible codes), which is unlikely or impossible.