How likely is a hash code collision with this hash code function?
How likely is it to get a HashCode collision with the function below in the following scenarios?
- With random int values for key[0], key[1], key[2], key[3]
- With random key values under the following constraints:
  - key[0] < 1,000,000
  - key[1] < 10,000
  - key[2] < 1,000
  - key[3] < 1,000
Assume we have 10 million objects.
int[] key = new int[4];
public override int GetHashCode()
{
// Use large prime multiples to create a unique hash key
// Create the hash offsets using a "even powers of 2 minus 1" method, which gives
// primes most of the time.
int hashKey = 0;
hashKey += 2047 * key[0];
hashKey += 8191 * key[1];
hashKey += 32767 * key[2];
hashKey += 131071 * key[3];
return hashKey;
}
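For what it's worth, collisions are easy to exhibit even under the second scenario's constraints. The following is my own illustration (the key values are chosen for the example, they are not from the question): because 2047 * 8191 == 8191 * 2047, two keys that simply move a value between slot 0 and slot 1 hash identically.

using System;

class CollisionDemo
{
    static int Hash(int[] key) =>
        2047 * key[0] + 8191 * key[1] + 32767 * key[2] + 131071 * key[3];

    static void Main()
    {
        int[] a = { 8191, 0, 0, 0 };  // key[0] = 8191 < 1,000,000
        int[] b = { 0, 2047, 0, 0 };  // key[1] = 2047 < 10,000

        Console.WriteLine(Hash(a));   // 16766977
        Console.WriteLine(Hash(b));   // 16766977 -- same hash, different key
    }
}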
Comments (3)
This is kind of a strange question. Let's start with the obvious errors in the code:
First off, those are all odd powers of two minus one; none of them are even powers of two minus one.
Second, of the four multipliers you've chosen as "large prime multiples", half of them are not prime. 2047 and 32767 are composite.
Third, if we "correct" -- and I use the word advisedly -- the statement to be "odd powers of 2 minus one which gives primes most of the time", that statement is absurdly wrong. A prime of that form is known as a Mersenne prime, and there are only 47 known Mersenne primes. I assure you that the density of Mersenne primes is considerably lower than one half. Put it this way: of the odd-power Mersenne numbers between 2^1-1 and 2^43112609-1, 46 of them are known to be prime numbers, which is about one in half a million, not half.
Fourth, what do you imagine prime numbers have to do with anything? What mythological power do prime numbers have? Surely what matters is that the distribution of hash codes tends to not produce multiples of the hash table size. Since the hash table size is chosen to be a prime number, it seems like this is potentially exacerbating the problem.
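As a small illustration of that point (mine, with an arbitrary prime table size of 31): the bucket an object lands in is the hash value modulo the bucket count, so hash codes that are congruent modulo the table size collide in the table no matter how carefully the multipliers were chosen.

using System;

class BucketDemo
{
    static void Main()
    {
        int bucketCount = 31;            // an arbitrary prime bucket count
        int[] hashes = { 62, 93, 124 };  // distinct hash codes, all multiples of 31

        foreach (int h in hashes)
            Console.WriteLine(h % bucketCount);  // all print 0: same bucket despite distinct hashes
    }
}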
Fifth, hash keys are not unique; your question is about when they collide, so clearly they cannot be unique.
Sixth, suppose your hash function had a perfectly random distribution across the space of 32 bit integers. By the birthday "paradox" you'd expect there to be a far greater than 99% chance of at least one collision when drawing ten million numbers at random from a 32 bit space. In fact, the expected number of collisions would be on the order of ten or twenty thousand. (We could work out the exact number of expected collisions, but who cares what it is exactly; it is in that order of magnitude.)
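For a rough check of that figure (my own arithmetic, assuming the perfectly uniform 32-bit hash the paragraph describes), the birthday approximation n^2 / (2m) with n = 10,000,000 and m = 2^32 gives on the order of 12,000 expected collisions:

using System;

class BirthdayEstimate
{
    static void Main()
    {
        double n = 10_000_000;       // number of objects
        double m = Math.Pow(2, 32);  // size of the 32-bit hash space

        // Expected collisions = n minus the expected number of distinct hash values.
        double expectedDistinct = m * (1 - Math.Pow(1 - 1 / m, n));
        Console.WriteLine(n - expectedDistinct);  // roughly 11,600
    }
}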
Is that too many collisions? It is going to be very difficult to do better than a random distribution. If you require fewer collisions than that, then you shouldn't be using a 32 bit hashing algorithm in the first place.
Seventh, who cares how many collisions a hash function has across its full range? Surely the practical question ought to really be "how does this hash perform with realistic data in a large table?" You, unlike us, can answer that question by trying it. If it meets your performance budget, great, worry about something else. If it doesn't, figure out why not before you start blaming the hash function.
I am very confused by this question and what you hope to gain from its answer. Can you explain?
I wrote a quick script to test this.
When I ran it, it told me I got 23735 collisions.
I also tried it on one million elements, and I got 247 collisions. Both numbers are averages over 4 runs.
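The script itself is not included in the thread; a sketch of that kind of test (my reconstruction, not the poster's actual code), using the constrained-key scenario from the question, could look like this:

using System;
using System.Collections.Generic;

class CollisionTest
{
    static int Hash(int[] key) =>
        2047 * key[0] + 8191 * key[1] + 32767 * key[2] + 131071 * key[3];

    static void Main()
    {
        var rng = new Random();
        var seen = new HashSet<int>();
        int collisions = 0;

        for (int i = 0; i < 10_000_000; i++)
        {
            int[] key =
            {
                rng.Next(1_000_000),  // key[0] < 1,000,000
                rng.Next(10_000),     // key[1] < 10,000
                rng.Next(1_000),      // key[2] < 1,000
                rng.Next(1_000),      // key[3] < 1,000
            };

            // HashSet.Add returns false when this hash value was already seen.
            if (!seen.Add(Hash(key)))
                collisions++;
        }

        Console.WriteLine(collisions);  // the figures above are around 23,700 for 10 million keys
    }
}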
I was going to say you should use
as it would give better results, but when I tested it I was thoroughly surprised. Posting the results anyway because as a scientist "results you did not expect are still results".
Collisions1 is your method, Collisions2 is my method; these are the results of 4 runs.