hashtable c hash-collision quadratic-probing

从线性探测转向二次探测（哈希碰撞）

发布于 2024-08-22 19:51:16 字数 1753 浏览 9 评论 0原文

我当前的哈希表实现是使用线性探测，现在我想转向二次探测（后来转向链接，也许还有双重哈希）。我读过一些文章、教程、维基百科等......但我仍然不知道我到底应该做什么。

基本上，线性探测的步长为 1，这很容易做到。当从哈希表中搜索、插入或删除元素时，我需要计算哈希值，为此，我这样做：

index = hash_function(key) % table_size;

然后，在搜索、插入或删除时，我循环遍历表，直到找到一个空闲的存储桶，如下所示

do {
    if(/* CHECK IF IT'S THE ELEMENT WE WANT */) {
        // FOUND ELEMENT

        return;
    } else {
        index = (index + 1) % table_size;
    }
while(/* LOOP UNTIL IT'S NECESSARY */);

：对于二次探测，我认为我需要做的是改变“索引”步长的计算方式，但这就是我不明白应该如何做。我看过各种代码，它们都有些不同。

另外，我已经看到了二次探测的一些实现，其中哈希函数被更改以适应这种情况（但不是全部）。是否真的需要进行这种更改，或者我可以避免修改哈希函数并仍然使用二次探测吗？

编辑： 在阅读了 Eli Bendersky 下面指出的所有内容后，我想我已经有了总体想法。这是http://eternallyconfuzzled.com/tuts/datastructs/jsw_tut_hashtable.aspx<的部分代码< /a>：

15   for ( step = 1; table->table[h] != EMPTY; step++ ) {
16     if ( compare ( key, table->table[h] ) == 0 )
17       return 1;
18 
19     /* Move forward by quadratically, wrap if necessary */
20     h = ( h + ( step * step - step ) / 2 ) % table->size;
21   }

有两件事我不明白......他们说二次探测通常是使用c(i)=i^2完成的。然而，在上面的代码中，它所做的事情更像是 c(i)=(i^2-i)/2

我准备在我的代码上实现这个，但我会简单地这样做：

index = (index + (index^index)) % table_size;

.. .而不是：

index = (index + (index^index - index)/2) % table_size;

如果有的话，我会这样做：

index = (index + (index^index)/2) % table_size;

...因为我已经看到其他代码示例下降了两倍。虽然我不明白为什么...

1)为什么要减去步数？
2) 为什么它会减少 2？

原文

My current implementation of an Hash Table is using Linear Probing and now I want to move to Quadratic Probing (and later to chaining and maybe double hashing too). I've read a few articles, tutorials, wikipedia, etc... But I still don't know exactly what I should do.

Linear Probing, basically, has a step of 1 and that's easy to do. When searching, inserting or removing an element from the Hash Table, I need to calculate an hash and for that I do this:

index = hash_function(key) % table_size;

Then, while searching, inserting or removing I loop through the table until I find a free bucket, like this:

do {
    if(/* CHECK IF IT'S THE ELEMENT WE WANT */) {
        // FOUND ELEMENT

        return;
    } else {
        index = (index + 1) % table_size;
    }
while(/* LOOP UNTIL IT'S NECESSARY */);

As for Quadratic Probing, I think what I need to do is change how the "index" step size is calculated but that's what I don't understand how I should do it. I've seen various pieces of code, and all of them are somewhat different.

Also, I've seen some implementations of Quadratic Probing where the hash function is changed to accommodated that (but not all of them). Is that change really needed or can I avoid modifying the hash function and still use Quadratic Probing?

EDIT:
After reading everything pointed out by Eli Bendersky below I think I got the general idea. Here's part of the code at http://eternallyconfuzzled.com/tuts/datastructures/jsw_tut_hashtable.aspx:

15   for ( step = 1; table->table[h] != EMPTY; step++ ) {
16     if ( compare ( key, table->table[h] ) == 0 )
17       return 1;
18 
19     /* Move forward by quadratically, wrap if necessary */
20     h = ( h + ( step * step - step ) / 2 ) % table->size;
21   }

There's 2 things I don't get... They say that quadratic probing is usually done using c(i)=i^2. However, in the code above, it's doing something more like c(i)=(i^2-i)/2

I was ready to implement this on my code but I would simply do:

index = (index + (index^index)) % table_size;

...and not:

index = (index + (index^index - index)/2) % table_size;

If anything, I would do:

index = (index + (index^index)/2) % table_size;

...cause I've seen other code examples diving by two. Although I don't understand why...

1) Why is it subtracting the step?
2) Why is it diving it by 2?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

九公里浅绿 2024-08-29 19:51:16

如果表大小是 2 的幂，则有一种特别简单而优雅的方法来实现二次探测：

step = 1;

do {
    if(/* CHECK IF IT'S THE ELEMENT WE WANT */) {
        // FOUND ELEMENT

        return;
    } else {
        index = (index + step) % table_size;
        step++;
    }
} while(/* LOOP UNTIL IT'S NECESSARY */);

不是查看原始索引的偏移量 0、1、2、3、4...，而是查看偏移量 0 , 1, 3, 6, 10...（第 i^th 探头的偏移量为 (i*(i+1))/2，即它是二次的）。

这保证会命中哈希表中的每个位置（因此，如果有的话，您一定会找到一个空桶）假设表大小是 2 的幂。

这是证明的草图：

给定一个 n 的表大小，我们想要表明我们将得到 n 个不同的值 (i*(i+1))/2 (mod n)，其中 i = 0 ... n-1。
我们可以用反证法来证明这一点。假设有少于 n 个不同的值：如果是这样，则 i 在 [0, n-1] 范围内必须至少有两个不同的整数值，使得 (i*(i+1))/2 (mod n ）是一样的。称这些为 p 和 q，其中 p < q.
即(p * (p+1)) / 2 = (q * (q+1)) / 2 (mod n)
=> (p² + p) / 2 = (q² + q) / 2 (mod n)
=> p² + p = q² + q (mod 2n)
=> q² - p² + q - p = 0 (mod 2n)
因式分解 => (q - p) (p + q + 1) = 0 (mod 2n)
(q - p) = 0 是简单的情况 p = q。
(p + q + 1) = 0 (mod 2n) 是不可能的：我们的 p 和 q 值在 [0, n-1] 范围内，并且 q > n p，因此 (p + q + 1) 必须在 [2, 2n-2] 范围内。
当我们对 2n 求模时，我们还必须处理两个因子都非零但相乘得到 0 (mod 2n) 的棘手情况：
- 观察两个因子 (q - p) 和 (p + q + 1) 之间的差是 (2p + 1)，这是一个奇数 - 因此其中一个因子必须是偶数，另一个必须是偶数保持古怪。
- (q - p) (p + q + 1) = 0 (mod 2n) => (q - p) (p + q + 1) 可被 2n 整除。 如果 n（因此 2n）是 2 的幂，则要求偶因数是 2n 的倍数（因为 2n 的所有素因数都是 2，而 2n 的素因数都不为 2）我们的奇怪因素是）。
- 但是 (q - p) 的最大值为 n-1，而 (p + q + 1) 的最大值为 2n-2（如步骤 9 所示），因此两者都不能是 2n 的倍数.
- 所以这种情况也是不可能的。
因此，（步骤 2 中）少于 n 个不同值的假设必定是错误的。

（如果表大小不是 2 的幂，则会在第 10 步失败。）

There is a particularly simple and elegant way to implement quadratic probing if your table size is a power of 2:

step = 1;

do {
    if(/* CHECK IF IT'S THE ELEMENT WE WANT */) {
        // FOUND ELEMENT

        return;
    } else {
        index = (index + step) % table_size;
        step++;
    }
} while(/* LOOP UNTIL IT'S NECESSARY */);

Instead of looking at offsets 0, 1, 2, 3, 4... from the original index, this will look at offsets 0, 1, 3, 6, 10... (the i^th probe is at offset (i*(i+1))/2, i.e. it's quadratic).

This is guaranteed to hit every position in the hash table (so you are guaranteed to find an empty bucket if there is one) provided the table size is a power of 2.

Here is a sketch of a proof:

Given a table size of n, we want to show that we will get n distinct values of (i*(i+1))/2 (mod n) with i = 0 ... n-1.
We can prove this by contradiction. Assume that there are fewer than n distinct values: if so, there must be at least two distinct integer values for i in the range [0, n-1] such that (i*(i+1))/2 (mod n) is the same. Call these p and q, where p < q.
i.e. (p * (p+1)) / 2 = (q * (q+1)) / 2 (mod n)
=> (p² + p) / 2 = (q² + q) / 2 (mod n)
=> p² + p = q² + q (mod 2n)
=> q² - p² + q - p = 0 (mod 2n)
Factorise => (q - p) (p + q + 1) = 0 (mod 2n)
(q - p) = 0 is the trivial case p = q.
(p + q + 1) = 0 (mod 2n) is impossible: our values of p and q are in the range [0, n-1], and q > p, so (p + q + 1) must be in the range [2, 2n-2].
As we are working modulo 2n, we must also deal with the tricky case where both factors are non-zero, but multiply to give 0 (mod 2n):
- Observe that the difference between the two factors (q - p) and (p + q + 1) is (2p + 1), which is an odd number - so one of the factors must be even, and the other must be odd.
- (q - p) (p + q + 1) = 0 (mod 2n) => (q - p) (p + q + 1) is divisible by 2n. If n (and hence 2n) is a power of 2, this requires the even factor to be a multiple of 2n (because all of the prime factors of 2n are 2, whereas none of the prime factors of our odd factor are).
- But (q - p) has a maximum value of n-1, and (p + q + 1) has a maximum value of 2n-2 (as seen in step 9), so neither can be a multiple of 2n.
- So this case is impossible as well.
Therefore the assumption that there are fewer than n distinct values (in step 2) must be false.