从线性探测转向二次探测(哈希碰撞)

发布于 2024-08-22 19:51:16 字数 1753 浏览 5 评论 0原文

我当前的哈希表实现是使用线性探测,现在我想转向二次探测(后来转向链接,也许还有双重哈希)。我读过一些文章、教程、维基百科等......但我仍然不知道我到底应该做什么。

基本上,线性探测的步长为 1,这很容易做到。当从哈希表中搜索、插入或删除元素时,我需要计算哈希值,为此,我这样做:

index = hash_function(key) % table_size;

然后,在搜索、插入或删除时,我循环遍历表,直到找到一个空闲的存储桶,如下所示

do {
    if(/* CHECK IF IT'S THE ELEMENT WE WANT */) {
        // FOUND ELEMENT

        return;
    } else {
        index = (index + 1) % table_size;
    }
while(/* LOOP UNTIL IT'S NECESSARY */);

:对于二次探测,我认为我需要做的是改变“索引”步长的计算方式,但这就是我不明白应该如何做。我看过各种代码,它们都有些不同。

另外,我已经看到了二次探测的一些实现,其中哈希函数被更改以适应这种情况(但不是全部)。是否真的需要进行这种更改,或者我可以避免修改哈希函数并仍然使用二次探测吗?

编辑: 在阅读了 Eli Bendersky 下面指出的所有内容后,我想我已经有了总体想法。这是http://eternallyconfuzzled.com/tuts/datastructs/jsw_tut_hashtable.aspx<的部分代码< /a>:

15   for ( step = 1; table->table[h] != EMPTY; step++ ) {
16     if ( compare ( key, table->table[h] ) == 0 )
17       return 1;
18 
19     /* Move forward by quadratically, wrap if necessary */
20     h = ( h + ( step * step - step ) / 2 ) % table->size;
21   }

有两件事我不明白......他们说二次探测通常是使用c(i)=i^2完成的。然而,在上面的代码中,它所做的事情更像是 c(i)=(i^2-i)/2

我准备在我的代码上实现这个,但我会简单地这样做:

index = (index + (index^index)) % table_size;

.. .而不是:

index = (index + (index^index - index)/2) % table_size;

如果有的话,我会这样做:

index = (index + (index^index)/2) % table_size;

...因为我已经看到其他代码示例下降了两倍。虽然我不明白为什么...

1)为什么要减去步数?
2) 为什么它会减少 2?

My current implementation of an Hash Table is using Linear Probing and now I want to move to Quadratic Probing (and later to chaining and maybe double hashing too). I've read a few articles, tutorials, wikipedia, etc... But I still don't know exactly what I should do.

Linear Probing, basically, has a step of 1 and that's easy to do. When searching, inserting or removing an element from the Hash Table, I need to calculate an hash and for that I do this:

index = hash_function(key) % table_size;

Then, while searching, inserting or removing I loop through the table until I find a free bucket, like this:

do {
    if(/* CHECK IF IT'S THE ELEMENT WE WANT */) {
        // FOUND ELEMENT

        return;
    } else {
        index = (index + 1) % table_size;
    }
while(/* LOOP UNTIL IT'S NECESSARY */);

As for Quadratic Probing, I think what I need to do is change how the "index" step size is calculated but that's what I don't understand how I should do it. I've seen various pieces of code, and all of them are somewhat different.

Also, I've seen some implementations of Quadratic Probing where the hash function is changed to accommodated that (but not all of them). Is that change really needed or can I avoid modifying the hash function and still use Quadratic Probing?

EDIT:
After reading everything pointed out by Eli Bendersky below I think I got the general idea. Here's part of the code at http://eternallyconfuzzled.com/tuts/datastructures/jsw_tut_hashtable.aspx:

15   for ( step = 1; table->table[h] != EMPTY; step++ ) {
16     if ( compare ( key, table->table[h] ) == 0 )
17       return 1;
18 
19     /* Move forward by quadratically, wrap if necessary */
20     h = ( h + ( step * step - step ) / 2 ) % table->size;
21   }

There's 2 things I don't get... They say that quadratic probing is usually done using c(i)=i^2. However, in the code above, it's doing something more like c(i)=(i^2-i)/2

I was ready to implement this on my code but I would simply do:

index = (index + (index^index)) % table_size;

...and not:

index = (index + (index^index - index)/2) % table_size;

If anything, I would do:

index = (index + (index^index)/2) % table_size;

...cause I've seen other code examples diving by two. Although I don't understand why...

1) Why is it subtracting the step?
2) Why is it diving it by 2?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

九公里浅绿 2024-08-29 19:51:16

如果表大小是 2 的幂,则有一种特别简单而优雅的方法来实现二次探测:

step = 1;

do {
    if(/* CHECK IF IT'S THE ELEMENT WE WANT */) {
        // FOUND ELEMENT

        return;
    } else {
        index = (index + step) % table_size;
        step++;
    }
} while(/* LOOP UNTIL IT'S NECESSARY */);

不是查看原始索引的偏移量 0、1、2、3、4...,而是查看偏移量 0 , 1, 3, 6, 10...(第 ith 探头的偏移量为 (i*(i+1))/2,即它是二次的)。

这保证会命中哈希表中的每个位置(因此,如果有的话,您一定会找到一个空桶)假设表大小是 2 的幂。


这是证明的草图:

  1. 给定一个 n 的表大小,我们想要表明我们将得到 n 个不同的值 (i*(i+1))/2 (mod n),其中 i = 0 ... n-1。
  2. 我们可以用反证法来证明这一点。假设有少于 n 个不同的值:如果是这样,则 i 在 [0, n-1] 范围内必须至少有两个不同的整数值,使得 (i*(i+1))/2 (mod n )是一样的。称这些为 p 和 q,其中 p < q.
  3. 即(p * (p+1)) / 2 = (q * (q+1)) / 2 (mod n)
  4. => (p2 + p) / 2 = (q2 + q) / 2 (mod n)
  5. => p2 + p = q2 + q (mod 2n)
  6. => q2 - p2 + q - p = 0 (mod 2n)
  7. 因式分解 => (q - p) (p + q + 1) = 0 (mod 2n)
  8. (q - p) = 0 是简单的情况 p = q。
  9. (p + q + 1) = 0 (mod 2n) 是不可能的:我们的 p 和 q 值在 [0, n-1] 范围内,并且 q > n p,因此 (p + q + 1) 必须在 [2, 2n-2] 范围内。
  10. 当我们对 2n 求模时,我们还必须处理两个因子都非零但相乘得到 0 (mod 2n) 的棘手情况:
    • 观察两个因子 (q - p) 和 (p + q + 1) 之间的差是 (2p + 1),这是一个奇数 - 因此其中一个因子必须是偶数,另一个必须是偶数保持古怪。
    • (q - p) (p + q + 1) = 0 (mod 2n) => (q - p) (p + q + 1) 可被 2n 整除。 如果 n(因此 2n)是 2 的幂,则要求偶因数是 2n 的倍数(因为 2n 的所有素因数都是 2,而 2n 的素因数都不为 2)我们的奇怪因素是)。
    • 但是 (q - p) 的最大值为 n-1,而 (p + q + 1) 的最大值为 2n-2(如步骤 9 所示),因此两者都不能是 2n 的倍数.
    • 所以这种情况也是不可能的。
  11. 因此,(步骤 2 中)少于 n 个不同值的假设必定是错误的。

(如果表大小不是 2 的幂,则会在第 10 步失败。)

There is a particularly simple and elegant way to implement quadratic probing if your table size is a power of 2:

step = 1;

do {
    if(/* CHECK IF IT'S THE ELEMENT WE WANT */) {
        // FOUND ELEMENT

        return;
    } else {
        index = (index + step) % table_size;
        step++;
    }
} while(/* LOOP UNTIL IT'S NECESSARY */);

Instead of looking at offsets 0, 1, 2, 3, 4... from the original index, this will look at offsets 0, 1, 3, 6, 10... (the ith probe is at offset (i*(i+1))/2, i.e. it's quadratic).

This is guaranteed to hit every position in the hash table (so you are guaranteed to find an empty bucket if there is one) provided the table size is a power of 2.


Here is a sketch of a proof:

  1. Given a table size of n, we want to show that we will get n distinct values of (i*(i+1))/2 (mod n) with i = 0 ... n-1.
  2. We can prove this by contradiction. Assume that there are fewer than n distinct values: if so, there must be at least two distinct integer values for i in the range [0, n-1] such that (i*(i+1))/2 (mod n) is the same. Call these p and q, where p < q.
  3. i.e. (p * (p+1)) / 2 = (q * (q+1)) / 2 (mod n)
  4. => (p2 + p) / 2 = (q2 + q) / 2 (mod n)
  5. => p2 + p = q2 + q (mod 2n)
  6. => q2 - p2 + q - p = 0 (mod 2n)
  7. Factorise => (q - p) (p + q + 1) = 0 (mod 2n)
  8. (q - p) = 0 is the trivial case p = q.
  9. (p + q + 1) = 0 (mod 2n) is impossible: our values of p and q are in the range [0, n-1], and q > p, so (p + q + 1) must be in the range [2, 2n-2].
  10. As we are working modulo 2n, we must also deal with the tricky case where both factors are non-zero, but multiply to give 0 (mod 2n):
    • Observe that the difference between the two factors (q - p) and (p + q + 1) is (2p + 1), which is an odd number - so one of the factors must be even, and the other must be odd.
    • (q - p) (p + q + 1) = 0 (mod 2n) => (q - p) (p + q + 1) is divisible by 2n. If n (and hence 2n) is a power of 2, this requires the even factor to be a multiple of 2n (because all of the prime factors of 2n are 2, whereas none of the prime factors of our odd factor are).
    • But (q - p) has a maximum value of n-1, and (p + q + 1) has a maximum value of 2n-2 (as seen in step 9), so neither can be a multiple of 2n.
    • So this case is impossible as well.
  11. Therefore the assumption that there are fewer than n distinct values (in step 2) must be false.

(If the table size is not a power of 2, this falls apart at step 10.)

林空鹿饮溪 2024-08-29 19:51:16

您不必修改二次探测的哈希函数。二次探测的最简单形式实际上只是将后续平方添加到计算位置,而不是线性 1、2、3。

有一个很好的资源 此处。以下内容摘自那里。这是使用简单多项式 c(i) = i^2 时最简单的二次探测形式:

alt text

在更一般的情况下,公式为:

alt text

您可以选择常量。

但请记住,二次探测仅在某些情况下有用。正如维基百科条目所述:

二次探测提供良好的记忆
缓存,因为它保留了一些
参考地点;然而,线性
探测具有更大的局部性,并且,
因此,更好的缓存性能。
二次探测更好地避免了
可能发生的聚类问题
线性探测,尽管它不是
免疫。


编辑:像计算机科学中的许多事情一样,二次探测的精确常数和多项式是启发式的。是的,最简单的形式是i^2,但您可以选择任何其他多项式。维基百科给出了 h(k,i) = (h(k) + i + i^2)(mod m) 的示例。

因此,很难回答你的“为什么”问题。这里唯一的“为什么”是为什么你需要二次探测?对于其他形式的探测和获取聚簇表有问题吗?或者只是一项家庭作业,或者自学?

请记住,到目前为止,哈希表最常见的冲突解决技术是链接或线性探测。二次探测是一种适用于特殊情况的启发式选项,除非您非常清楚自己在做什么,否则我不建议使用它。

You don't have to modify the hash function for quadratic probing. The simplest form of quadratic probing is really just adding consequent squares to the calculated position instead of linear 1, 2, 3.

There's a good resource here. The following is taken from there. This is the simplest form of quadratic probing when the simple polynomial c(i) = i^2 is used:

alt text

In the more general case the formula is:

alt text

And you can pick your constants.

Keep, in mind, however, that quadratic probing is useful only in certain cases. As the Wikipedia entry states:

Quadratic probing provides good memory
caching because it preserves some
locality of reference; however, linear
probing has greater locality and,
thus, better cache performance.
Quadratic probing better avoids the
clustering problem that can occur with
linear probing, although it is not
immune.


EDIT: Like many things in computer science, the exact constants and polynomials of quadratic probing are heuristic. Yes, the simplest form is i^2, but you may choose any other polynomial. Wikipedia gives the example with h(k,i) = (h(k) + i + i^2)(mod m).

Therefore, it is difficult to answer your "why" question. The only "why" here is why do you need quadratic probing at all? Having problems with other forms of probing and getting a clustered table? Or is it just a homework assignment, or self-learning?

Keep in mind that by far the most common collision resolution technique for hash tables is either chaining or linear probing. Quadratic probing is a heuristic option available for special cases, and unless you know what you're doing very well, I wouldn't recommend using it.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文