What is a sensible prime for hashcode calculation?

Posted 2024-08-13 16:41:59

Eclipse 3.5 has a very nice feature to generate Java hashCode() functions. It would generate, for example (slightly shortened):

class HashTest {
    int i;
    int j;        
    public int hashCode() {
        final int prime = 31;
        int result = prime + i;
        result = prime * result + j;
        return result;
    }
}

(If you have more attributes in the class, result = prime * result + attribute.hashCode(); is repeated for each additional attribute. For ints .hashCode() can be omitted.)

This seems fine except for the choice of 31 as the prime. It is probably taken from the hashCode implementation of Java's String, where it was chosen for performance reasons that have been obsolete since hardware multipliers became common. Here you get many hashcode collisions for small values of i and j: for example, (0,0) and (-1,31) have the same hash value. I think that is a Bad Thing(TM), since small values occur often. For String.hashCode you'll also find many short strings with the same hashcode, for instance "Ca" and "DB". If you take a large prime and choose it well, this problem disappears.
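The two collisions mentioned above can be checked directly. The following is only a minimal sketch: the class name CollisionDemo and the helper hash(...) are hypothetical names introduced for illustration, and the helper simply restates the generated template with the multiplier passed as a parameter.

```java
// Illustrative sketch: CollisionDemo and hash(...) are hypothetical names,
// not something Eclipse generates.
final class CollisionDemo {
    // Same shape as the generated hashCode(), with the multiplier as a parameter.
    static int hash(int multiplier, int i, int j) {
        int result = multiplier + i;
        result = multiplier * result + j;
        return result;
    }

    public static void main(String[] args) {
        // (0,0) and (-1,31) both hash to 961 when the multiplier is 31.
        System.out.println(hash(31, 0, 0));    // 961
        System.out.println(hash(31, -1, 31));  // 961

        // "Ca" and "DB" both hash to 2174 with String.hashCode().
        System.out.println("Ca".hashCode());   // 2174
        System.out.println("DB".hashCode());   // 2174
    }
}
```

Both pairs differ only in small values, which is exactly the case the question is worried about.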

So my question: what is a good prime to choose? What criteria do you apply to find it?

This is meant as a general question, so I do not want to give a range for i and j. But I suppose that in most applications relatively small values occur more often than large values. (If you have large values, the choice of the prime is probably unimportant.) It might not make much of a difference, but a better choice is an easy and obvious way to improve this, so why not do it? Apache Commons Lang's HashCodeBuilder also suggests curiously small values.

(Clarification: this is not a duplicate of "Why does Java's hashCode() in String use 31 as a multiplier?", since my question is not concerned with the history of the 31 in the JDK, but with what would be a better value in new code using the same basic template. None of the answers there try to answer that.)

Comments (6)

凉世弥音 2024-08-20 16:41:59

I recommend using 92821. Here's why.

To give a meaningful answer to this you have to know something about the possible values of i and j. The only thing I can think of in general is that in many cases small values will be more common than large values. (The odds of 15 appearing as a value in your program are much better than, say, 438281923.) So it seems a good idea to make the smallest hashcode collision as large as possible by choosing an appropriate prime. For 31 this is rather bad: already for i=-1 and j=31 you have the same hash value as for i=0 and j=0.

Since this is interesting, I've written a little program that searched the whole int range for the best prime in this sense. That is, for each prime I searched for the minimum value of Math.abs(i) + Math.abs(j) over all values of i,j that have the same hashcode as 0,0, and then took the prime where this minimum value is as large as possible.

Drumroll: the best prime in this sense is 486187739 (with the smallest collision being i=-25486, j=67194). Nearly as good and much easier to remember is 92821 with the smallest collision being i=-46272 and j=46016.
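The search described above can be sketched for a single multiplier as follows. This is not the answerer's program, which was not posted: the class and method names are hypothetical, and the sketch only scans a bounded range of i (the parameter maxAbsI, an assumption made here) rather than the whole int range. It relies on the fact that, for the template hash = m * (m + i) + j, a pair (i, j) collides with (0,0) exactly when m * i + j wraps to 0 in 32-bit arithmetic, so each i has exactly one colliding j.

```java
// Illustrative sketch: SmallestCollision and smallestCollision(...) are
// hypothetical names; the bounded scan over i is a simplification.
final class SmallestCollision {
    /**
     * Returns the smallest |i| + |j| over all pairs with 1 <= |i| <= maxAbsI
     * that hash to the same value as (0,0) for the given multiplier.
     */
    static long smallestCollision(int multiplier, int maxAbsI) {
        long best = Long.MAX_VALUE;
        for (int i = -maxAbsI; i <= maxAbsI; i++) {
            if (i == 0) {
                continue; // i = 0 only collides with j = 0, i.e. the same pair
            }
            int j = -(multiplier * i); // wraps modulo 2^32, like the hash itself
            long size = Math.abs((long) i) + Math.abs((long) j);
            if (size < best) {
                best = size;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // For 31 the smallest collision is (-1, 31) or (1, -31), so this prints 32.
        System.out.println(smallestCollision(31, 100_000));
        // For 92821 the answer above reports (-46272, 46016), i.e. 92288.
        System.out.println(smallestCollision(92821, 100_000));
    }
}
```

Running the same function over a list of candidate primes and keeping the one with the largest result would reproduce the selection criterion, under the bounded-range assumption.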

If you give "small" another meaning and want the minimum of Math.sqrt(i*i+j*j) for a collision to be as large as possible, the results are a little different: the best would be 1322837333 with i=-6815 and j=70091, but my favourite 92821 (smallest collision -46272,46016) is again almost as good as the best value.

I do acknowledge that it is quite debatable whether these calculations make much sense in practice. But I do think that taking 92821 as the prime makes much more sense than 31, unless you have good reasons not to.

冷月断魂刀 2024-08-20 16:41:59

Actually, if you take a prime so large that it comes close to INT_MAX, you have the same problem because of modular arithmetic. If you expect to hash mostly strings of length 2, perhaps a prime near the square root of INT_MAX would be best; if the strings you hash are longer, it doesn't matter so much, and collisions are unavoidable anyway...
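As a concrete instance of this point, consider the two-field template from the question with Integer.MAX_VALUE (2^31 - 1, which happens to be prime) as the multiplier. Since 2 * (2^31 - 1) + 2 = 2^32, the pair (2,2) wraps around to the same hash as (0,0). The class and helper names below are hypothetical and only restate that template.

```java
// Illustrative sketch: LargePrimeDemo and hash(...) are hypothetical names.
final class LargePrimeDemo {
    // Template from the question, with the multiplier as a parameter.
    static int hash(int multiplier, int i, int j) {
        int result = multiplier + i;
        result = multiplier * result + j;
        return result;
    }

    public static void main(String[] args) {
        int p = Integer.MAX_VALUE; // 2^31 - 1
        // 2 * p + 2 == 2^32, which is 0 in 32-bit arithmetic,
        // so (2,2) and (0,0) get the same hash.
        System.out.println(hash(p, 0, 0) == hash(p, 2, 2)); // true
    }
}
```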

残龙傲雪 2024-08-20 16:41:59

Collisions may not be such a big issue... The primary goal of the hash is to avoid using equals for 1:1 comparisons.
If you have an implementation where equals is "generally" extremely cheap for objects whose hashes have collided, then this is not an issue (at all).

In the end, the best way of hashing depends on what you are comparing. In the case of an int pair (as in your example), using basic bitwise operators (such as & or ^) could be sufficient.
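A minimal sketch of the XOR variant, assuming the same two int fields as in the question (the class and method names are hypothetical). It also shows two properties worth knowing before relying on it: XOR is symmetric, so swapped pairs collide, and any pair with i equal to j hashes to 0.

```java
// Illustrative sketch: XorHashDemo and xorHash(...) are hypothetical names.
final class XorHashDemo {
    // Combine the two fields with a single bitwise XOR.
    static int xorHash(int i, int j) {
        return i ^ j;
    }

    public static void main(String[] args) {
        // XOR is symmetric, so (1,2) and (2,1) get the same hash.
        System.out.println(xorHash(1, 2) == xorHash(2, 1)); // true
        // Any pair with equal fields hashes to 0.
        System.out.println(xorHash(5, 5));                  // 0
        System.out.println(xorHash(0, 0));                  // 0
    }
}
```

Whether these collisions matter depends, as the comment says, on how cheap equals is and on which pairs actually occur in the data.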

难以启齿的温柔 2024-08-20 16:41:59

You need to define your range for i and j. You could use a prime number for both.

public int hashCode() {
   http://primes.utm.edu/curios/ ;)
   return 97654321 * i ^ 12356789 * j;
}

单挑你×的.吻 2024-08-20 16:41:59

I'd choose 7243. Large enough to avoid collisions with small numbers. Doesn't overflow to small numbers quickly.

冷︶言冷语的世界 2024-08-20 16:41:59

I just want to point out that a hashcode has nothing to do with primes as such.
In the JDK implementation:

for (int i = 0; i < value.length; i++) {
    h = 31 * h + val[i];
}

I found that if you replace 31 with 27, the results are very similar.
