Efficient hash code for a multiset in Java
I have defined a subinterface of java.util.Collection that effectively is a multiset (aka bag). It may not contain null elements, although that's not crucial to my question. The equals contract defined by the interface is as you would expect:
- obj instanceof MyInterface
- obj contains the same elements as this (by equals)
- obj contains the same number of duplicates for each element
- order of elements is disregarded
Now I want to write my hashCode method. My initial idea was:
int hashCode = 1;
for( Object o : this ) {
hashCode += o.hashCode();
}
However, I noticed that com.google.common.collect.Multiset (from Guava) defines the hash code as follows:
int hashCode = 0;
for( Object o : elementSet() ) {
hashCode += ((o == null) ? 0 : o.hashCode()) ^ count(o);
}
It strikes me as odd that an empty Multiset would have hash code 0, but more importantly I don't understand the benefit of ^ count(o) over simply adding up the hash codes of every duplicate. Maybe it's about not calculating the same hash code more than once, but then why not * count(o)?
My question: what would be an efficient hash code calculation? In my case the count for an element is not guaranteed to be cheap to obtain.
Update
So you have to process all the entries as they come; you can't use count, and you can't assume the entries arrive in a known order. The general function I'd consider is

hash = h(f(...f(f(g(x1), g(x2)), g(x3))..., g(xn)))

where x1 ... xn are the hash codes of the members (NULL_HASH for a null member).
Some observations:
- Don't use NULL_HASH=0, since this would ignore null values.
- g can be used in case you expect the hashes of the members to be in a small range (which can happen if they are, e.g., single characters).
- h can be used to improve the result, which is not very important since this already happens, e.g., in HashMap.hash(int).
- f is the most important one; unfortunately, it's quite limited, as it obviously must be both associative and commutative.
- f should be bijective in both arguments, otherwise you'd generate unnecessary collisions.
- In no case would I recommend f(x, y) = x^y, since it'd make two occurrences of an element cancel out. Using addition is better. Something like f(x, y) = x + y + 2*A*x*y, where A is a constant, satisfies all the above conditions. It may be worth it.
- For A=0 it degenerates to addition; using an even A is not good, as it shifts bits of x*y out.
- Using A=1 is fine, and the expression 2*x+1 can be computed using a single instruction on the x86 architecture.
- Using a larger odd A might work better in case the hashes of the members are badly distributed.
- In case you go for a non-trivial hashCode(), you should test that it works correctly. You should measure the performance of your program; maybe you'll find simple addition sufficient. Otherwise, I'd go for NULL_HASH=1, g=h=identity, and A=1.
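Putting the recommended parameters together (NULL_HASH=1, g=h=identity, A=1), here is a minimal sketch in Java. Note that the combining function f(x, y) = x + y + 2*A*x*y is an assumption chosen to satisfy the conditions listed above (commutative, associative, degenerates to addition for A=0, uses 2*x+1 for A=1, always-odd multiplier so it's bijective in each argument); the answer's original formula is not shown in the text.

```java
import java.util.List;

class MultisetHash {
    static final int NULL_HASH = 1; // NULL_HASH=0 would ignore null elements
    static final int A = 1;         // small odd constant, per the answer

    // f must be commutative and associative so that iteration order does
    // not matter. For A=1 this is x + (2*x + 1) * y; the multiplier
    // 2*A*x + 1 is always odd, so y -> f(x, y) is bijective mod 2^32.
    static int f(int x, int y) {
        return x + y + 2 * A * x * y;
    }

    static int multisetHashCode(Iterable<?> multiset) {
        boolean first = true;
        int hash = 0;
        for (Object o : multiset) {
            int e = (o == null) ? NULL_HASH : o.hashCode(); // g = identity
            hash = first ? e : f(hash, e);
            first = false;
        }
        return hash; // h = identity (no final mixing)
    }

    public static void main(String[] args) {
        // order-independent:
        System.out.println(multisetHashCode(List.of(1, 2, 3))
                == multisetHashCode(List.of(3, 1, 2)));  // true
        // sensitive to multiplicities:
        System.out.println(multisetHashCode(List.of(1, 1, 2))
                != multisetHashCode(List.of(1, 2, 2)));  // true
    }
}
```

Because f is a single fold step, this works in one pass over the elements, with no need for count() or entrySet().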
.My old answer
It may be for efficiency reasons. Calling count may be expensive for some implementations, but entrySet may be used instead. Still, it might be more costly; I can't tell. I did a simple collision benchmark for Guava's hashCode, Rinke's, and my own proposals:
The collision-counting code tallied, for each hash function, how many of the generated multisets ended up with the same hash value, and printed the collision counts.
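The original benchmark code and its printout are not reproduced above; the following is a sketch in the same spirit. The test family (all multisets over the elements 1..6 where each element occurs once or twice, 64 multisets in total) is an assumption, so the resulting figures will differ from those quoted in the answer.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.ToIntFunction;

class CollisionBench {
    // Guava-style hash: sum over distinct elements of (elementHash ^ count).
    // counts[i] is the multiplicity of element i+1; Integer's hash code
    // is the value itself.
    static int guavaHash(int[] counts) {
        int h = 0;
        for (int e = 1; e <= counts.length; e++) {
            if (counts[e - 1] > 0) h += e ^ counts[e - 1];
        }
        return h;
    }

    // Plain addition: the sum of the hash of every single occurrence.
    static int additionHash(int[] counts) {
        int h = 0;
        for (int e = 1; e <= counts.length; e++) h += e * counts[e - 1];
        return h;
    }

    // Hash all 64 multisets over {1,...,6} with per-element multiplicity
    // 1 or 2, and count how many collide (64 minus distinct hash values).
    static int collisions(ToIntFunction<int[]> hashFn) {
        Set<Integer> distinct = new HashSet<>();
        for (int mask = 0; mask < 64; mask++) {
            int[] counts = new int[6];
            for (int e = 0; e < 6; e++) counts[e] = 1 + ((mask >> e) & 1);
            distinct.add(hashFn.applyAsInt(counts));
        }
        return 64 - distinct.size();
    }

    public static void main(String[] args) {
        System.out.println("Guava-style collisions: "
                + collisions(CollisionBench::guavaHash));
        System.out.println("plain-addition collisions: "
                + collisions(CollisionBench::additionHash));
    }
}
```

On this assumed family the Guava-style hash collides more often than plain addition, which matches the answer's qualitative conclusion.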
So in this simple example Guava's hashCode performed really badly (45 collisions out of 63 possible). However, I don't claim my example is of much relevance for real life.
If count is expensive, don't do it. Do you know it's too expensive? You can always code several implementations and profile their performance with data you expect to be representative of your application. Then you'll know the answer instead of guessing.
As for why you use XOR, see 'Calculating Aggregate hashCodes with XOR'.
Why? All empty collections probably have hash code 0. Even if not, it would have to be a fixed value (since all empty collections are equal), so what is wrong with 0?
Yours is more efficient (which just means faster to calculate), and not that bad in terms of effectiveness (which means yielding results that work well), either. If I understand it correctly, it adds up the hash codes of all elements (with duplicate elements being added twice). This is exactly what a regular Set does, so if you have no duplicates, you get the same hashCode as with a Set, which could be an advantage (if you fix the empty set to have hashCode 0, not 1).
Google's version is a little more complicated, I suppose in order to avoid some otherwise-frequent collisions. Of course, it probably causes some other collisions instead, ones that are considered less frequent.
In particular, using XOR spreads the hashCodes all over the available range, even if the individual input hashCodes do not (which they for example do not for Integers from a limited range, which is a frequent use-case).
Consider the hashCode for the Set [1, 2, 3]. It is 6, likely to collide with similar Sets such as [6], [4, 2], [5, 1]. Throwing some XOR in there helps. Whether it is necessary and worth the extra cost is a tradeoff you have to make.
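This collision is easy to verify, since java.util.Set specifies hashCode() as the sum of the element hash codes, and Integer's hash code is the value itself:

```java
import java.util.Set;

class SetHashDemo {
    public static void main(String[] args) {
        // All of these distinct sets hash to 1 + 2 + 3 = 6:
        System.out.println(Set.of(1, 2, 3).hashCode()); // 6
        System.out.println(Set.of(6).hashCode());       // 6
        System.out.println(Set.of(4, 2).hashCode());    // 6
        System.out.println(Set.of(5, 1).hashCode());    // 6
    }
}
```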
I observe that java.util.Map uses more or less the same logic: java.util.Map.hashCode() is specified to return map.entrySet().hashCode(), and Map.Entry specifies that its hashCode() is entry.getKey().hashCode() ^ entry.getValue().hashCode(). Accepting the analogy from Multiset to Map, this is exactly the hashCode implementation you'd expect.
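A quick check of that contract, viewing a multiset as a Map from element to count (the element/count values 7, 8, 2, 1 below are arbitrary example values):

```java
import java.util.Map;

class MapEntryHashDemo {
    public static void main(String[] args) {
        // View a multiset {7 x2, 8 x1} as a Map from element to count.
        // Per the java.util.Map.Entry contract, each entry hashes to
        // key.hashCode() ^ value.hashCode(), and Map.hashCode() sums
        // the entry hashes, the same shape as Guava's Multiset hash.
        Map<Integer, Integer> multisetAsMap = Map.of(7, 2, 8, 1);
        int expected = (7 ^ 2) + (8 ^ 1);
        System.out.println(multisetAsMap.hashCode() == expected); // true
    }
}
```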