Java 中多重集的高效哈希码

发布于 2024-12-05 09:13:19 字数 900 浏览 4 评论 0原文

我定义了 java.util.Collection 的子接口，它实际上是一个多重集（又名包）。它可能不包含 null 元素，尽管这对我的问题并不重要。接口定义的 equals 契约正如您所期望的那样：

obj instanceof MyInterface
obj 包含与 this 相同的元素（通过 equals)
obj 对于每个元素包含相同数量的重复项，
元素的顺序被忽略

现在我想编写我的 hashCode 方法。我最初的想法是：

int hashCode = 1;
for( Object o : this ) {
    hashCode += o.hashCode();
}

但是，我注意到 com.google.common.collect.Multiset （来自 Guava）将哈希代码定义如下：

int hashCode = 0;
for( Object o : elementSet() ) {
    hashCode += ((o == null) ? 0 : o.hashCode()) ^ count(o);
}

空的 Multiset 会有哈希代码，这让我感到奇怪0，但更重要的是，我不明白 ^ count(o) 相对于简单地添加每个重复项的哈希码的好处。也许这是关于不要多次计算相同的哈希码，但是为什么不使用 * count(o) 呢？

我的问题：什么是有效的哈希码计算？就我而言，元素的计数并不能保证以低廉的价格获得。

原文

I have defined a subinterface of java.util.Collection that effectively is a multiset (aka bag). It may not contain null elements, although that's not crucial to my question. The equals contract defined by the interface is as you would expect:

obj instanceof MyInterface
obj contains the same elements as this (by equals)
obj contains the same number of duplicates for each element
order of elements is disregarded

Now I want to write my hashCode method. My initial idea was:

int hashCode = 1;
for( Object o : this ) {
    hashCode += o.hashCode();
}

However, I noticed that com.google.common.collect.Multiset (from Guava) defines the hash code as follows:

int hashCode = 0;
for( Object o : elementSet() ) {
    hashCode += ((o == null) ? 0 : o.hashCode()) ^ count(o);
}

It strikes me as odd that an empty Multiset would have hash code 0, but more importantly I don't understand the benefit of ^ count(o) over simply adding up the hash codes of every duplicate. Maybe it's about not calculating the same hash code more than once, but then why not * count(o)?

My question: what would be an efficient hash code calculation? In my case the count for an element is not guaranteed to be cheap to obtain.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

记忆里有你的影子 2024-12-12 09:13:19

更新

例如，我们有一个数组，我们希望将其视为多重集。

因此，您必须在所有条目出现时对其进行处理，不能使用 count，也不能假设条目按已知顺序出现。

我考虑的一般功能是

int hashCode() {
    int x = INITIAL_VALUE;
    for (Object o : this) {
        x = f(x, o==null ? NULL_HASH : g(o.hashCode()));
    }
    return h(x);
}

一些观察：

正如其他答案中已经指出的， INITIAL_VALUE 并不重要。
我不会选择 NULL_HASH=0，因为这会忽略空值。
如果您希望成员的哈希值在较小的范围内（例如，如果它们是单个字符，则可能会发生这种情况），可以使用函数g。
函数h可用于改善结果，但这并不是很重要，因为这种情况已经发生在例如HashMap.hash(int)中。
函数 f 是最重要的函数，不幸的是，它非常有限，因为它显然必须既具有结合性又具有交换性。
函数 f 在两个参数中都应该是双射的，否则会产生不必要的冲突。

在任何情况下，我都不会推荐 f(x, y) = x^y 因为它会使一个元素的两次出现被抵消。使用加法效果更好。像

f(x, y) = x + (2*A*x + 1) * y

其中 A 是常量这样的东西满足上述所有条件。这可能是值得的。
对于 A=0 它退化为加法，使用偶数 A 不好，因为它会将 x*y 的位移出。
使用 A=1 就可以了，并且可以使用 x86 架构上的单个指令来计算表达式 2*x+1。
如果成员的哈希值分布不均，使用较大的奇数 A 可能效果更好。

如果您选择一个不平凡的 hashCode() 您应该测试它是否正常工作。您应该测量程序的性能，也许您会发现简单的加法就足够了。否则，我会选择 NULL_HASH=1、g=h=identity 和 A=1。

我的旧答案

可能是出于效率原因。对于某些实现来说，调用 count 可能会很昂贵，但可以使用 entrySet 来代替。但它可能会更贵，我无法判断。

我为 Guava 的 hashCode 和 Rinke 以及我自己的建议做了一个简单的碰撞基准测试：

enum HashCodeMethod {
    GUAVA {
        @Override
        public int hashCode(Multiset<?> multiset) {
            return multiset.hashCode();
        }
    },
    RINKE {
        @Override
        public int hashCode(Multiset<?> multiset) {
            int result = 0;
            for (final Object o : multiset.elementSet()) {
                result += (o==null ? 0 : o.hashCode()) * multiset.count(o);
            }
            return result;
        }
    },
    MAAARTIN {
        @Override
        public int hashCode(Multiset<?> multiset) {
            int result = 0;
            for (final Multiset.Entry<?> e : multiset.entrySet()) {
                result += (e.getElement()==null ? 0 : e.getElement().hashCode()) * (2*e.getCount()+123);
            }
            return result;
        }
    }
    ;
    public abstract int hashCode(Multiset<?> multiset);
}

碰撞计数代码如下：

private void countCollisions() throws Exception {
    final String letters1 = "abcdefgh";
    final String letters2 = "ABCDEFGH";
    final int total = letters1.length() * letters2.length();
    for (final HashCodeMethod hcm : HashCodeMethod.values()) {
        final Multiset<Integer> histogram = HashMultiset.create();
        for (final String s1 : Splitter.fixedLength(1).split(letters1)) {
            for (final String s2 : Splitter.fixedLength(1).split(letters2)) {
                histogram.add(hcm.hashCode(ImmutableMultiset.of(s1, s2, s2)));
            }
        }
        System.out.println("Collisions " + hcm + ": " + (total-histogram.elementSet().size()));
    }
}

并打印出来

Collisions GUAVA: 45
Collisions RINKE: 42
Collisions MAAARTIN: 0

因此，在这个简单的示例中，Guava 的 hashCode 表现非常糟糕（63 种可能的碰撞中，有 45 次碰撞）。然而，我并不认为我的例子与现实生活有很大关系。

Update

Let's say, for example, in the case where we'd have an array that we want to treat as a multiset.

So you have to process all the entries as they come, you can't use count, and can't assume the entries come in a known order.

The general function I'd consider is

int hashCode() {
    int x = INITIAL_VALUE;
    for (Object o : this) {
        x = f(x, o==null ? NULL_HASH : g(o.hashCode()));
    }
    return h(x);
}

Some observations:

As already stated in the other answers, the INITIAL_VALUE doesn't matter much.
I wouldn't go for NULL_HASH=0 since this would ignore null values.
The function g can be used in case you expect the hashes of the members to be in a small range (which can happen in case they are e.g., single characters).
The function h can be used to improve the result, which is not very important since this already happens e.g. in HashMap.hash(int).
The function f is the most important one, unfortunately, it's quite limited as it obviously must be both associative and commutative.
The function f should be bijective in both arguments, otherwise you'd generate unnecessary collisions.

In no case I'd recommend f(x, y) = x^y since it'd make two occurrences of an element to cancel out. Using addition is better. Something like

f(x, y) = x + (2*A*x + 1) * y

where A is a constant satisfies all the above conditions. It may be worth it.
For A=0 it degenerates to addition, using an even A is not good as it shift bits of x*y out.
Using A=1 is fine, and the expression 2*x+1 can be computed using a single instruction on the x86 architecture.
Using a larger odd A might work better in case the hashes of the members are badly distributed.

In case you go for a non-trivial hashCode() you should test if it works correctly. You should measure the performance of your program, maybe you'll find simple addition sufficient. Otherwise, I'd for for NULL_HASH=1, g=h=identity, and A=1.

My old answer

It may be for efficiency reasons. Calling count may be expensive for some implementations, but entrySet may be used instead. Still it might be more costly, I can't tell.

I did a simple collision benchmark for Guava's hashCode and Rinke's and my own proposals:

enum HashCodeMethod {
    GUAVA {
        @Override
        public int hashCode(Multiset<?> multiset) {
            return multiset.hashCode();
        }
    },
    RINKE {
        @Override
        public int hashCode(Multiset<?> multiset) {
            int result = 0;
            for (final Object o : multiset.elementSet()) {
                result += (o==null ? 0 : o.hashCode()) * multiset.count(o);
            }
            return result;
        }
    },
    MAAARTIN {
        @Override
        public int hashCode(Multiset<?> multiset) {
            int result = 0;
            for (final Multiset.Entry<?> e : multiset.entrySet()) {
                result += (e.getElement()==null ? 0 : e.getElement().hashCode()) * (2*e.getCount()+123);
            }
            return result;
        }
    }
    ;
    public abstract int hashCode(Multiset<?> multiset);
}

The collision counting code went as follows:

private void countCollisions() throws Exception {
    final String letters1 = "abcdefgh";
    final String letters2 = "ABCDEFGH";
    final int total = letters1.length() * letters2.length();
    for (final HashCodeMethod hcm : HashCodeMethod.values()) {
        final Multiset<Integer> histogram = HashMultiset.create();
        for (final String s1 : Splitter.fixedLength(1).split(letters1)) {
            for (final String s2 : Splitter.fixedLength(1).split(letters2)) {
                histogram.add(hcm.hashCode(ImmutableMultiset.of(s1, s2, s2)));
            }
        }
        System.out.println("Collisions " + hcm + ": " + (total-histogram.elementSet().size()));
    }
}

and printed

Collisions GUAVA: 45
Collisions RINKE: 42
Collisions MAAARTIN: 0

So in this simple example Guava's hashCode performed really bad (45 collisions out of 63 possible). However, I don't claim my example is of much relevance for real life.

回复收藏 0 原文

何以心动 2024-12-12 09:13:19

如果计数很昂贵，就不要这样做。你知道它太贵了吗？您始终可以对多个实现进行编码，并使用您希望代表应用程序的数据来分析其性能。然后你就会知道答案而不是猜测。

至于为什么使用 XOR，请参阅“使用 XOR 计算聚合哈希码”。

回复收藏 0 原文

恬淡成诗 2024-12-12 09:13:19

令我感到奇怪的是，空的 Multiset 的哈希码为 0

为什么？所有空集合的哈希码可能都是 0。即使不是，它也必须是一个固定值（因为所有空集合都是相等的），那么 0 有什么问题呢？

什么是有效的哈希码计算？

你的效率更高（这意味着计算速度更快），而且在有效性方面也没有那么糟糕（这意味着产生效果良好的结果）。如果我理解正确的话，它会将所有元素的哈希码相加（重复元素被添加两次）。这正是常规 Set 的作用，因此如果没有重复项，您将获得与 Set 相同的 hashCode，这可能是一个优势（如果您将空集修复为具有 hashCode 0，而不是 1）。

我想谷歌的版本有点复杂，以避免一些其他频繁的冲突。当然，它可能会导致其他一些被认为不太频繁发生的碰撞。

特别是，使用 XOR 将 hashCode 分布在整个可用范围内，即使单个输入 hashCode 不会（例如，对于有限范围内的整数而言，它们不会这样做，这是一种常见的用例）。

考虑集合 [1,2,3] 的 hashCode。为 6。可能与相似的 Set 发生冲突，例如 [ 6]、[ 4, 2] 、[5, 1]。在其中添加一些 XOR 会有所帮助。如果有必要并且值得，那么额外的成本是您必须做出的权衡。

回复收藏 0 原文

夜访吸血鬼 2024-12-12 09:13:19

我观察到 java.util.Map 使用或多或少相同的逻辑： java.util.Map.hashCode() 被指定返回 map.entrySet().hashCode()，而 Map.Entry 指定其 hashCode() 是条目.getKey().hashCode() ^ 条目.getValue().hashCode()。接受从 Multiset 到 Map 的类比，这正是您所期望的 hashCode 实现。

回复收藏 0 原文

~没有更多了~