Detecting collisions in a Java HashMap

Is there a way to detect collisions in a Java HashMap? Can anyone point out some situations where a lot of collisions can take place? Of course, if you override an object's hashCode() and simply return a constant value, collisions are sure to occur. I'm not talking about that. I want to know in what situations, other than the one previously mentioned, a huge number of collisions can occur without modifying the default hashCode() implementation.

Comments (3)

送舟行 2024-09-20 19:11:35

I have created a project to benchmark this sort of thing: http://code.google.com/p/hashingbench/ (for hash tables with chaining, open addressing, and Bloom filters).

Apart from the hashCode() of the key, you need to know the "smearing" (or "scrambling", as I call it in that project) function of the hashtable. From this list, HashMap's smearing function is the equivalent of:

// HashMap's supplemental hash as found in JDK 6/7
// (later JDKs use a different, simpler spread function)
public int scramble(int h) {
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}

So for a collision to occur in a HashMap, the necessary and sufficient condition is the following: scramble(k1.hashCode()) == scramble(k2.hashCode()). This is always true if k1.hashCode() == k2.hashCode() (otherwise, the smearing/scrambling function wouldn't be a function), so equal hash codes are a sufficient, but not necessary, condition for a collision to occur.

Edit: Actually, the above necessary and sufficient condition should have been compress(scramble(k1.hashCode())) == compress(scramble(k2.hashCode())) - the compress function takes an integer and maps it to {0, ..., N-1}, where N is the number of buckets, so it basically selects a bucket. Usually, this is simply implemented as hash % N, or, when the hashtable size is a power of two (and that's actually a motivation for having power-of-two hashtable sizes), as hash & (N-1), which is faster. ("Compress" is the name Goodrich and Tamassia use for this step in Data Structures and Algorithms in Java.) Thanks go to ILMTitan for spotting my sloppiness.
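
To make the combined condition concrete, here is a minimal sketch of testing whether two keys would land in the same bucket. It assumes the pre-Java-8 scramble() shown above and a power-of-two bucket count n; BucketCollisionCheck, compress, and sameBucket are illustrative names for this sketch, not JDK API:

public class BucketCollisionCheck {

    // The supplemental hash shown above (JDK 6/7 era HashMap).
    static int scramble(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    // compress: map a scrambled hash to a bucket index in {0, ..., n-1};
    // for power-of-two n this is hash & (n - 1).
    static int compress(int hash, int n) {
        return hash & (n - 1);
    }

    // Two keys collide in the table iff they select the same bucket.
    static boolean sameBucket(Object k1, Object k2, int n) {
        return compress(scramble(k1.hashCode()), n)
            == compress(scramble(k2.hashCode()), n);
    }

    public static void main(String[] args) {
        int n = 16; // HashMap's default capacity
        // Both keys have identical (zero) low-order bits; without scrambling
        // they would share bucket 0, but scramble() separates them (prints false).
        System.out.println(sameBucket(1 * 1024, 2 * 1024, n));
    }
}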

Other hashtable implementations (ConcurrentHashMap, IdentityHashMap, etc) have other needs and use another smearing/scrambling function, so you need to know which one you're talking about.

(For example, HashMap's smearing function was put into place because people were using HashMap with objects having the worst kind of hashCode() for the old power-of-two-table implementation of HashMap without smearing - objects that differ a little, or not at all, in the low-order bits that were used to select a bucket - e.g. new Integer(1 * 1024), new Integer(2 * 1024), etc. As you can see, HashMap's smearing function tries its best to have all bits affect the low-order bits.)

All of them, though, are meant to work well in common cases - a particular case is objects that inherit the system's hashCode().

PS: Actually, the truly ugly case that prompted the implementors to insert the smearing function is the hashCode() of Floats/Doubles used as keys with values 1.0, 2.0, 3.0, 4.0 ..., all of which have the same (zero) low-order bits. This is the related old bug report: https://bugs.java.com/bugdatabase/view_bug?bug_id=4669519
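
A quick way to see that case (a small sketch using Float.floatToIntBits, which is what Float.hashCode() is specified to return):

public class FloatLowBits {
    public static void main(String[] args) {
        // For small whole numbers the float significand is all zeros, so
        // the low-order bits - the ones an unsmeared power-of-two table
        // would use to pick a bucket - are identical (all zero).
        for (float f = 1.0f; f <= 4.0f; f += 1.0f) {
            int bits = Float.floatToIntBits(f);
            System.out.printf("%.1f -> 0x%08X, low 8 bits = %d%n", f, bits, bits & 0xFF);
        }
    }
}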

魂ガ小子 2024-09-20 19:11:35

Simple example: hashing a Long. Obviously there are 64 bits of input and only 32 bits of output. The hash of Long is documented to be:

(int)(this.longValue()^(this.longValue()>>>32))

i.e. imagine it's two int values stuck next to each other, and XOR them.

So all of these will have a hashcode of 0:

0
1L | (1L << 32)
2L | (2L << 32)
3L | (3L << 32)

etc

I don't know whether that counts as a "huge number of collisions" but it's one example where collisions are easy to manufacture.
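
Those values are easy to verify (a minimal sketch; Long.hashCode(long), available since Java 8, applies exactly the documented formula above):

public class LongHashCollisions {
    public static void main(String[] args) {
        // Each key has identical upper and lower 32-bit halves, so
        // XOR-ing them yields 0 - every hash code prints as 0.
        long[] keys = { 0L, 1L | (1L << 32), 2L | (2L << 32), 3L | (3L << 32) };
        for (long k : keys) {
            System.out.println(k + " -> " + Long.hashCode(k));
        }
    }
}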

Obviously any hash where there are more than 2³² possible values will have collisions, but in many cases they're harder to produce. For example, while I've certainly seen hash collisions on Strings using just ASCII values, they're slightly harder to produce than the above.
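
For completeness, here is one such ASCII collision: String.hashCode() folds characters as h = 31*h + c, so bumping one character up by 1 while dropping the next down by 31 leaves the hash unchanged - the classic pair being "Aa" and "BB":

public class StringHashCollisions {
    public static void main(String[] args) {
        // 'A'*31 + 'a' == 'B'*31 + 'B' == 2112
        System.out.println("Aa".hashCode()); // 2112
        System.out.println("BB".hashCode()); // 2112
        // Colliding blocks compose, giving arbitrarily many colliding strings:
        System.out.println("AaAa".hashCode() == "BBBB".hashCode()); // true
        System.out.println("AaBB".hashCode() == "BBAa".hashCode()); // true
    }
}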

玩物 2024-09-20 19:11:35

The other two answers are good IMO, but I just wanted to share that the best way to test how well your hashCode() behaves in a HashMap is to actually generate a large number of objects of your class, put them into the particular HashMap implementation as keys, and measure CPU and memory load. One or two million entries are a good number to measure with, but you get the best results if you test with your anticipated map sizes.

I just looked at a class whose hash function I had doubts about. So I decided to fill a HashMap with random objects of that type and count the number of collisions. I tested two hashCode() implementations of the class under investigation. To do that, I wrote in Groovy the class you see at the bottom, which extends the OpenJDK implementation of HashMap to count the number of collisions inside the HashMap (see countCollidingEntries()). Note that these are not true collisions of the whole hash but collisions in the array holding the entries. The array index is calculated as hash & (length-1), which means that the shorter this array is, the more collisions you get. The size of this array depends on the initialCapacity and loadFactor of the HashMap (it can grow as put() adds more data).

In the end, though, I decided that looking at these numbers makes little sense. The fact that a HashMap is slower with a bad hashCode() method means that by simply benchmarking insertion and retrieval of data from the Map you can effectively tell which hashCode() implementation is better.

public class TestHashMap extends HashMap {

   public TestHashMap(int size) {
      super(size);
   }

   public TestHashMap() {
      super();
   }

   // Counts entries that share a bucket with an earlier entry: each
   // bucket chain of k entries contributes k-1 to the count. Reads
   // HashMap's internal "table" array via reflection (pre-Java-8 entry
   // layout; newer JDKs treeify long chains and may restrict this kind
   // of reflective access).
   public int countCollidingEntries() {
      def fs = this.getClass().getSuperclass().getDeclaredFields();
      def table;
      def count = 0;
      for (java.lang.reflect.Field field : fs) {
         if (field.getName() == "table") {
            field.setAccessible(true);
            table = field.get(this); // read the bucket array off this map
            break;
         }
      }
      for (Object e : table) {
         if (e != null) {
            while (e.next != null) { // walk the collision chain
               count++;
               e = e.next;
            }
         }
      }
      return count;
   }
}
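
As a rough sketch of the benchmark-the-map approach described above (not a rigorous benchmark - a real one should use a harness such as JMH to handle JVM warm-up; Key here is a hypothetical stand-in for the class whose hashCode() you want to evaluate):

import java.util.HashMap;
import java.util.Map;

public class HashCodeBenchSketch {

    // Hypothetical key class - substitute the class under investigation.
    static final class Key {
        final int id;
        Key(int id) { this.id = id; }
        @Override public boolean equals(Object o) {
            return o instanceof Key && ((Key) o).id == id;
        }
        @Override public int hashCode() {
            return id; // the implementation being evaluated
        }
    }

    public static void main(String[] args) {
        int n = 2_000_000; // in line with the "one or two million entries" above
        Map<Key, Integer> map = new HashMap<>();

        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++) map.put(new Key(i), i);
        long t1 = System.nanoTime();

        long sum = 0;
        for (int i = 0; i < n; i++) sum += map.get(new Key(i));
        long t2 = System.nanoTime();

        System.out.printf("put: %d ms, get: %d ms (checksum %d)%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, sum);
    }
}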