Detecting collisions in a Java HashMap
Is there a way to detect collisions in a Java HashMap? Can anyone point out some situations where a lot of collisions can take place? Of course, if you override an object's hashCode() and simply return a constant value, collisions are sure to occur. I'm not talking about that. I want to know in what situations, other than the one previously mentioned, a huge number of collisions can occur without modifying the default hashCode() implementation.
I have created a project to benchmark this sort of thing: http://code.google.com/p/hashingbench/ (for hashtables with chaining, open addressing, and Bloom filters).
Apart from the hashCode() of the key, you need to know the "smearing" (or "scrambling", as I call it in that project) function of the hashtable. From this list, HashMap's smearing function is the equivalent of:
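The snippet the answer refers to did not survive the page extraction. For reference, this is, to the best of my knowledge, the supplemental hash function from the JDK 6 HashMap sources, which is the smearing function the answer describes:

```java
public class Jdk6Smear {
    // Supplemental hash function as found in JDK 6's HashMap: it defends
    // against poor-quality hashCode()s by XOR-ing higher bits downward,
    // since the bucket index only looks at the low-order bits.
    static int smear(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    public static void main(String[] args) {
        // 1024 and 2048 share the same low 4 bits (all zero) before smearing,
        // but get different low bits afterwards.
        System.out.println(Jdk6Smear.smear(1 << 10) & 15);
        System.out.println(Jdk6Smear.smear(2 << 10) & 15);
    }
}
```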
So for a collision to occur in a HashMap, the necessary and sufficient condition is the following:

scramble(k1.hashCode()) == scramble(k2.hashCode())

This is always true if k1.hashCode() == k2.hashCode() (otherwise, the smearing/scrambling function wouldn't be a function), so that's a sufficient, but not necessary, condition for a collision to occur.

Edit: Actually, the above necessary and sufficient condition should have been

compress(scramble(k1.hashCode())) == compress(scramble(k2.hashCode()))

The compress function takes an integer and maps it to {0, ..., N-1}, where N is the number of buckets, so it basically selects a bucket. Usually this is simply implemented as hash % N, or, when the hashtable size is a power of two (and that's actually a motivation for having power-of-two hashtable sizes), as the faster hash & (N - 1). ("Compress" is the name Goodrich and Tamassia used to describe this step in Data Structures and Algorithms in Java.) Thanks go to ILMTitan for spotting my sloppiness.

Other hashtable implementations (ConcurrentHashMap, IdentityHashMap, etc.) have other needs and use different smearing/scrambling functions, so you need to know which one you're talking about.
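A quick sketch of the compress step under both strategies (the function names here are my own, not from any library). For a power-of-two N, the bit mask picks the same bucket as the non-negative modulus:

```java
public class CompressDemo {
    // compress: map an arbitrary int hash to a bucket index in {0, ..., n-1}.
    static int compressMod(int hash, int n) {
        return Math.floorMod(hash, n); // works for any table size n > 0, even negative hashes
    }

    static int compressMask(int hash, int n) {
        return hash & (n - 1); // valid only when n is a power of two
    }

    public static void main(String[] args) {
        int n = 16; // power-of-two table size
        for (int hash : new int[] {42, -42, Integer.MIN_VALUE, Integer.MAX_VALUE}) {
            // Both strategies pick the same bucket when n is a power of two.
            System.out.println(compressMod(hash, n) == compressMask(hash, n));
        }
    }
}
```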
(For example, HashMap's smearing function was put into place because people were using HashMap with objects having the worst kind of hashCode() for the old, power-of-two-table implementation of HashMap without smearing: objects that differ a little, or not at all, in the low-order bits that were used to select a bucket, e.g. new Integer(1 * 1024), new Integer(2 * 1024), etc. As you can see, HashMap's smearing function tries its best to have all bits affect the low-order bits.) All of them, though, are meant to work well in common cases; a particular case is objects that inherit the system's hashCode().
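To illustrate the pathological case just described, this sketch (my own, using a hypothetical 16-bucket power-of-two table and a JDK 6-style smearing step) shows that multiples of 1024 all select bucket 0 without smearing, but spread out once smeared:

```java
public class LowBitsDemo {
    static int smear(int h) { // JDK 6-style supplemental hash
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    public static void main(String[] args) {
        int mask = 16 - 1; // 16-bucket power-of-two table
        for (int i = 1; i <= 4; i++) {
            int h = Integer.valueOf(i * 1024).hashCode(); // an Integer's hashCode() is its value
            // Without smearing, every one of these lands in bucket 0
            // (all low-order bits are zero); after smearing they spread
            // across several buckets.
            System.out.println((h & mask) + " -> " + (smear(h) & mask));
        }
    }
}
```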
PS: Actually, the absolutely ugly case which prompted the implementors to insert the smearing function is the hashCode() of Float/Double, and the usage of values such as 1.0, 2.0, 3.0, 4.0, ... as keys; all of them have the same (zero) low-order bits. This is the related old bug report: https://bugs.java.com/bugdatabase/view_bug?bug_id=4669519
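The Float case is easy to verify: Float.hashCode() is defined in terms of Float.floatToIntBits, and for small whole numbers the mantissa (and hence the low-order bits of the hash) is all zeros:

```java
public class FloatBitsDemo {
    public static void main(String[] args) {
        int mask = 16 - 1; // low-order bits, as used by a 16-bucket table
        for (float f = 1.0f; f <= 4.0f; f += 1.0f) {
            int h = Float.valueOf(f).hashCode(); // equal to Float.floatToIntBits(f)
            // For 1.0f .. 4.0f the low-order bits of the hash are all zero,
            // so without smearing every key maps to bucket 0.
            System.out.println(h & mask);
        }
    }
}
```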
Simple example: hashing a Long. Obviously there are 64 bits of input and only 32 bits of output. The hash of Long is documented to be:

(int) (value ^ (value >>> 32))

i.e. imagine it's two int values stuck next to each other, and XOR them. So all of these will have a hashcode of 0:

0L
(1L << 32) | 1
(2L << 32) | 2
(3L << 32) | 3

etc.
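The colliding values above are easy to check directly (the specific longs are my own picks; by the documented formula, any long whose upper and lower 32 bits are equal hashes to 0):

```java
public class LongCollisions {
    public static void main(String[] args) {
        // Long.hashCode() is documented as (int) (value ^ (value >>> 32)),
        // so a long whose top and bottom halves are equal XORs to 0.
        long[] keys = {0L, (1L << 32) | 1, (2L << 32) | 2, (3L << 32) | 3};
        for (long k : keys) {
            System.out.println(Long.valueOf(k).hashCode()); // prints 0 for each key
        }
    }
}
```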
I don't know whether that counts as a "huge number of collisions" but it's one example where collisions are easy to manufacture.
Obviously, any hash where there are more than 2^32 possible values will have collisions, but in many cases they're harder to produce. For example, while I've certainly seen hash collisions on String using just ASCII values, they're slightly harder to produce than the above.
The other two answers are good IMO, but I just wanted to share that the best way to test how well your hashCode() behaves in a HashMap is to actually generate a big number of objects from your class, put them into the particular HashMap implementation as keys, and test CPU and memory load. 1 or 2 million entries are a good number to measure with, but you get the best results if you test with your anticipated map sizes.

I just looked at a class whose hashing function I doubted. So I decided to fill a HashMap with random objects of that type and test the number of collisions. I tested two hashCode() implementations of the class under investigation. To do that I wrote, in Groovy, the class you see at the bottom, extending the OpenJDK implementation of HashMap to count the number of collisions in the HashMap (see countCollidingEntries()). Note that these are not real collisions of the whole hash, but collisions in the array holding the entries. The array index is calculated as hash & (length-1), which means that the shorter this array is, the more collisions you get. And the size of this array depends on the initialCapacity and loadFactor of the HashMap (it can grow as more data is put() into it).

In the end, though, I decided that looking at these numbers makes little sense. The fact that HashMap is slower with a bad hashCode() method means that by simply benchmarking insertion and retrieval of data from the Map, you can effectively tell which hashCode() implementation is better.
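The Groovy class mentioned above isn't reproduced on this page. As a rough stand-in, here is a plain-Java sketch that counts bucket collisions the same way (hash & (length-1)) without extending HashMap; the smearing step is the JDK 6-style one, so the counts only approximate what any particular JDK actually does:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BucketCollisionCounter {
    static int smear(int h) { // JDK 6-style supplemental hash (an assumption; see above)
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    // Count how many keys share a bucket with at least one other key,
    // for a table with the given power-of-two capacity.
    static int countCollidingEntries(Iterable<?> keys, int capacity) {
        Map<Integer, Integer> perBucket = new HashMap<>();
        for (Object k : keys) {
            int bucket = smear(k.hashCode()) & (capacity - 1);
            perBucket.merge(bucket, 1, Integer::sum);
        }
        int colliding = 0;
        for (int count : perBucket.values()) {
            if (count > 1) colliding += count; // every key in an overfull bucket collides
        }
        return colliding;
    }

    public static void main(String[] args) {
        // All three keys hash to 0 (equal upper/lower halves), so all three collide.
        List<Long> keys = List.of(0L, (1L << 32) | 1, (2L << 32) | 2);
        System.out.println(countCollidingEntries(keys, 16)); // prints 3
    }
}
```

With a map of a few million random instances of your class, comparing this count across the two hashCode() candidates gives a rough picture, though, as noted above, benchmarking put/get times is the more meaningful test.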