What causes the slightly unpredictable ordering of iterator() for java.util.HashSet and HashMap.keySet()?
Six years ago, I burned several days trying to hunt down where my perfectly deterministic framework was responding randomly. After meticulously combing through the entire framework to ensure that it was all using the same instance of Random, I kept chasing the problem by single-stepping the code. It was highly repetitive, iterative, self-calling code. Worse, the damn effect would only show up after a huge number of iterations had completed. After more than 6 hours, I was finally at my wits' end when I discovered a line in the Javadoc for HashSet.iterator() indicating that it doesn't guarantee the order in which it returns elements. I then went through my entire code base and replaced all instances of HashSet with LinkedHashSet. And lo and behold, my framework sprang right to deterministic life! ARGH!
I have now just experienced this same FREAKIN effect again (at least it was only 3 hours this time). For whatever reason, I missed the small detail that HashMap happens to BEHAVE THE SAME WAY for its keySet().
Here's an SO thread on this subject, although the discussion never quite answers my question: Iteration order of HashSet
So, I am curious as to why this might occur. Both times I had a huge single-threaded Java application crawling through exactly the same instantiation/insertion space with exactly the same JVM parameters (multiple runs from the same batch file) on the same computer with almost nothing else running. What could possibly perturb the JVM such that HashSet and HashMap would, after an enormous number of iterations, behave unpredictably from run to run (rather than merely in an unspecified order, which is what the Javadoc warns not to depend upon)?
Any ideas around this from either the source code (the implementation of these classes in java.util) or from your knowledge of the JVM (perhaps some GC effect where internal Java classes get non-zeroed memory when allocating internal memory spaces)?
Short Answer
There's a tradeoff. If you want amortized constant time O(1) access to elements, the techniques to date rely upon a randomized scheme like hashing. If you want ordered access to elements, the best engineering tradeoff gives you only O(ln(n)) performance. For your case, perhaps this doesn't matter, but the difference between constant time and logarithmic time makes a very big difference starting even with relatively small structures.
So yes, you can go look at the code and inspect carefully, but it boils down to a rather practical theoretical fact. Now is a good time to brush the dust off that copy of Cormen (or Googly Bookiness here) that's propping up the drooping corner of your house's foundation and take a look at Chapters 11 (Hash Tables) and 13 (Red-Black Trees). These will fill you in on the JDK's implementation of HashMap and TreeMap, respectively.
Long Answer
You don't want a Map or Set to return ordered lists of keys/members. That's not what they're for. Map and Set structures are unordered, just like the underlying mathematical concepts, and the ordered and unordered implementations provide different performance. The objective of these data structures (as @thejh points out) is efficient amortized insert, contains, and get time, not maintaining ordering. You can look into how a hashed data structure is maintained to know what the tradeoffs are. Take a look at the Wikipedia entries on Hash Functions and Hash Tables (ironically, note that the Wiki entry for "unordered map" redirects to the latter) or a computer science / data structures text.
Remember: Don't depend on properties of ADTs (and specifically collections) such as ordering, immutability, thread safety or anything else unless you look carefully at what the contract is. Note that for Map, the Javadoc says clearly that some map implementations, like the TreeMap class, make specific guarantees as to their order, while others, like the HashMap class, do not. And Set.iterator() has a similar note: the elements are returned in no particular order (unless the set is an instance of some class that provides a guarantee).
If you want an ordered view of these, use one of the following approaches:
- If all you have is a Set, maybe you really want a SortedSet such as a TreeSet.
- Use a TreeMap, which allows either natural ordering of keys or a specific ordering via Comparator.
- Maintain a SortedSet of keys as well as a Map, which will perform better in amortized time.
- Take the Map.keySet() (or just the Set you're interested in) and put it into a SortedSet such as TreeSet, either using the natural ordering or a specific Comparator.
- Iterate over the Map.Entry<K,V> using Map.entrySet().iterator(), after it has been sorted, to efficiently access both keys and values. Map.Entry is not itself Comparable, so a bare new TreeSet(map.entrySet()) fails at runtime; sort by key instead, e.g. for (final Map.Entry<K,V> entry : new TreeMap<K,V>(map).entrySet()) { }
- Copy the elements into an array and sort it with Arrays.sort(), which has a different performance profile (space and time).
Links to the Source
If you would like to look at the source for j.u.HashSet and j.u.HashMap, they are available on GrepCode. Note that a HashSet is just sugar for a HashMap. Why not always use the sorted versions? Well, as I allude to above, the performance differs and that matters in some applications. See the related SO question here. You can also see some concrete performance numbers at the bottom here (I haven't looked closely to verify these are accurate, but they happen to substantiate my point, so I'll blithely pass along the link. :-) )
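The ordered-view approaches listed in the long answer can be sketched roughly as follows. This is only an illustration under assumptions: the class name OrderedViewsDemo, the helper method names, and the sample fruit data are made up for the example and do not come from the thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical demo class; the names and data are illustrative only.
public class OrderedViewsDemo {

    // Copy the key set into a TreeSet to get the keys in natural order.
    static List<String> sortedKeys(Map<String, Integer> map) {
        return new ArrayList<String>(new TreeSet<String>(map.keySet()));
    }

    // Copy the map into a TreeMap and walk its entries in key order.
    static List<String> sortedEntries(Map<String, Integer> map) {
        List<String> out = new ArrayList<String>();
        for (Map.Entry<String, Integer> entry : new TreeMap<String, Integer>(map).entrySet()) {
            out.add(entry.getKey() + "=" + entry.getValue());
        }
        return out;
    }

    // Dump the keys into an array and sort it once with Arrays.sort().
    static String[] sortedKeyArray(Map<String, Integer> map) {
        String[] keys = map.keySet().toArray(new String[0]);
        Arrays.sort(keys);
        return keys;
    }

    public static void main(String[] args) {
        Map<String, Integer> map = new HashMap<String, Integer>();
        map.put("pear", 3);
        map.put("apple", 1);
        map.put("fig", 2);

        // All three views are sorted by key, regardless of HashMap's internal order.
        System.out.println(sortedKeys(map));
        System.out.println(sortedEntries(map));
        System.out.println(Arrays.toString(sortedKeyArray(map)));
    }
}
```

Each helper pays the sorting cost at the point of use and leaves the HashMap itself untouched, which is the trade the answer describes: constant-time access in the common path, logarithmic or one-off sorting work only where order matters.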
I've struck this before, where the order wasn't important, but did affect the results.
The multi-threaded nature of Java means that repeated runs with exactly the same inputs can be affected by slight timing differences in (for example) how long it takes to allocate a new block of memory, which might sometimes require paging the previous contents out to disk, and at other times not. Some other thread not using that page may proceed, and you could end up with a different order of object creation once System objects are taken into account.
That can affect the Object.hashCode() result for the equivalent object in different runs of the JVM.
For me, I decided to add the small overhead of using a LinkedHashMap, in order to be able to reproduce the results of the tests I was running.
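A minimal sketch of that situation, assuming a hypothetical element type Node that does not override hashCode() (the class and method names here are invented for illustration): the HashSet order depends on identity hash codes and so is not guaranteed to repeat between JVM runs, while the LinkedHashSet order is always insertion order.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical demo class; not taken from the thread.
public class IdentityHashOrderDemo {

    // Does not override hashCode(), so it inherits the identity-based
    // Object.hashCode(), which may differ from one JVM run to the next.
    static final class Node {
        final String name;
        Node(String name) { this.name = name; }
        @Override public String toString() { return name; }
    }

    // Adds five freshly created nodes, always in the same order.
    static void fill(Set<Node> set) {
        for (String name : new String[] {"a", "b", "c", "d", "e"}) {
            set.add(new Node(name));
        }
    }

    // Returns the element names in the set's iteration order.
    static List<String> namesInIterationOrder(Set<Node> set) {
        List<String> names = new ArrayList<String>();
        for (Node n : set) {
            names.add(n.name);
        }
        return names;
    }

    public static void main(String[] args) {
        Set<Node> hashed = new HashSet<Node>();
        Set<Node> linked = new LinkedHashSet<Node>();
        fill(hashed);
        fill(linked);

        // Depends on identity hash codes; no order is guaranteed here.
        System.out.println("HashSet:       " + namesInIterationOrder(hashed));
        // LinkedHashSet iterates in insertion order.
        System.out.println("LinkedHashSet: " + namesInIterationOrder(linked));
    }
}
```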
http://download.oracle.com/javase/1.4.2/docs/api/java/lang/Object.html#hashCode() says that Object.hashCode() is typically implemented by converting the internal address of the object into an integer.
So maybe the internal address changes?
This also means that you could probably fix it without giving up speed by writing your own hashCode() method for everything that should act as a key.
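As a rough sketch of that suggestion, assuming a hypothetical key class Key whose hash is derived only from its field (none of these names come from the thread): because the hash no longer depends on object identity, building the same set with the same sequence of operations yields the same iteration order each time.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical demo class; the Key type is an assumption for illustration.
public class StableHashCodeDemo {

    // Key whose equals() and hashCode() depend only on its id field.
    static final class Key {
        final int id;
        Key(int id) { this.id = id; }

        @Override public boolean equals(Object o) {
            return o instanceof Key && ((Key) o).id == id;
        }

        @Override public int hashCode() {
            return id; // value-based, so identical in every run
        }
    }

    // Builds a HashSet from the ids and returns them in iteration order.
    static List<Integer> idsInIterationOrder(int... ids) {
        Set<Key> set = new HashSet<Key>();
        for (int id : ids) {
            set.add(new Key(id));
        }
        List<Integer> out = new ArrayList<Integer>();
        for (Key k : set) {
            out.add(k.id);
        }
        return out;
    }

    public static void main(String[] args) {
        // Same inputs, same operations: both lines print the same order.
        System.out.println(idsInIterationOrder(42, 7, 19, 3));
        System.out.println(idsInIterationOrder(42, 7, 19, 3));
    }
}
```

The order is still unspecified by the HashSet contract and can change with a different capacity, insertion history, or JDK version; a value-based hashCode() only removes the identity hash as a source of run-to-run variation.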
You should NEVER depend on the order of a hash map.
If you want a Map with a deterministic ordering, I suggest you use a SortedMap/SortedSet like TreeMap/TreeSet or use LinkedHashMap/LinkedHashSet. I use the latter often, not because the program needs the ordering, but because it's easier to read logs/debug the state of the map, i.e. when you add a key, it goes to the end every time.
You can create two HashMap/HashSet instances with the same elements but get different orders depending on the capacity of the collection. It is possible for subtle differences in how your code runs to trigger a different final bucket count and therefore a different order.
For example, HashSets created with different initial capacities can have the same values added in the same order and still produce different iterator orders. You may not be playing with the constructor, but your application could cause a different bucket size indirectly.
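The example code from this answer did not survive on this page, so the following is a stand-in sketch rather than the original. It assumes the usual OpenJDK behaviour of placing an element in a bucket derived from its hash and the table size; the class name CapacityOrderDemo and the values 17, 2 and 35 are chosen purely for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical demo class; a stand-in for the example lost from the page.
public class CapacityOrderDemo {

    // Adds the same values in the same order to a HashSet created with the
    // given initial capacity and returns them in iteration order.
    static List<Integer> iterationOrder(int initialCapacity) {
        Set<Integer> set = new HashSet<Integer>(initialCapacity);
        set.add(17);
        set.add(2);
        set.add(35);
        return new ArrayList<Integer>(set);
    }

    public static void main(String[] args) {
        // Same elements, same insertion order, different table sizes.
        // With 16 buckets, 17 lands in a lower bucket than 2; with 64 it does not.
        System.out.println("capacity 16: " + iterationOrder(16));
        System.out.println("capacity 64: " + iterationOrder(64));
    }
}
```

The two sets are equal as sets, yet on the OpenJDK implementations this sketch assumes they iterate in different orders, because the bucket an element lands in depends on the table size. A resize triggered by growth has the same effect as a different constructor argument.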