What causes the slightly unpredictable ordering of iterator() for java.util.HashSet and HashMap.keySet()?

Posted 2024-10-07 08:05:32

Six years ago, I burned several days trying to hunt down where my perfectly deterministic framework was responding randomly. After meticulously checking the entire framework to ensure that it was all using the same instance of Random, I kept chasing the problem by single-stepping through the code. It was highly repetitive, iterative, self-calling code. Worse, the damn effect would only show up after a huge number of iterations had completed. After 6+ hours, I was at my wits' end when I discovered a line in the javadoc for HashSet.iterator() indicating that it doesn't guarantee the order in which it returns elements. I then went through my entire code base and replaced all instances of HashSet with LinkedHashSet. And lo and behold, my framework sprang right to deterministic life! ARGH!

I have now just experienced this same FREAKIN effect, again (at least it was only 3 hours this time). For whatever reason, I missed the small detail that HashMap happens to BEHAVE THE SAME WAY for its keySet().

Here's an SO thread on this subject, although the discussion never quite answers my question: Iteration order of HashSet

So, I am curious as to why this might occur. Both times I had a huge single-threaded Java application crawling through exactly the same instantiation/insertion space with exactly the same JVM parameters (multiple runs from the same batch file) on the same computer with almost nothing else running. What could possibly perturb the JVM such that HashSet and HashMap would, after an enormous number of iterations, behave unpredictably (not inconsistently with the javadoc, which says not to depend upon the order)?

Any ideas around this from either the source code (the implementation of these classes in java.util) or from your knowledge of the JVM (perhaps some GC effect where internal Java classes get non-zeroed memory when allocating internal memory spaces)?

Comments (4)

谜兔 2024-10-14 08:05:32

Short Answer

There's a tradeoff. If you want amortized constant time O(1) access to elements, the techniques to date rely upon a scheme like hashing, which scatters elements across buckets in an effectively arbitrary order. If you want ordered access to elements, the best engineering tradeoff gives you only O(log n) performance. For your case, perhaps this doesn't matter, but the difference between constant time and logarithmic time becomes noticeable even with relatively small structures.

So yes, you can go look at the code and inspect carefully, but it boils down to a rather practical theoretical fact. Now is a good time to brush the dust off that copy of Cormen (or Googly Bookiness here) that's propping up the drooping corner of your house's foundation and take a look at Chapters 11 (Hash Tables) and 13 (Red-Black Trees). These will fill you in on the JDK's implementation of HashMap and TreeMap, respectively.

Long Answer

You don't want a Map or Set to return ordered lists of keys/members. That's not what they're for. Map and Set structures, just like the underlying mathematical concepts, are not ordered, and the ordered variants have different performance. The objective of these data structures (as @thejh points out) is efficient amortized insert, contains, and get time, not maintaining ordering. You can look into how a hashed data structure is maintained to know what the tradeoffs are. Take a look at the Wikipedia entries on Hash Functions and Hash Tables (ironically, note that the Wiki entry for "unordered map" redirects to the latter) or a computer science / data structures text.

Remember: Don't depend on properties of ADTs (and specifically collections) such as ordering, immutability, thread safety or anything else unless you look carefully at what the contract is. Note that for Map, the Javadoc says clearly:

The order of a map is defined as the
order in which the iterators on the
map's collection views return their
elements. Some map implementations,
like the TreeMap class, make specific
guarantees as to their order; others,
like the HashMap class, do not.

And Set.iterator() has something similar:

Returns an iterator over the elements
in this set. The elements are returned
in no particular order (unless this
set is an instance of some class that
provides a guarantee).

If you want an ordered view of these, use one of the following approaches:

  • If it's just a Set, maybe you really want a SortedSet such as a TreeSet
  • Use a TreeMap, which allows either natural ordering of keys or a specific ordering via Comparator
  • Abstract your data structure, which probably is an application-specific thing anyway if this is the behavior you want, and maintain both a SortedSet of keys as well as a Map, which will perform better in amortized time.
  • Get the Map.keySet() (or just the Set you're interested in) and put it into a SortedSet such as TreeSet, either using the natural ordering or a specific Comparator.
  • Iterate over the Map.Entry<K,V> pairs in sorted key order, e.g. for (final Map.Entry<K,V> entry : new TreeMap<K,V>(map).entrySet()) { }, to efficiently access both keys and values. (Map.Entry is not itself Comparable, so passing map.entrySet() straight to a TreeSet without a Comparator throws a ClassCastException for a non-empty map.)
  • If you are only doing this once in a while, you could just get an array of values out of your structure and use Arrays.sort(), which has a different performance profile (space and time).
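
A minimal sketch of two of the approaches above, assuming String keys and Integer values. The class name OrderedViews, its helper methods, and the sample data are made up for illustration and are not from the original answer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

public class OrderedViews {

    // Copy the keys into a TreeSet to iterate them in natural (sorted) order.
    static List<String> sortedKeys(Map<String, Integer> map) {
        return new ArrayList<>(new TreeSet<>(map.keySet()));
    }

    // Copy the whole map into a TreeMap to visit entries in key order.
    static List<String> sortedEntries(Map<String, Integer> map) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : new TreeMap<>(map).entrySet()) {
            out.add(e.getKey() + "=" + e.getValue());
        }
        return out;
    }

    // Hypothetical sample data, inserted in non-sorted order on purpose.
    static Map<String, Integer> sample() {
        Map<String, Integer> map = new HashMap<>();
        map.put("cherry", 3);
        map.put("apple", 1);
        map.put("banana", 2);
        return map;
    }

    public static void main(String[] args) {
        Map<String, Integer> map = sample();
        System.out.println(sortedKeys(map));
        System.out.println(sortedEntries(map));
    }
}
```

Both helpers pay O(n log n) for the copy each time they are called, so this suits occasional ordered reads; if ordered iteration happens constantly, keeping the data in a TreeMap from the start is probably the better fit.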

Links to the Source

If you would like to look at the source for java.util.HashSet and java.util.HashMap, they are available on GrepCode. Note that a HashSet is just sugar for a HashMap. Why not always use the sorted versions? Well, as I alluded to above, the performance differs, and that matters in some applications. See the related SO question here. You can also see some concrete performance numbers at the bottom here (I haven't looked closely to verify these are accurate, but they happen to substantiate my point, so I'll blithely pass along the link. :-)

眉目亦如画i 2024-10-14 08:05:32

I've struck this before, where the order wasn't important, but did affect the results.

The multi-threaded nature of Java means that repeated runs with exactly the same inputs can be affected by slight timing differences in (for example) how long it takes to allocate a new block of memory, which might sometimes require paging out to disk the previous contents, and at other times that isn't needed. Some other thread not using that page may proceed, and you could end up with a different order of object creation, when System objects are taken into account.

That can affect the Object.hashCode() result for the equivalent object in different runs of the JVM.

For me, I decided to add the small overhead of using a LinkedHashMap, in order to be able to reproduce the results of the tests I was running.
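
A small sketch of that difference, assuming a key type that does not override hashCode() and so falls back to Object's identity-based hash. The Node class and the helper names are hypothetical, chosen only for this illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class IdentityHashOrder {

    // Hypothetical key type: no hashCode()/equals() override, so the
    // hash comes from Object.hashCode() and may differ between JVM runs.
    static final class Node {
        final String name;
        Node(String name) { this.name = name; }
    }

    // Inserts five fresh Node objects, always in the same order.
    static Set<Node> fill(Set<Node> set) {
        for (String s : new String[] {"a", "b", "c", "d", "e"}) {
            set.add(new Node(s));
        }
        return set;
    }

    // Collects the names in the set's iteration order.
    static List<String> names(Set<Node> set) {
        List<String> out = new ArrayList<>();
        for (Node n : set) {
            out.add(n.name);
        }
        return out;
    }

    public static void main(String[] args) {
        // Order follows the identity hash codes, so it is not reproducible by contract.
        System.out.println(names(fill(new HashSet<>())));
        // Order follows insertion, independent of any hash code.
        System.out.println(names(fill(new LinkedHashSet<>())));
    }
}
```

The LinkedHashSet line always lists the names in insertion order; the HashSet line may or may not change from run to run, which is exactly why it is hard to notice.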

_失温 2024-10-14 08:05:32

http://download.oracle.com/javase/1.4.2/docs/api/java/lang/Object.html#hashCode() says:

As much as is reasonably practical,
the hashCode method defined by class
Object does return distinct integers
for distinct objects. (This is
typically implemented by converting
the internal address of the object
into an integer, but this
implementation technique is not
required by the JavaTM programming
language.)

So maybe the internal address changes?

This also means that you could probably fix it without giving up speed by writing your own hashCode() method for everything that should act as a key.
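
A rough sketch of that idea, assuming each key carries a stable int id. The Key class and the iterationOrder() helper are hypothetical names for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StableKeyDemo {

    // Hypothetical key whose hash depends only on its own data,
    // never on where the object happens to live in memory.
    static final class Key {
        final int id;
        Key(int id) { this.id = id; }

        @Override
        public boolean equals(Object o) {
            return o instanceof Key && ((Key) o).id == id;
        }

        @Override
        public int hashCode() {
            return Integer.hashCode(id);
        }
    }

    // Builds a fresh HashSet with the same keys in the same insertion
    // order and returns the ids in the set's iteration order.
    static List<Integer> iterationOrder() {
        Set<Key> set = new HashSet<>();
        for (int id = 100; id >= 0; id -= 10) {
            set.add(new Key(id));
        }
        List<Integer> out = new ArrayList<>();
        for (Key k : set) {
            out.add(k.id);
        }
        return out;
    }

    public static void main(String[] args) {
        // Same hash codes plus same insertion sequence give the same order.
        System.out.println(iterationOrder().equals(iterationOrder()));
    }
}
```

This only removes identity hash codes as a source of variation. The order is still unspecified by the HashSet contract and can change with capacity or JDK version, so a LinkedHashSet or TreeSet remains the safer choice when the order matters.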

喜爱皱眉﹌ 2024-10-14 08:05:32

You should NEVER depend on the order of a hash map.

If you want a Map with a deterministic ordering, I suggest you use a SortedMap/SortedSet like TreeMap/TreeSet or use LinkedHashMap/LinkedHashSet. I use the latter often, not because the program needs the ordering, but because it's easier to read logs and debug the state of the map, i.e. when you add a new key, it goes to the end every time.

You can create two HashMaps/HashSets with the same elements but get different orders depending on the capacity of the collection. It is possible for subtle differences in how your code runs to trigger a different final bucket size and therefore a different order.

e.g.

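import java.util.Arrays;
import java.util.HashSet;

public class HashSetOrder {
    public static void main(String... args) {
        // Same elements each time; only initial capacity and load factor differ.
        printInts(new HashSet<Integer>(8,2));
        printInts(new HashSet<Integer>(16,1));
        printInts(new HashSet<Integer>(32,1));
        printInts(new HashSet<Integer>(64,1));
    }

    // Adds the same values in the same order, then prints the iteration order.
    private static void printInts(HashSet<Integer> integers) {
        integers.addAll(Arrays.asList(0,10,20,30,40,50,60,70,80,90,100));
        System.out.println(integers);
    }
}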

prints (on the JDK I ran it on; the exact orders depend on the Java version)

[0, 50, 100, 70, 40, 10, 80, 20, 90, 60, 30]
[0, 50, 100, 70, 80, 20, 40, 10, 90, 60, 30]
[0, 100, 70, 40, 10, 50, 80, 20, 90, 60, 30]
[0, 70, 10, 80, 20, 90, 30, 100, 40, 50, 60]

Here you have HashSet(s) with the same values added in the same order resulting in different iterator orders. You may not be playing with the constructor, but your application could cause a different bucket size indirectly.
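
To see why capacity matters, here is a rough simulation of how keys map to buckets. It assumes the hash spreading used by OpenJDK 8 and later, (h ^ (h >>> 16)) & (capacity - 1), a power-of-two capacity, and a table that never resizes; older JDKs mix the bits differently, which is why the orders printed above need not match. The class and method names are made up for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

public class BucketOrderSketch {

    // Bucket index as computed by OpenJDK 8+ HashMap (an assumption about
    // the implementation, not part of the documented contract).
    // capacity must be a power of two.
    static int bucketIndex(Object key, int capacity) {
        int h = key.hashCode();
        return (h ^ (h >>> 16)) & (capacity - 1);
    }

    // Simulated iteration order for a table that never resizes: buckets are
    // visited from index 0 upward, keys within a bucket keep insertion order.
    static List<Integer> simulatedOrder(List<Integer> keys, int capacity) {
        List<Integer> order = new ArrayList<>();
        for (int bucket = 0; bucket < capacity; bucket++) {
            for (Integer key : keys) {
                if (bucketIndex(key, capacity) == bucket) {
                    order.add(key);
                }
            }
        }
        return order;
    }

    // The same keys as in the example above: 0, 10, ..., 100.
    static List<Integer> sampleKeys() {
        List<Integer> keys = new ArrayList<>();
        for (int i = 0; i <= 100; i += 10) {
            keys.add(i);
        }
        return keys;
    }

    public static void main(String[] args) {
        for (int capacity : new int[] {8, 16, 32, 64}) {
            System.out.println(capacity + " -> " + simulatedOrder(sampleKeys(), capacity));
        }
    }
}
```

Under those assumptions, capacity 8 gives 0, 40, 80, 10, 50, 90, 20, 60, 100, 30, 70, while capacity 16 gives 0, 80, 50, 20, 100, 70, 40, 10, 90, 60, 30: the same keys, inserted in the same order, land in different buckets purely because the table size changed.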
