Efficiently iterating over the union of the key sets of several Java Maps



In one of my Java 6 projects I have an array of LinkedHashMap instances as input to a method which has to iterate through all keys (i.e. through the union of the key sets of all maps) and work with the associated values. Not all keys exist in all maps and the method should not go through each key more than once or alter the input maps.

My current implementation looks like this:

Set<Object> keyset = new HashSet<Object>();

for (Map<Object, Object> map : input) {
    for (Object key : map.keySet()) {
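        // add() returns true only the first time this key is seen across the maps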
        if (keyset.add(key)) {
            ...
        }
    }
}

The HashSet instance ensures that no key will be acted upon more than once.

Unfortunately this part of the code is rather critical performance-wise, as it is called very frequently. In fact, according to the profiler over 10% of the CPU time is spent in the HashSet.add() method.

I am trying to optimise this code as much as possible. The use of LinkedHashMap with its more efficient iterators (in comparison to the plain HashMap) was a significant boost, but I was hoping to reduce what is essentially book-keeping time to a minimum.

Putting all the keys in the HashSet beforehand using addAll() proved to be less efficient, due to the cost of calling HashSet.contains() afterwards.
At the moment I am looking at whether I can use a bitmap (well, a boolean[] to be exact) to avoid the HashSet completely, but it may not be possible at all, depending on my key range.

Is there a more efficient way to do this? Preferably something that will not pose restrictions on the keys?

EDIT:

A few clarifications and comments:

  • I do need all the values from the maps - I cannot drop any of them.

  • I also need to know which map each value came from. The missing part (...) in my code would be something like this:

    for (Map<Object, Object> m : input) {
        Object v = m.get(key);
    
        // Do something with v
    }
    

    A simple example to get an idea of what I need to do with the maps would be to print all maps in parallel like this:

    Key Map0 Map1 Map2
    F   1    null 2
    B   2    3    null
    C   null null 5
    ...
    

    That's not what I am actually doing, but you should get the idea (a sketch of this parallel printing appears right after this list).

  • The input maps are extremely variable. In fact, each call of this method uses a different set of them. Therefore I would not gain anything by caching the union of their keys.

  • My keys are all String instances. They are sort-of-interned on the heap using a separate HashMap, since they are pretty repetitive, therefore their hash code is already cached and most hash validations (when the HashMap implementation is checking whether two keys are actually equal, after their hash codes match) boil down to an identity comparison (==). The profiler confirms that only 0.5% of the CPU time is spent on String.equals() and String.hashCode().
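
For completeness, here is a minimal sketch of that parallel printing (assuming input is the array of maps; the tab-separated layout and the variable names are illustrative only):

Set<Object> seen = new HashSet<Object>();

// Header row: Key Map0 Map1 Map2 ...
System.out.print("Key");
for (int i = 0; i < input.length; i++) {
    System.out.print("\tMap" + i);
}
System.out.println();

// One row per distinct key, in first-encountered order.
for (Map<Object, Object> map : input) {
    for (Object key : map.keySet()) {
        if (seen.add(key)) {
            System.out.print(key);
            for (Map<Object, Object> m : input) {
                System.out.print("\t" + m.get(key)); // prints "null" for absent keys
            }
            System.out.println();
        }
    }
}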

EDIT 2:

Based on the suggestions in the answers, I made a few tests, profiling and benchmarking along the way. I ended up with roughly a 7% increase in performance. What I did:

  • I set the initial capacity of the HashSet to double the collective size of all input maps. This gained me something in the region of 1-2%, by eliminating most (all?) resize() calls in the HashSet.

  • I used Map.entrySet() for the map I am currently iterating. I had originally avoided this approach due to the additional code and the fear that the extra checks and Map.Entry getter method calls would outweigh any advantages. It turned out that the overall code was slightly faster.

  • I am sure that some people will start screaming at me, but here it is: Raw types. More specifically, I used the raw form of HashSet in the code above. Since I was already using Object as its content type, I do not lose any type safety. The cost of that useless checkcast operation when calling HashSet.add() was apparently important enough to produce a 4% increase in performance when removed. Why the JVM insists on checking casts to Object is beyond me... (A sketch combining all three changes follows this list.)
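
Putting these three changes together, the hot loop ends up looking roughly like the sketch below (the sizing factor, the raw HashSet, and the entrySet() iteration are from the points above; the variable names and the elided body are placeholders):

// Presize to double the collective size of the input maps (point 1).
int totalSize = 0;
for (Map<Object, Object> map : input) {
    totalSize += map.size();
}

// Raw-typed HashSet to avoid the checkcast on add() (point 3).
Set keyset = new HashSet(totalSize * 2);

for (Map<Object, Object> map : input) {
    // Iterate entries so each value comes along with its key (point 2).
    for (Map.Entry<Object, Object> entry : map.entrySet()) {
        if (keyset.add(entry.getKey())) {
            Object v = entry.getValue();
            // ... work with the key and value as before
        }
    }
}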


Comments (4)

若言繁花未落 2024-11-24 10:53:05


Can't provide a replacement for your approach, but here are a few suggestions to (slightly) optimize the existing code.

  1. Consider initializing the hash set with a capacity (the sum of the sizes of all maps). This avoids/reduces resizing of the set during add operations (see the sketch after this list).
  2. Consider not using keySet(), as it will always create a new set in the background. Use entrySet() instead; that should be much faster.
  3. Have a look at the implementations of equals() and hashCode() - if they are "expensive", they will have a negative impact on the add() method.
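
As a sketch, suggestion 1 might look like this (dividing by the default load factor of 0.75 is a refinement beyond the answer: a HashSet rehashes once its size exceeds capacity times load factor):

int total = 0;
for (Map<Object, Object> map : input) {
    total += map.size();
}
// Large enough that no rehash happens even if all keys are distinct.
Set<Object> keyset = new HashSet<Object>((int) (total / 0.75f) + 1);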
唱一曲作罢 2024-11-24 10:53:05


How you avoid using a HashSet depends on what you are doing.

I would only calculate the union once each time the input is changed. This should be relatively rare compared with the number of lookups.

// On an update: rebuild the union once.
Map<Key, Value> union = new LinkedHashMap<Key, Value>();
for (Map<Key, Value> map : input)
    union.putAll(map);

// On a lookup.
Value value = union.get(key);

// Process each key once.
for (Map.Entry<Key, Value> entry : union.entrySet()) {
   // do something.
}
尸血腥色 2024-11-24 10:53:05


Option A is to use the .values() method and iterate through it. But I suppose you had already thought of it.

If the code is called so often, then it might be worth creating additional structures (depending on how often the data is changed). Create a new HashMap; every key in any of your hashmaps becomes a key in this one, and its value is the list of the HashMaps in which that key appears.

This will help if the data is somewhat static (relative to the frequency of queries), so that the overhead of managing the structure stays relatively small, and if the key space is not very dense (keys do not repeat a lot across the different HashMaps), as it will save a lot of unneeded contains() calls.
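
A hedged sketch of such an index (the type, names, and rebuild policy are assumptions, not part of the answer; Java 6 compatible):

// Maps each key to the list of input maps that contain it.
// Rebuilt only when the input maps change.
Map<Object, List<Map<Object, Object>>> index =
        new LinkedHashMap<Object, List<Map<Object, Object>>>();

for (Map<Object, Object> map : input) {
    for (Object key : map.keySet()) {
        List<Map<Object, Object>> owners = index.get(key);
        if (owners == null) {
            owners = new ArrayList<Map<Object, Object>>();
            index.put(key, owners);
        }
        owners.add(map);
    }
}

// Iterating the union is then a single pass, and each key already knows
// which maps it appears in - no contains() checks needed.
for (Map.Entry<Object, List<Map<Object, Object>>> e : index.entrySet()) {
    Object key = e.getKey();
    for (Map<Object, Object> owner : e.getValue()) {
        Object v = owner.get(key);
        // do something with v
    }
}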

Of course, if you are mixing data structures it is better to encapsulate it all in your own data structure.
