Efficiently iterating over the union of the key sets of several Java Maps



In one of my Java 6 projects I have an array of LinkedHashMap instances as input to a method which has to iterate through all keys (i.e. through the union of the key sets of all maps) and work with the associated values. Not all keys exist in all maps and the method should not go through each key more than once or alter the input maps.

My current implementation looks like this:

Set<Object> keyset = new HashSet<Object>();

for (Map<Object, Object> map : input) {
    for (Object key : map.keySet()) {
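        // add() returns true only the first time this key is seen across the maps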
        if (keyset.add(key)) {
            ...
        }
    }
}

The HashSet instance ensures that no key will be acted upon more than once.

Unfortunately this part of the code is rather critical performance-wise, as it is called very frequently. In fact, according to the profiler over 10% of the CPU time is spent in the HashSet.add() method.

I am trying to optimise this code as much as possible. The use of LinkedHashMap with its more efficient iterators (in comparison to the plain HashMap) was a significant boost, but I was hoping to reduce what is essentially book-keeping time to a minimum.

Putting all the keys in the HashSet beforehand using addAll() proved to be less efficient, due to the cost of calling HashSet.contains() afterwards.
At the moment I am looking at whether I can use a bitmap (well, a boolean[] to be exact) to avoid the HashSet completely, but it may not be possible at all, depending on my key range.

Is there a more efficient way to do this? Preferably something that will not pose restrictions on the keys?

EDIT:

A few clarifications and comments:

  • I do need all the values from the maps - I cannot drop any of them.

  • I also need to know which map each value came from. The missing part (...) in my code would be something like this:

    for (Map<Object, Object> m : input) {
        Object v = m.get(key);
    
        // Do something with v
    }
    

    A simple example to get an idea of what I need to do with the maps would be to print all maps in parallel like this:

    Key Map0 Map1 Map2
    F   1    null 2
    B   2    3    null
    C   null null 5
    ...
    

    That's not what I am actually doing, but you should get the idea (a sketch of this parallel printing appears right after this list).

  • The input maps are extremely variable. In fact, each call of this method uses a different set of them. Therefore I would not gain anything by caching the union of their keys.

  • My keys are all String instances. They are sort-of-interned on the heap using a separate HashMap, since they are pretty repetitive, therefore their hash code is already cached and most hash validations (when the HashMap implementation is checking whether two keys are actually equal, after their hash codes match) boil down to an identity comparison (==). The profiler confirms that only 0.5% of the CPU time is spent on String.equals() and String.hashCode().
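
For completeness, here is a minimal sketch of that parallel printing (assuming input is the array of maps; the tab-separated layout and the variable names are illustrative only):

Set<Object> seen = new HashSet<Object>();

// Header row: Key Map0 Map1 Map2 ...
System.out.print("Key");
for (int i = 0; i < input.length; i++) {
    System.out.print("\tMap" + i);
}
System.out.println();

// One row per distinct key, in first-encountered order.
for (Map<Object, Object> map : input) {
    for (Object key : map.keySet()) {
        if (seen.add(key)) {
            System.out.print(key);
            for (Map<Object, Object> m : input) {
                System.out.print("\t" + m.get(key)); // prints "null" for absent keys
            }
            System.out.println();
        }
    }
}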

EDIT 2:

Based on the suggestions in the answers, I made a few tests, profiling and benchmarking along the way. I ended up with roughly a 7% increase in performance. What I did:

  • I set the initial capacity of the HashSet to double the collective size of all input maps. This gained me something in the region of 1-2%, by eliminating most (all?) resize() calls in the HashSet.

  • I used Map.entrySet() for the map I am currently iterating. I had originally avoided this approach due to the additional code and the fear that the extra checks and Map.Entry getter method calls would outweigh any advantages. It turned out that the overall code was slightly faster.

  • I am sure that some people will start screaming at me, but here it is: Raw types. More specifically, I used the raw form of HashSet in the code above. Since I was already using Object as its content type, I do not lose any type safety. The cost of that useless checkcast operation when calling HashSet.add() was apparently important enough to produce a 4% increase in performance when removed. Why the JVM insists on checking casts to Object is beyond me... (A sketch combining all three changes follows this list.)
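
Putting these three changes together, the hot loop ends up looking roughly like the sketch below (the sizing factor, the raw HashSet, and the entrySet() iteration are from the points above; the variable names and the elided body are placeholders):

// Presize to double the collective size of the input maps (point 1).
int totalSize = 0;
for (Map<Object, Object> map : input) {
    totalSize += map.size();
}

// Raw-typed HashSet to avoid the checkcast on add() (point 3).
Set keyset = new HashSet(totalSize * 2);

for (Map<Object, Object> map : input) {
    // Iterate entries so each value comes along with its key (point 2).
    for (Map.Entry<Object, Object> entry : map.entrySet()) {
        if (keyset.add(entry.getKey())) {
            Object v = entry.getValue();
            // ... work with the key and value as before
        }
    }
}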


Comments (4)

若言繁花未落 2024-11-24 10:53:05


Can't provide a replacement for your approach, but here are a few suggestions to (slightly) optimize the existing code.

  1. Consider initializing the hash set with a capacity (the sum of the sizes of all maps). This avoids/reduces resizing of the set during add operations (see the sketch after this list).
  2. Consider not using keySet(), as it will always create a new set in the background. Use entrySet() instead; that should be much faster.
  3. Have a look at the implementations of equals() and hashCode() - if they are "expensive", they will have a negative impact on the add() method.
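
As a sketch, suggestion 1 might look like this (dividing by the default load factor of 0.75 is a refinement beyond the answer: a HashSet rehashes once its size exceeds capacity times load factor):

int total = 0;
for (Map<Object, Object> map : input) {
    total += map.size();
}
// Large enough that no rehash happens even if all keys are distinct.
Set<Object> keyset = new HashSet<Object>((int) (total / 0.75f) + 1);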
唱一曲作罢 2024-11-24 10:53:05


How you avoid using a HashSet depends on what you are doing.

I would only calculate the union once each time the input is changed. This should be relatively rare compared with the number of lookups.

// On an update: rebuild the union once.
Map<Key, Value> union = new LinkedHashMap<Key, Value>();
for (Map<Key, Value> map : input)
    union.putAll(map);

// On a lookup.
Value value = union.get(key);

// Process each key once.
for (Map.Entry<Key, Value> entry : union.entrySet()) {
   // do something.
}
尸血腥色 2024-11-24 10:53:05


Option A is to use the .values() method and iterate through it. But I suppose you had already thought of it.

If the code is called so often, then it might be worth creating additional structures (depending on how often the data is changed). Create a new HashMap; every key in any of your hashmaps becomes a key in this one, and its value is the list of the HashMaps in which that key appears.

This will help if the data is somewhat static (relative to the frequency of queries), so that the overhead of managing the structure stays relatively small, and if the key space is not very dense (keys do not repeat a lot across the different HashMaps), as it will save a lot of unneeded contains() calls.
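
A hedged sketch of such an index (the type, names, and rebuild policy are assumptions, not part of the answer; Java 6 compatible):

// Maps each key to the list of input maps that contain it.
// Rebuilt only when the input maps change.
Map<Object, List<Map<Object, Object>>> index =
        new LinkedHashMap<Object, List<Map<Object, Object>>>();

for (Map<Object, Object> map : input) {
    for (Object key : map.keySet()) {
        List<Map<Object, Object>> owners = index.get(key);
        if (owners == null) {
            owners = new ArrayList<Map<Object, Object>>();
            index.put(key, owners);
        }
        owners.add(map);
    }
}

// Iterating the union is then a single pass, and each key already knows
// which maps it appears in - no contains() checks needed.
for (Map.Entry<Object, List<Map<Object, Object>>> e : index.entrySet()) {
    Object key = e.getKey();
    for (Map<Object, Object> owner : e.getValue()) {
        Object v = owner.get(key);
        // do something with v
    }
}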

Of course, if you are mixing data structures it is better to encapsulate it all in your own data structure.
