HashMap performance with different initial capacities and load factors
Here is my situation. I am using two java.util.HashMaps to store some frequently used data in a Java web app running on Tomcat. I know the exact number of entries in each HashMap. The keys will be Strings and ints, respectively.
My question is: what is the best way to set the initial capacity and load factor?
Should I set the capacity equal to the number of elements it will hold and the load factor to 1.0? I would like the best possible performance without using too much memory. I am afraid, however, that the table would not fill optimally. With a table of exactly the size needed, won't there be key collisions, causing a (usually short) scan to find the correct element?
Assuming (and this is a stretch) that the hash function is a simple mod 5 of the integer keys, wouldn't that mean that keys 5, 10, 15 would hit the same bucket and then cause a seek to fill the buckets next to them? Would a larger initial capacity increase performance?
Also, if there is a better data structure than a HashMap for this, I am completely open to that as well.
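For concreteness, this is roughly the setup I mean; the entry counts below are placeholders for the real, known numbers:

```java
import java.util.HashMap;
import java.util.Map;

public class MapSetup {
    public static void main(String[] args) {
        // Placeholder counts; in the real app the exact numbers are known.
        int stringKeyedEntries = 500;
        int intKeyedEntries = 200;

        // The two knobs in question: HashMap(initialCapacity, loadFactor).
        Map<String, Object> byString = new HashMap<>(stringKeyedEntries, 1.0f);
        Map<Integer, Object> byInt = new HashMap<>(intKeyedEntries, 1.0f);

        System.out.println(byString.size() + " / " + byInt.size());
    }
}
```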
In the absence of a perfect hashing function for your data, and assuming that this is not a micro-optimization of something that doesn't really matter, I would try the following:

Assume the default load factor (0.75) used by HashMap is a good value in most situations. That being the case, you can use it, and set the initial capacity of your HashMap based on your own knowledge of how many items it will hold: set it so that initial capacity × 0.75 = number of items (rounding up).
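A minimal sketch of that sizing rule, assuming a placeholder entry count of 1000:

```java
import java.util.HashMap;
import java.util.Map;

public class Presize {
    // Round up so that initialCapacity x 0.75 >= expected entries,
    // which avoids any resize while the map fills.
    static int initialCapacityFor(int expectedEntries) {
        return (int) Math.ceil(expectedEntries / 0.75);
    }

    public static void main(String[] args) {
        int expectedEntries = 1000; // placeholder for the known count
        Map<String, Object> map =
                new HashMap<>(initialCapacityFor(expectedEntries), 0.75f);
        System.out.println(initialCapacityFor(expectedEntries)); // 1334
    }
}
```

(As a later answer notes, HashMap will still round the capacity you pass in up to the next power of two internally.)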
If it were a larger map, in a situation where high-speed lookup was really critical, I would suggest using some sort of trie rather than a hash map. For long string keys in large maps, you can save space, and sometimes time, by using a more string-oriented data structure such as a trie.
Assuming that your hash function is "good", the best thing to do is to set the initial size to the expected number of elements, assuming that you can get a good estimate cheaply. It is a good idea to do this because when a HashMap resizes it has to redistribute every entry in the table into a new, larger bucket array.

Leave the load factor at 0.75. The value 0.75 was chosen empirically as a good compromise between hash-lookup performance and space usage for the primary hash array. As you push the load factor up, the average lookup time will increase significantly. If you want to dig into the mathematics of hash-table behaviour: Donald Knuth (1998). The Art of Computer Programming. Vol. 3: Sorting and Searching (2nd ed.). Addison-Wesley. pp. 513–558. ISBN 0-201-89685-0.
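For reference, the standard result for chained hashing, which java.util.HashMap uses, is that a successful lookup examines about 1 + α/2 entries and an unsuccessful one about α, where α is the load factor, so the work grows roughly linearly as you push α up. (This is the kind of analysis the Knuth reference above works through.)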
I find it best not to fiddle around with the default settings unless I really, really need to.

HotSpot does a great job of optimizing for you.

In any case, I would use a profiler (say, the NetBeans Profiler) to measure the problem first.

We routinely store maps with tens of thousands of elements, and if you have a good equals and hashCode implementation (and String and Integer do!), that will matter more than any load-factor tweaks you make.
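Purely for illustration, a hypothetical composite key written to honour the equals/hashCode contract the way String and Integer already do:

```java
import java.util.Objects;

// Hypothetical key class: equal objects must have equal hash codes,
// which is what lets HashMap find entries reliably.
final class CacheKey {
    private final String name;
    private final int id;

    CacheKey(String name, int id) {
        this.name = name;
        this.id = id;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof CacheKey)) return false;
        CacheKey other = (CacheKey) o;
        return id == other.id && Objects.equals(name, other.name);
    }

    @Override
    public int hashCode() {
        return Objects.hash(name, id);
    }
}
```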
It's not a simple mod of the key's hashCode. From HashMap.java:
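(Presumably the supplemental hash function was quoted here; in the Java 6 source it reads:)

```java
/**
 * Applies a supplemental hash function to a given hashCode, which
 * defends against poor quality hash functions.  This is critical
 * because HashMap uses power-of-two length hash tables, that
 * otherwise encounter collisions for hashCodes that do not differ
 * in lower bits. Note: Null keys always map to hash 0, thus index 0.
 */
static int hash(int h) {
    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
```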
I'm not even going to pretend I understand that, but it looks like it's designed to handle exactly that situation.

Note also that the number of buckets is always a power of 2, no matter what size you ask for.
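For reference, a sketch of that rounding, along the lines of what the Java 6 constructor did internally:

```java
public class PowerOfTwo {
    // Round a requested capacity up to the next power of two,
    // mirroring what the HashMap constructor does with its argument.
    static int roundUpToPowerOfTwo(int requested) {
        int capacity = 1;
        while (capacity < requested) {
            capacity <<= 1; // keep doubling until it covers the request
        }
        return capacity;
    }

    public static void main(String[] args) {
        System.out.println(roundUpToPowerOfTwo(1000)); // prints 1024
    }
}
```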
Entries are allocated to buckets in a random-like way. So even if you have as many buckets as entries, some of the buckets will have collisions.

If you have more buckets, you'll have fewer collisions. However, more buckets also means the table is spread out further in memory, and therefore slower. Generally a load factor in the range 0.7-0.8 is roughly optimal, so it is probably not worth changing.

As ever, it's probably worth profiling before you get hung up on micro-tuning these things.
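A quick sketch that simulates the point, assuming idealized uniformly random bucket assignment:

```java
import java.util.Random;

public class CollisionDemo {
    public static void main(String[] args) {
        int n = 1_000_000;          // as many buckets as entries
        int[] buckets = new int[n];
        Random rnd = new Random(42);

        // Drop n entries into n buckets uniformly at random.
        for (int i = 0; i < n; i++) {
            buckets[rnd.nextInt(n)]++;
        }

        int empty = 0, collided = 0;
        for (int count : buckets) {
            if (count == 0) empty++;
            else if (count > 1) collided++;
        }

        // Expect roughly 1/e (~37%) of buckets empty and roughly a
        // quarter holding two or more entries, despite the 1:1 ratio.
        System.out.printf("empty: %d, with collisions: %d%n", empty, collided);
    }
}
```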