Why does having more '1' bits in my key make putting it into a HashMap take so much longer?
I'm doing a project for a class which focuses on storing a huge matrix with mostly 0 values in memory and performing some matrix math on it. My first thought was to use a HashMap to store the matrix elements, and only store the elements which are non-zero, in order to avoid using huge quantities of memory.
I wanted to make a key for the HashMap which would represent both the row and column number of the element in a way that, when I accessed that entry in the map, I could re-extract both values. I don't know Java as well as C# - in C# I would make a struct with Row and Column members, but in Java I quickly realized there are no user value types. With a deadline looming I went with a safe bet and made the key a long. I stored the row data (32-bit int) in the first 32 bits and the column data in the last 32 using some very simple bit shifting. [EDIT: I'd also like to note that my HashMap is initialized with a specific initial size which exactly represents the number of values I store in it, and which is never exceeded.]
[Side note: the reason I want to be able to extract the row/column data again is to greatly increase the efficiency of matrix multiplication, from O(n^2) to O(n), and a smaller n to boot.]
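For reference, a key packed this way can be unpacked again with a shift and a mask; a minimal sketch, with illustrative method names that are not from the original code:

public int getRow(long key) {
    return (int)(key >>> 32);        //the upper 32 bits hold the row
}

public int getColumn(long key) {
    return (int)(key & 0xFFFFFFFFL); //the lower 32 bits hold the column
}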
What I noticed after implementing this structure is that it takes a whopping 7 seconds to read a 23426 x 23426 matrix from a text file in which only non-zero elements are given, but it only takes 2 seconds to calculate the eigenvalues we are required to give! After selectively commenting out methods, I have concluded that the bulk of this 7-second timespan is spent storing my values in the HashMap.
public void Set(double value, int row, int column) {
    //assemble the long key, placing row and column in adjacent sets of bits
    long key = (long)row << SIZE_BIT_MAX; //(SIZE_BIT_MAX is 32)
    key += column;
    elements.put(key, value);
}
That is the code for setting a value. If I use this method instead:
public void Set(double value, int row, int column) {
    //create a distinct but smaller key (around 32 bits max)
    long key = (long)(row * matrixSize) + column;
    elements.put(key, value);
}
The reading only takes 2 seconds. Both of these versions of the key are distinct for every element, both are of type long, and the actual code to create either of them is minimal in complexity. It's the elements.put(key, value) call which makes the difference between 7 seconds and 2.

My question is, why? The difference I see between these key versions is that the first one has bits set to 1 throughout and more frequently, while the second has all of its highest 32 bits set to 0. Am I chasing a red herring, or is this fairly dramatic difference in performance the result of something internal to the HashMap.put method?
3 Answers
Take a look at how Long implements the hashCode() method (at least in OpenJDK 7).
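In essence (paraphrasing the OpenJDK source rather than quoting it verbatim), it XORs the two 32-bit halves of the value:

public int hashCode() {
    //fold the upper 32 bits onto the lower 32 bits
    return (int)(value ^ (value >>> 32));
}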
This means that your key gets stuffed back into 32 bits; the row and column bits cancel each other out quite often, resulting in a lot of collisions, which requires the HashMap to spend extra time looking for a free slot in a bucket. Your second method avoids that problem because every key's generated hash code is a unique value (because you only have 23426 x 23426 = 548777476 items, which fits well into 32 bits).

So, the reason is your key selection, not the number of set bits.
However, what exactly do you mean by "user value types"? A plain class with Row and Column fields can make a perfectly good key for a Map in Java once you implement hashCode() and equals(). Just make sure that you don't implement its hashCode method the way Long does. :)
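A minimal sketch of such a key class (the class name and the exact hash formula below are my own illustrative choices, not taken from the answer):

final class MatrixKey {
    private final int row;
    private final int column;

    MatrixKey(int row, int column) {
        this.row = row;
        this.column = column;
    }

    int getRow()    { return row; }
    int getColumn() { return column; }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof MatrixKey)) return false;
        MatrixKey other = (MatrixKey) o;
        return row == other.row && column == other.column;
    }

    @Override
    public int hashCode() {
        //mix the two fields rather than XORing them, so that
        //(row, column) and (column, row) do not collide
        return 31 * row + column;
    }
}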
From the JDK 6 documentation for Long.hashCode() (note that your long primitive is autoboxed into a Long object - whereas in C# primitives actually are objects), the hash code is the XOR of the upper and lower 32 bits of the value. I think that, given this definition, this explains why:
the collision rate is reduced when you introduce more entropy and thus disperse it more via the upper half of the long value. (edit: I read the order wrong, so here's the counter-argument below)

The collisions might be more likely when extending into the long range - after all, in Java, hash codes are only int-sized, so you can only have a limited amount of equal distribution. If you know it's "evenly" distributed over an int range then your collisions are reduced. If you spread that out across the long range, then it greatly increases your chance of collision.
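To make that concrete, here is a small self-contained check (my own illustration, not code from this answer): with the shifted key, Long's hash code reduces to row ^ col, so a 1000 x 1000 block of positions produces fewer than 1024 distinct hash codes, while the multiplied key hashes every position to a different value.

import java.util.HashSet;
import java.util.Set;

public class KeyHashDemo {
    public static void main(String[] args) {
        final int matrixSize = 23426;                //dimension used in the question
        Set<Integer> shiftedHashes = new HashSet<Integer>();
        Set<Integer> multipliedHashes = new HashSet<Integer>();
        for (int row = 0; row < 1000; row++) {
            for (int col = 0; col < 1000; col++) {
                long shiftedKey = ((long) row << 32) + col;          //first key scheme
                long multipliedKey = (long) row * matrixSize + col;  //second key scheme
                shiftedHashes.add(Long.valueOf(shiftedKey).hashCode());
                multipliedHashes.add(Long.valueOf(multipliedKey).hashCode());
            }
        }
        //1,000,000 positions were inserted into each set
        System.out.println("distinct hashes, shifted key:    " + shiftedHashes.size());
        System.out.println("distinct hashes, multiplied key: " + multipliedHashes.size());
    }
}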
Here's from the HashMap Java documentation (emphasis mine): an instance of HashMap has two parameters that affect its performance, the initial capacity and the load factor.

Side note: you'll find even greater performance gains by tuning the initial capacity and load factor - check the HashMap documentation for more information.
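As a sketch of that tuning (the element count handling below is an assumption, not a figure from the question): HashMap has a constructor taking an initial capacity and a load factor, and sizing the map so that entries / loadFactor never exceeds the capacity means it never has to rehash while the file is being read.

import java.util.HashMap;
import java.util.Map;

public class SparseMatrixStorage {
    //pre-size the map for an expected number of non-zero elements
    static Map<Long, Double> createElementMap(int expectedEntries) {
        float loadFactor = 0.75f;                                     //the default load factor
        int initialCapacity = (int) Math.ceil(expectedEntries / loadFactor);
        return new HashMap<Long, Double>(initialCapacity, loadFactor);
    }
}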
Depending on the implementation, you could be hitting hash collisions.
If all of your hash values end up in the same "bucket", the implementation will normally throw them onto a list of some type. If this is the case, your access times will suffer significantly.
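One rough way to confirm that the puts themselves are the bottleneck (my own sketch; the element count is arbitrary and the timings will vary by machine and JDK) is to time both key schemes over the same random positions:

import java.util.HashMap;
import java.util.Map;
import java.util.Random;

public class PutTimingDemo {
    static final int SIZE = 23426;        //matrix dimension from the question
    static final int ELEMENTS = 1000000;  //arbitrary number of non-zero entries

    public static void main(String[] args) {
        Random random = new Random(42);
        int[] rows = new int[ELEMENTS];
        int[] cols = new int[ELEMENTS];
        for (int i = 0; i < ELEMENTS; i++) {
            rows[i] = random.nextInt(SIZE);
            cols[i] = random.nextInt(SIZE);
        }

        long start = System.nanoTime();
        Map<Long, Double> shifted = new HashMap<Long, Double>(ELEMENTS);
        for (int i = 0; i < ELEMENTS; i++) {
            shifted.put(((long) rows[i] << 32) + cols[i], 1.0);      //first key scheme
        }
        System.out.println("shifted keys:    " + (System.nanoTime() - start) / 1e9 + " s");

        start = System.nanoTime();
        Map<Long, Double> multiplied = new HashMap<Long, Double>(ELEMENTS);
        for (int i = 0; i < ELEMENTS; i++) {
            multiplied.put((long) rows[i] * SIZE + cols[i], 1.0);    //second key scheme
        }
        System.out.println("multiplied keys: " + (System.nanoTime() - start) / 1e9 + " s");
    }
}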