Why does having more '1' bits in my key make putting it into a HashMap take so much longer?
I'm doing a project for a class which focuses on storing a huge matrix with mostly 0 values in memory and performing some matrix math on it. My first thought was to use a HashMap to store the matrix elements, and only store the elements which are non-zero, in order to avoid using huge quantities of memory.
I wanted to make a key for the HashMap which would represent both the row and column number of the element in a way that, when I accessed that entry in the map, I could re-extract both values. I don't know Java as well as C# - in C# I would make a struct with Row and Column members, but in Java I quickly realized there are no user value types. With a deadline looming I went with a safe bet and made the key a long. I stored the row data (32-bit int) in the first 32 bits and the column data in the last 32 using some very simple bit shifting. [EDIT: I'd also like to note that my HashMap is initialized with a specific initial size which exactly represents the number of values I store in it, and which is never exceeded.]
[Side note: the reason I want to be able to extract the row/column data again is to greatly increase the efficiency of matrix multiplication, from O(n^2) to O(n), and a smaller n to boot.]
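For reference, a key packed this way can be unpacked again with a shift and a mask; a minimal sketch, with illustrative method names that are not from the original code:

public int getRow(long key) {
    return (int)(key >>> 32);        //the upper 32 bits hold the row
}

public int getColumn(long key) {
    return (int)(key & 0xFFFFFFFFL); //the lower 32 bits hold the column
}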
What I noticed after implementing this structure is that it takes a whopping 7 seconds to read a 23426 x 23426 matrix from a text file in which only non-zero elements are given, but it only takes 2 seconds to calculate the eigenvalues we are required to give! After selectively commenting out methods, I have concluded that the bulk of this 7-second timespan is spent storing my values in the HashMap.
public void Set(double value, int row, int column) {
    //assemble the long key, placing row and column in adjacent sets of bits
    long key = (long)row << SIZE_BIT_MAX; //(SIZE_BIT_MAX is 32)
    key += column;
    elements.put(key, value);
}
That is the code for setting a value. If I use this method instead:
public void Set(double value, int row, int column) {
    //create a distinct but smaller key (around 32 bits max)
    long key = (long)(row * matrixSize) + column;
    elements.put(key, value);
}
The reading only takes 2 seconds. Both of these versions of the key are distinct for every element, both are of type long, and the actual code to create either of them is minimal in complexity. It's the elements.put(key, value) call which makes the difference between 7 seconds and 2.

My question is, why? The difference I see between these key versions is that the first one has bits set to 1 throughout and more frequently, while the second has all of its highest 32 bits set to 0. Am I chasing a red herring, or is this fairly dramatic difference in performance the result of something internal to the HashMap.put method?
3 Answers
Take a look at how Long implements the hashCode() method (at least in OpenJDK 7).
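In essence (paraphrasing the OpenJDK source rather than quoting it verbatim), it XORs the two 32-bit halves of the value:

public int hashCode() {
    //fold the upper 32 bits onto the lower 32 bits
    return (int)(value ^ (value >>> 32));
}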
This means that your key gets stuffed back into 32 bits; the row and column bits cancel each other out quite often, resulting in a lot of collisions, which requires the HashMap to spend extra time looking for a free slot in a bucket. Your second method avoids that problem because every key's generated hash code is a unique value (because you only have 23426 x 23426 = 548777476 items, which fits well into 32 bits).

So, the reason is your key selection, not the number of set bits.
However, what exactly do you mean by "user value types"? A plain class with Row and Column fields can make a perfectly good key for a Map in Java once you implement hashCode() and equals(). Just make sure that you don't implement its hashCode method the way Long does. :)
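A minimal sketch of such a key class (the class name and the exact hash formula below are my own illustrative choices, not taken from the answer):

final class MatrixKey {
    private final int row;
    private final int column;

    MatrixKey(int row, int column) {
        this.row = row;
        this.column = column;
    }

    int getRow()    { return row; }
    int getColumn() { return column; }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof MatrixKey)) return false;
        MatrixKey other = (MatrixKey) o;
        return row == other.row && column == other.column;
    }

    @Override
    public int hashCode() {
        //mix the two fields rather than XORing them, so that
        //(row, column) and (column, row) do not collide
        return 31 * row + column;
    }
}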
From the JDK 6 documentation for Long.hashCode() (note that your long primitive is autoboxed into a Long object - whereas in C# primitives actually are objects), the hash code is the XOR of the upper and lower 32 bits of the value. I think that, given this definition, this explains why:
the collision rate is reduced when you introduce more entropy and thus disperse it more via the upper half of the long value. (edit: I read the order wrong, so here's the counter-argument below)

The collisions might be more likely when extending into the long range - after all, in Java, hash codes are only int-sized, so you can only have a limited amount of equal distribution. If you know it's "evenly" distributed over an int range then your collisions are reduced. If you spread that out across the long range, then it greatly increases your chance of collision.
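To make that concrete, here is a small self-contained check (my own illustration, not code from this answer): with the shifted key, Long's hash code reduces to row ^ col, so a 1000 x 1000 block of positions produces fewer than 1024 distinct hash codes, while the multiplied key hashes every position to a different value.

import java.util.HashSet;
import java.util.Set;

public class KeyHashDemo {
    public static void main(String[] args) {
        final int matrixSize = 23426;                //dimension used in the question
        Set<Integer> shiftedHashes = new HashSet<Integer>();
        Set<Integer> multipliedHashes = new HashSet<Integer>();
        for (int row = 0; row < 1000; row++) {
            for (int col = 0; col < 1000; col++) {
                long shiftedKey = ((long) row << 32) + col;          //first key scheme
                long multipliedKey = (long) row * matrixSize + col;  //second key scheme
                shiftedHashes.add(Long.valueOf(shiftedKey).hashCode());
                multipliedHashes.add(Long.valueOf(multipliedKey).hashCode());
            }
        }
        //1,000,000 positions were inserted into each set
        System.out.println("distinct hashes, shifted key:    " + shiftedHashes.size());
        System.out.println("distinct hashes, multiplied key: " + multipliedHashes.size());
    }
}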
Here's from the HashMap Java documentation (emphasis mine): an instance of HashMap has two parameters that affect its performance, the initial capacity and the load factor.

Side note: you'll find even greater performance gains by tuning the initial capacity and load factor - check the HashMap documentation for more information.
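As a sketch of that tuning (the element count handling below is an assumption, not a figure from the question): HashMap has a constructor taking an initial capacity and a load factor, and sizing the map so that entries / loadFactor never exceeds the capacity means it never has to rehash while the file is being read.

import java.util.HashMap;
import java.util.Map;

public class SparseMatrixStorage {
    //pre-size the map for an expected number of non-zero elements
    static Map<Long, Double> createElementMap(int expectedEntries) {
        float loadFactor = 0.75f;                                     //the default load factor
        int initialCapacity = (int) Math.ceil(expectedEntries / loadFactor);
        return new HashMap<Long, Double>(initialCapacity, loadFactor);
    }
}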
Depending on the implementation, you could be hitting hash collisions.
If all of your hash values end up in the same "bucket", the implementation will normally throw them onto a list of some type. If this is the case, your access times will suffer significantly.
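One rough way to confirm that the puts themselves are the bottleneck (my own sketch; the element count is arbitrary and the timings will vary by machine and JDK) is to time both key schemes over the same random positions:

import java.util.HashMap;
import java.util.Map;
import java.util.Random;

public class PutTimingDemo {
    static final int SIZE = 23426;        //matrix dimension from the question
    static final int ELEMENTS = 1000000;  //arbitrary number of non-zero entries

    public static void main(String[] args) {
        Random random = new Random(42);
        int[] rows = new int[ELEMENTS];
        int[] cols = new int[ELEMENTS];
        for (int i = 0; i < ELEMENTS; i++) {
            rows[i] = random.nextInt(SIZE);
            cols[i] = random.nextInt(SIZE);
        }

        long start = System.nanoTime();
        Map<Long, Double> shifted = new HashMap<Long, Double>(ELEMENTS);
        for (int i = 0; i < ELEMENTS; i++) {
            shifted.put(((long) rows[i] << 32) + cols[i], 1.0);      //first key scheme
        }
        System.out.println("shifted keys:    " + (System.nanoTime() - start) / 1e9 + " s");

        start = System.nanoTime();
        Map<Long, Double> multiplied = new HashMap<Long, Double>(ELEMENTS);
        for (int i = 0; i < ELEMENTS; i++) {
            multiplied.put((long) rows[i] * SIZE + cols[i], 1.0);    //second key scheme
        }
        System.out.println("multiplied keys: " + (System.nanoTime() - start) / 1e9 + " s");
    }
}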