为什么这些矩阵乘法的性能如此不同?

发布于 2024-09-29 15:18:01 字数 890 浏览 7 评论 0原文

我用 Java 编写了两个矩阵类,只是为了比较它们的矩阵乘法的性能。一个类 (Mat1) 存储一个 double[][] A 成员,其中矩阵的第 i 行是 A[i]。另一个类 (Mat2) 存储 AT,其中 TA 的转置。

假设我们有一个方阵 M,并且需要 M.mult(M) 的乘积。将产品命名为 P

当 M 是 Mat1 实例时,使用的算法是直接的:

P[i][j] += M.A[i][k] * M.A[k][j]
    for k in range(0, M.A.length)

在 M 是 Mat2 的情况下,我使用:

P[i][j] += M.A[i][k] * M.T[j][k]

这是相同的算法,因为 T[j][k]==A[k][j] 。在 1000x1000 矩阵上,第二个算法在我的机器上大约需要 1.2 秒,而第一个算法至少需要 25 秒。我原以为第二个会更快,但没想到这么快。问题是,为什么速度这么快?

我唯一的猜测是,第二个算法更好地利用了 CPU 缓存,因为数据以大于 1 个字的块的形式被拉入缓存,而第二个算法通过仅遍历行而受益,而第一个算法则忽略拉入的数据通过立即转到下面的行(内存中约 1000 个字,因为数组按行主要顺序存储)来进行缓存,没有缓存任何数据。

我问过某人,他认为这是因为更友好的内存访问模式(即第二个版本会导致更少的 TLB 软故障)。我根本没有想到这一点,但我可以看出它是如何减少 TLB 错误的。

那么,是哪一个呢?或者还有其他原因导致性能差异?

I wrote two matrix classes in Java just to compare the performance of their matrix multiplications. One class (Mat1) stores a double[][] A member where row i of the matrix is A[i]. The other class (Mat2) stores A and T where T is the transpose of A.

Let's say we have a square matrix M and we want the product of M.mult(M). Call the product P.

When M is a Mat1 instance the algorithm used was the straightforward one:

P[i][j] += M.A[i][k] * M.A[k][j]
    for k in range(0, M.A.length)

In the case where M is a Mat2 I used:

P[i][j] += M.A[i][k] * M.T[j][k]

which is the same algorithm because T[j][k]==A[k][j]. On 1000x1000 matrices the second algorithm takes about 1.2 seconds on my machine, while the first one takes at least 25 seconds. I was expecting the second one to be faster, but not by this much. The question is, why is it this much faster?

My only guess is that the second one makes better use of the CPU caches, since data is pulled into the caches in chunks larger than 1 word, and the second algorithm benefits from this by traversing only rows, while the first ignores the data pulled into the caches by going immediately to the row below (which is ~1000 words in memory, because arrays are stored in row major order), none of the data for which is cached.

I asked someone and he thought it was because of friendlier memory access patterns (i.e. that the second version would result in fewer TLB soft faults). I didn't think of this at all but I can sort of see how it results in fewer TLB faults.

So, which is it? Or is there some other reason for the performance difference?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

段念尘 2024-10-06 15:18:01

这是因为您的数据的位置。

在 RAM 中,矩阵虽然从您的角度来看是二维的,但它当然存储为连续的字节数组。与一维数组的唯一区别是偏移量是通过对您使用的两个索引进行插值来计算的。

这意味着,如果您访问位置 x,y 处的元素,它将计算 x*row_length + y ,这将是用于引用指定位置处的元素的偏移量。

发生的情况是,一个大矩阵不仅仅存储在内存页面中(这就是操作系统管理 RAM 的方式,通过将其分割成块),因此如果您尝试访问某个内存页面,它必须在 CPU 缓存内加载正确的页面。尚不存在的元素。

只要您连续进行乘法,就不会产生任何问题,因为您主要使用一页的所有系数,然后切换到下一页,但如果您反转索引,会发生的情况是每个元素都可能包含在不同的内存页面,因此每次它都需要向 RAM 请求不同的页面,这几乎适用于您执行的每一次乘法,这就是差异如此巧妙的原因。

(我相当简化了整个解释,只是为了让您了解这个问题的基本概念)

无论如何,我不认为这是由 JVM 本身引起的。它可能与您的操作系统如何管理 Java 进程的内存有关。

This because of locality of your data.

In RAM a matrix, although bidimensional from your point of view, it's of course stored as a contiguous array of bytes. The only difference from a 1D array is that the offset is calculated by interpolating both indices that you use.

This means that if you access element at position x,y it will calculate x*row_length + y and this will be the offset used to reference to the element at position specified.

What happens is that a big matrix isn't stored in just a page of memory (this is how you OS manages the RAM, by splitting it into chunks) so it has to load inside CPU cache the correct page if you try to access an element that is not already present.

As long as you go contiguously doing your multiplication you don't create any problems, since you mainly use all coefficients of a page and then switch to the next one but if you invert indices what happens is that every single element may be contained in a different memory page so everytime it needs to ask to RAM a different page, this almost for every single multiplication you do, this is why the difference is so neat.

(I rather simplified the whole explaination, it's just to give you the basic idea around this problem)

In any case I don't think this is caused by JVM by itself. It maybe related in how your OS manages the memory of the Java process..

泪眸﹌ 2024-10-06 15:18:01

缓存和 TLB 假设都是合理的,但我想查看基准测试的完整代码……而不仅仅是伪代码片段。

另一种可能性是,性能差异是由于您的应用程序在转置版本中为数据数组使用了 50% 以上的内存。如果您的 JVM 堆大小较小,则可能会导致 GC 运行过于频繁。这很可能是使用默认堆大小的结果。 (三批 1000 x 1000 x 8 字节约为 24Mb)

尝试将初始堆大小和最大堆大小设置为(例如)当前最大大小的两倍。如果这没有什么区别,那么这不是一个简单的堆大小问题。

The cache and TLB hypotheses are both reasonable, but I'd like to see the complete code of your benchmark ... not just pseudo-code snippets.

Another possibility is that performance difference is a result of your application using 50% more memory for the data arrays in the version with the transpose. If your JVM's heap size is small, it is possible that this is causing the GC to run too often. This could well be a result of using the default heap size. (Three lots of 1000 x 1000 x 8 bytes is ~24Mb)

Try setting the initial and max heap sizes to (say) double the current max size. If that makes no difference, then this is not a simple heap size issue.

吾性傲以野 2024-10-06 15:18:01

很容易猜测问题可能是局部性的,也许确实如此,但这仍然是一个猜测。

没有必要猜测。有两种技术可能会给您答案:单步执行和随机暂停。

如果您单步运行缓慢的代码,您可能会发现它做了很多您从未梦想过的事情。比如,你问?尝试一下就知道了。在机器语言级别上,您应该看到它所做的就是有效地单步执行内部循环,而没有浪费运动。

如果它实际上正在逐步执行内部循环而没有浪费运动,那么随机暂停将为您提供信息。由于慢速进程比快进程花费的时间长 20 倍,这意味着 95% 的时间它都在做一些不必要的事情。所以看看它是什么。每次你暂停它时,你有 95% 的机会会看到那是什么以及为什么。

如果在慢速情况下,它正在执行的指令看起来与快速情况一样高效,则缓存局部性是对其慢速原因的合理猜测。我确信,一旦您消除了可能发生的任何其他愚蠢行为,缓存位置将占据主导地位。

It's easy to guess that the problem might be locality, and maybe it is, but that's still a guess.

It's not necessary to guess. Two techniques might give you the answer - single stepping and random pausing.

If you single-step the slow code you might find out that it's doing a lot of stuff you never dreamed of. Such as, you ask? Try it and find out. What you should see it doing, at the machine-language level, is efficiently stepping through the inner loop with no waste motion.

If it actually is stepping through the inner loop with no waste motion, then random pausing will give you information. Since the slow one is taking 20 times longer than the fast one, that implies 95% of the time it is doing something it doesn't have to. So see what it is. Each time you pause it, the chance is 95% that you will see what that is, and why.

If in the slow case, the instructions it is executing appear just as efficient as the fast case, then cache locality is a reasonable guess of why it is slow. I'm sure, once you've eliminated any other silliness that may be going on, that cache locality will dominate.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文