Why is the performance of these matrix multiplications so different?

Posted 2024-09-29 15:18:01


I wrote two matrix classes in Java just to compare the performance of their matrix multiplications. One class (Mat1) stores a double[][] A member where row i of the matrix is A[i]. The other class (Mat2) stores A and T where T is the transpose of A.

Let's say we have a square matrix M and we want the product of M.mult(M). Call the product P.

When M is a Mat1 instance the algorithm used was the straightforward one:

P[i][j] += M.A[i][k] * M.A[k][j]
    for k in range(0, M.A.length)

In the case where M is a Mat2 I used:

P[i][j] += M.A[i][k] * M.T[j][k]

which is the same algorithm because T[j][k]==A[k][j]. On 1000x1000 matrices the second algorithm takes about 1.2 seconds on my machine, while the first one takes at least 25 seconds. I was expecting the second one to be faster, but not by this much. The question is, why is it this much faster?
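To pin the two variants down, here is a minimal compilable sketch of the loops described above (the class and method names are illustrative, not the question's actual code; the Mat1/Mat2 wrappers and the timing harness are omitted):

class MatMulSketch {
    // Mat1-style: the k loop walks a[k][j] down a column, crossing a
    // different row array on every step.
    static double[][] multRowCol(double[][] a) {
        int n = a.length;
        double[][] p = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    p[i][j] += a[i][k] * a[k][j];
        return p;
    }

    // Mat2-style: t is the transpose of a, so t[j][k] == a[k][j] and
    // both operands are read sequentially along rows.
    static double[][] multRowRow(double[][] a, double[][] t) {
        int n = a.length;
        double[][] p = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    p[i][j] += a[i][k] * t[j][k];
        return p;
    }
}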

My only guess is that the second one makes better use of the CPU caches, since data is pulled into the caches in chunks larger than one word. The second algorithm benefits from this by traversing only rows, while the first throws away the data just pulled into the caches by jumping immediately to the row below (~1000 words away in memory, because arrays are stored in row-major order), none of which is cached.
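Some rough numbers support this guess (assuming 8-byte doubles and 64-byte cache lines, a common size): one cache line holds 64 / 8 = 8 doubles, so the row-wise walk in the second version gets 8 useful elements per line filled, while the column-wise walk of A[k][j] in the first version jumps roughly 1000 x 8 bytes = 8 KB between consecutive accesses, landing on a different line every time and using at most 1 of the 8 doubles fetched.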

I asked someone and he thought it was because of friendlier memory access patterns (i.e. that the second version would result in fewer TLB soft faults). I didn't think of this at all but I can sort of see how it results in fewer TLB faults.
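Rough numbers fit this hypothesis too (assuming the usual 4 KB pages): a 1000-double row is about 8 KB, i.e. two pages, so one full column walk touches on the order of a thousand distinct pages, while a typical data TLB holds only tens to hundreds of entries. The column walk can therefore miss the TLB on nearly every access, whereas the row walk keeps reusing the mappings it already has.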

So, which is it? Or is there some other reason for the performance difference?


3 Answers

段念尘 2024-10-06 15:18:01


This is because of the locality of your data.

In RAM a matrix, although two-dimensional from your point of view, is of course stored as a contiguous array of bytes. The only difference from a 1D array is that the offset is computed by combining the two indices that you use.

This means that if you access the element at position x,y it will compute x*row_length + y, and this will be the offset used to reference the element at the specified position.
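As a concrete illustration of that model, here is a minimal sketch of an n x n matrix kept in one flat array (illustrative names; note that a Java double[][] is actually an array of separate row objects, so this contiguous layout is a simplification of what the question's code does):

class FlatMatrix {
    final int n;
    final double[] a;            // n * n doubles in one contiguous block

    FlatMatrix(int n) {
        this.n = n;
        this.a = new double[n * n];
    }

    // Element (x, y) lives at offset x * n + y: stepping y moves 8 bytes,
    // stepping x moves n * 8 bytes.
    double get(int x, int y) {
        return a[x * n + y];
    }
}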

What happens is that a big matrix isn't stored in just one page of memory (that is how your OS manages RAM, by splitting it into chunks), so if you try to access an element that is not already present, the correct page has to be loaded into the CPU cache.

As long as you walk through the multiplication contiguously you don't create any problems, since you mostly use all the coefficients of one page and then switch to the next, but if you invert the indices every single element may be contained in a different memory page, so it has to ask RAM for a different page almost for every single multiplication you do. This is why the difference is so pronounced.

(I have rather simplified the whole explanation, just to give you the basic idea of the problem.)

In any case I don't think this is caused by the JVM itself. It may be related to how your OS manages the memory of the Java process.

泪眸﹌ 2024-10-06 15:18:01


The cache and TLB hypotheses are both reasonable, but I'd like to see the complete code of your benchmark ... not just pseudo-code snippets.

Another possibility is that the performance difference is a result of your application using 50% more memory for the data arrays in the version with the transpose. If your JVM's heap size is small, it is possible that this is causing the GC to run too often. This could well be a result of using the default heap size. (Three lots of 1000 x 1000 x 8 bytes is ~24 MB.)

Try setting the initial and max heap sizes to (say) double the current max size. If that makes no difference, then this is not a simple heap size issue.
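For instance, the standard -Xms and -Xmx JVM flags set the initial and maximum heap sizes (the class name and the 64m value here are just illustrative, roughly double the ~24 MB of matrix data estimated above):

java -Xms64m -Xmx64m MatrixBenchmark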

吾性傲以野 2024-10-06 15:18:01

很容易猜测问题可能是局部性的,也许确实如此,但这仍然是一个猜测。

没有必要猜测。有两种技术可能会给您答案:单步执行和随机暂停。

如果您单步运行缓慢的代码,您可能会发现它做了很多您从未梦想过的事情。比如,你问?尝试一下就知道了。在机器语言级别上,您应该看到它所做的就是有效地单步执行内部循环,而没有浪费运动。

如果它实际上正在逐步执行内部循环而没有浪费运动,那么随机暂停将为您提供信息。由于慢速进程比快进程花费的时间长 20 倍,这意味着 95% 的时间它都在做一些不必要的事情。所以看看它是什么。每次你暂停它时,你有 95% 的机会会看到那是什么以及为什么。

如果在慢速情况下,它正在执行的指令看起来与快速情况一样高效,则缓存局部性是对其慢速原因的合理猜测。我确信,一旦您消除了可能发生的任何其他愚蠢行为,缓存位置将占据主导地位。

It's easy to guess that the problem might be locality, and maybe it is, but that's still a guess.

It's not necessary to guess. Two techniques might give you the answer - single stepping and random pausing.

If you single-step the slow code you might find out that it's doing a lot of stuff you never dreamed of. Such as, you ask? Try it and find out. What you should see it doing, at the machine-language level, is efficiently stepping through the inner loop with no waste motion.

If it actually is stepping through the inner loop with no waste motion, then random pausing will give you information. Since the slow one is taking 20 times longer than the fast one, that implies 95% of the time it is doing something it doesn't have to. So see what it is. Each time you pause it, the chance is 95% that you will see what that is, and why.

If in the slow case, the instructions it is executing appear just as efficient as the fast case, then cache locality is a reasonable guess of why it is slow. I'm sure, once you've eliminated any other silliness that may be going on, that cache locality will dominate.
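One way to take such random pauses on a JVM (my suggestion, not something this answer prescribes) is to snapshot the thread stacks a few times with the JDK's standard tools and see where the slow loop keeps landing:

jps                # list running JVM process ids, e.g. "12345 MatrixBenchmark"
jstack 12345       # dump all thread stacks; repeat a handful of times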
