Why does the order of the loops in a matrix multiply algorithm affect performance?

Posted 2024-12-04 12:08:27


I am given two functions for finding the product of two matrices:

 void MultiplyMatrices_1(int **a, int **b, int **c, int n){
     for (int i = 0; i < n; i++)
         for (int j = 0; j < n; j++)
             for (int k = 0; k < n; k++)
                 c[i][j] = c[i][j] + a[i][k]*b[k][j];
 }

 void MultiplyMatrices_2(int **a, int **b, int **c, int n){
     for (int i = 0; i < n; i++)
         for (int k = 0; k < n; k++)
             for (int j = 0; j < n; j++)
                 c[i][j] = c[i][j] + a[i][k]*b[k][j];
 }

I ran and profiled two executables using gprof, each with identical code except for this function. The second of these is significantly (about 5 times) faster for matrices of size 2048 x 2048. Any ideas as to why?


Comments (4)

我乃一代侩神 2024-12-11 12:08:27


I believe that what you're looking at is the effects of locality of reference in the computer's memory hierarchy.

Typically, computer memory is segregated into different types that have different performance characteristics (this is often called the memory hierarchy). The fastest memory is in the processor's registers, which can (usually) be accessed and read in a single clock cycle. However, there are usually only a handful of these registers (usually no more than 1KB). The computer's main memory, on the other hand, is huge (say, 8GB), but is much slower to access. In order to improve performance, the computer is usually physically constructed to have several levels of caches in-between the processor and main memory. These caches are slower than registers but much faster than main memory, so if you do a memory access that looks something up in the cache it tends to be a lot faster than if you have to go to main memory (typically, between 5-25x faster). When accessing memory, the processor first checks the memory cache for that value before going back to main memory to read the value in. If you consistently access values in the cache, you will end up with much better performance than if you're skipping around memory, randomly accessing values.

Most programs are written in a way where if a single byte in memory is read, the program later reads multiple different values from around that memory region as well. Consequently, these caches are typically designed so that when you read a single value from memory, a block of memory (usually somewhere between 1KB and 1MB) of values around that single value is also pulled into the cache. That way, if your program reads the nearby values, they're already in the cache and you don't have to go to main memory.

Now, one last detail - in C/C++, arrays are stored in row-major order, which means that all of the values in a single row of a matrix are stored next to each other. Thus in memory the array looks like the first row, then the second row, then the third row, etc.

Given this, let's look at your code. The first version looks like this:

  for (int i = 0; i < n; i++)
      for (int j = 0; j < n; j++)
          for (int k = 0; k < n; k++)
              c[i][j] = c[i][j] + a[i][k]*b[k][j];

Now, let's look at that innermost line of code. On each iteration, the value of k is increasing. This means that when running the innermost loop, each iteration of the loop is likely to have a cache miss when loading the value of b[k][j]. The reason for this is that because the matrix is stored in row-major order, each time you increment k, you're skipping over an entire row of the matrix and jumping much further into memory, possibly far past the values you've cached. However, you don't have a miss when looking up c[i][j] (since i and j don't change in the inner loop), nor will you probably miss a[i][k], because the values are in row-major order and if the value of a[i][k] is cached from the previous iteration, the value of a[i][k] read on this iteration is from an adjacent memory location. Consequently, on each iteration of the innermost loop, you are likely to have one cache miss.

But consider this second version:

  for (int i = 0; i < n; i++)
      for (int k = 0; k < n; k++)
          for (int j = 0; j < n; j++)
              c[i][j] = c[i][j] + a[i][k]*b[k][j];

Now, since you're increasing j on each iteration, let's think about how many cache misses you'll likely have on the innermost statement. Because the values are in row-major order, the value of c[i][j] is likely to be in-cache, because the value of c[i][j] from the previous iteration is likely cached as well and ready to be read. Similarly, b[k][j] is probably cached, and since i and k aren't changing, chances are a[i][k] is cached as well. This means that on each iteration of the inner loop, you're likely to have no cache misses.

Overall, this means that the second version of the code is unlikely to have cache misses on each iteration of the loop, while the first version almost certainly will. Consequently, the second loop is likely to be faster than the first, as you've seen.

Interestingly, many compilers are starting to have prototype support for detecting that the second version of the code is faster than the first. Some will try to automatically rewrite the code to maximize parallelism. If you have a copy of the Purple Dragon Book, Chapter 11 discusses how these compilers work.

Additionally, you can optimize the performance of this loop even further using more complex loops. A technique called blocking, for example, can be used to notably increase performance by splitting the array into subregions that can be held in cache longer, then using multiple operations on these blocks to compute the overall result.

Hope this helps!

以歌曲疗慰 2024-12-11 12:08:27


This may well be the memory locality. When you reorder the loop, the memory that's needed in the inner-most loop is nearer and can be cached, while in the inefficient version you need to access memory from the entire data set.

The way to test this hypothesis is to run a cache debugger (like cachegrind) on the two pieces of code and see how many cache misses they incur.

小矜持 2024-12-11 12:08:27


Apart from locality of memory there is also compiler optimisation. A key one for vector and matrix operations is loop unrolling.

for (int k = 0; k < n; k++)
   c[i][j] = c[i][j] + a[i][k]*b[k][j];

You can see that in this inner loop, i and j do not change. This means it can be rewritten as

  for (int k = 0; k < n; k += 4) {
      int *aik = &a[i][k];
      c[i][j] += aik[0]*b[k][j]
               + aik[1]*b[k+1][j]
               + aik[2]*b[k+2][j]
               + aik[3]*b[k+3][j];
  }

You can see that:

  • there are four times fewer loop iterations and accesses to c[i][j]
  • a[i][k] is accessed contiguously in memory
  • the memory accesses and multiplies can be pipelined (almost concurrently) in the CPU.

What if n is not a multiple of 4, 6, or 8 (or whatever factor the compiler decides to unroll by)? The compiler handles this tidy-up for you. ;)

To speed this solution up further, you could try transposing the b matrix first. This is a little extra work and coding, but it means that accesses to the transposed b are also contiguous in memory (as you are swapping [k] with [j]).

Another thing you can do to improve performance is to multi-thread the multiplication. This can improve performance by a factor of 3 on a 4 core CPU.

Lastly, you might consider using float or double. You might think int would be faster, but that is not always the case, as floating-point operations can be more heavily optimised (both in hardware and in the compiler).

In the second example, c[i][j] changes on each iteration of the inner loop, which makes it harder to optimise in this way.

ゝ偶尔ゞ 2024-12-11 12:08:27


Probably the slower one has to skip around in memory more to access the array elements. It might be something else, too -- you could check the compiled code to see what is actually happening.
