Matrix multiplication: small difference in matrix size, large difference in timing

Published 2024-12-11 18:52:42

I have a matrix multiply code that looks like this:

for(i = 0; i < dimension; i++)
    for(j = 0; j < dimension; j++)
        for(k = 0; k < dimension; k++)
            C[dimension*i+j] += A[dimension*i+k] * B[dimension*k+j];

Here, the size of the matrix is represented by dimension.
Now, if the size of the matrices is 2000, it takes 147 seconds to run this piece of code, whereas if the size of the matrices is 2048, it takes 447 seconds. So while the difference in no. of multiplications is (2048*2048*2048)/(2000*2000*2000) = 1.073, the difference in the timings is 447/147 = 3. Can someone explain why this happens? I expected it to scale linearly, which does not happen. I am not trying to make the fastest matrix multiply code, simply trying to understand why it happens.

Specs: AMD Opteron dual core node (2.2GHz), 2G RAM, gcc v 4.5.0

Program compiled as gcc -O3 simple.c

I have run this on Intel's icc compiler as well, and seen similar results.

EDIT:

As suggested in the comments/answers, I ran the code with dimension=2060 and it takes 145 seconds.

Heres the complete program:

#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>

/* change dimension size as needed */
const int dimension = 2048;
struct timeval tv; 

double timestamp()
{
        double t;
        gettimeofday(&tv, NULL);
        t = tv.tv_sec + (tv.tv_usec/1000000.0);
        return t;
}

int main(int argc, char *argv[])
{
        int i, j, k;
        double *A, *B, *C, start, end;

        A = (double*)malloc(dimension*dimension*sizeof(double));
        B = (double*)malloc(dimension*dimension*sizeof(double));
        C = (double*)malloc(dimension*dimension*sizeof(double));

        srand(292);

        for(i = 0; i < dimension; i++)
                for(j = 0; j < dimension; j++)
                {   
                        A[dimension*i+j] = (rand()/(RAND_MAX + 1.0));
                        B[dimension*i+j] = (rand()/(RAND_MAX + 1.0));
                        C[dimension*i+j] = 0.0;
                }   

        start = timestamp();
        for(i = 0; i < dimension; i++)
                for(j = 0; j < dimension; j++)
                        for(k = 0; k < dimension; k++)
                                C[dimension*i+j] += A[dimension*i+k] *
                                        B[dimension*k+j];

        end = timestamp();
        printf("\nsecs:%f\n", end-start);

        free(A);
        free(B);
        free(C);

        return 0;
}


Comments (5)

寒尘 2024-12-18 18:52:42

Here's my wild guess: cache

It could be that you can fit 2 rows of 2000 doubles into the cache, which is slightly less than the 32 KB L1 cache (while leaving room for other necessary things).

But when you bump it up to 2048, it uses the entire cache (and you spill some because you need room for other things)

Assuming the cache policy is LRU, spilling the cache just a tiny bit will cause the entire row to be repeatedly flushed and reloaded into the L1 cache.

The other possibility is cache associativity due to the power-of-two. Though I think that processor is 2-way L1 associative so I don't think it matters in this case. (but I'll throw the idea out there anyway)

Possible Explanation 2: Conflict cache misses due to super-alignment on the L2 cache.

Your B array is being iterated on the column. So the access is strided. Your total data size is 2k x 2k which is about 32 MB per matrix. That's much larger than your L2 cache.

When the data is not aligned perfectly, you will have decent spatial locality on B. Although you are hopping rows and only using one element per cacheline, the cacheline stays in the L2 cache to be reused by the next iteration of the middle loop.

However, when the data is aligned perfectly (2048), these hops will all land on the same "cache way" and will far exceed your L2 cache associativity. Therefore, the accessed cache lines of B will not stay in cache for the next iteration. Instead, they will need to be pulled in all the way from ram.
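
A minimal sketch of the padding idea, in the spirit of this answer (the pad of 8 doubles and the helper function are my own illustration, not part of the original program):

#include <stdlib.h>

/* Allocate a dimension x dimension matrix with a padded leading dimension
 * so that consecutive rows are no longer an exact power-of-two number of
 * bytes apart. The pad size (8 doubles here) is an arbitrary choice. */
double *alloc_padded(int dimension, int *ld)
{
        *ld = dimension + 8;    /* padded leading dimension */
        return (double*)malloc((size_t)(*ld) * dimension * sizeof(double));
}

/* The multiply then indexes with ld instead of dimension, e.g.
 *     C[ld*i+j] += A[ld*i+k] * B[ld*k+j];
 * so the column-wise walk over B advances by ld*8 bytes per step and no
 * longer keeps hitting the same cache sets. */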

离去的眼神 2024-12-18 18:52:42

You are definitely getting what I call a cache resonance. This is similar to aliasing, but not exactly the same. Let me explain.

Caches are hardware data structures that extract one part of the address and use it as an index in a table, not unlike an array in software. (In fact, we call them arrays in hardware.) The cache array contains cache lines of data, and tags - sometimes one such entry per index in the array (direct mapped), sometimes several such (N-way set associativity). A second part of the address is extracted and compared to the tag stored in the array. Together, the index and tag uniquely identify a cache line memory address. Finally, the rest of the address bits identifies which bytes in the cache line are addressed, along with the size of the access.

Usually the index and tag are simple bitfields. So a memory address looks like

  ...Tag... | ...Index... | Offset_within_Cache_Line

(Sometimes the index and tag are hashes, e.g. a few XORs of other bits into the mid-range bits that are the index. Much more rarely, sometimes the index, and more rarely the tag, are things like taking cache line address modulo a prime number. These more complicated index calculations are attempts to combat the problem of resonance, which I explain here. All suffer some form of resonance, but the simplest bitfield extraction schemes suffer resonance on common access patterns, as you have found.)

So, typical values... there are many different models of "Opteron Dual Core", and I do not see anything here that specifies which one you have. Choosing one at random, I will use the most recent manual I see on AMD's website: the BIOS and Kernel Developer's Guide (BKDG) for AMD Family 15h Models 00h-0Fh, March 12, 2012.

(Family 15h = Bulldozer family, the most recent high-end processor line - the BKDG mentions dual core, although I don't know of a product number that exactly matches what you describe. But, anyway, the same idea of resonance applies to all processors; it is just that parameters like cache size and associativity may vary a bit.)

From p.33:

The AMD Family 15h processor contains a 16-Kbyte, 4-way predicted L1
data cache with two 128-bit ports. This is a write-through cache that
supports up to two 128 Byte loads per cycle. It is divided into 16
banks, each 16 bytes wide. [...] Only one load can be performed from a
given bank of the L1 cache in a single cycle.

To sum up:

  • 64 byte cache line => 6 offset bits within cache line

  • 16KB/4-way => the resonance is 4KB.

    I.e. address bits 0-5 are the cache line offset.

  • 16KB / 64B cache lines => 2^14/2^6 = 2^8 = 256 cache lines in the cache.
    (Bugfix: I originally miscalculated this as 128; I have fixed all the numbers that depended on it.)

  • 4 way associative => 256/4 = 64 indexes in the cache array. I (Intel) call these "sets".

    i.e. you can consider the cache to be an array of 64 entries or sets, each entry containing 4 cache lines and their tags. (It's more complicated than this, but that's okay.)

(By the way, the terms "set" and "way" have varying definitions.)

  • there are 6 index bits, bits 6-11 in the simplest scheme.

    This means that any cache lines that have exactly the same values in the index bits, bits 6-11, will map to the same set of the cache.

Now look at your program.

C[dimension*i+j] += A[dimension*i+k] * B[dimension*k+j];

Loop k is the innermost loop. The base type is double, 8 bytes. If dimension=2048, i.e. 2K, then successive elements of B[dimension*k+j] accessed by the loop will be 2048 * 8 = 16K bytes apart. They will all map to the same set of the L1 cache - they will all have the same index in the cache. Which means that, instead of there being 256 cache lines in the cache available for use there will only be 4 - the "4-way associativity" of the cache.

I.e. you will probably get a cache miss every 4 iterations around this loop. Not good.
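
To make the collision concrete, here is a toy set-index calculation using the numbers derived above (64-byte lines, 64 sets); the base address is arbitrary and the whole thing is only an illustration of the bitfield scheme:

#include <stdint.h>
#include <stdio.h>

#define LINE_BYTES 64u   /* cache line size from the figures above      */
#define NUM_SETS   64u   /* 16 KB / 64 B lines / 4 ways                 */

/* Index bits are bits 6-11 of the address in the simple bitfield scheme. */
static unsigned set_index(uintptr_t addr)
{
        return (unsigned)((addr / LINE_BYTES) % NUM_SETS);
}

int main(void)
{
        uintptr_t a = 0x100000;        /* arbitrary example address      */
        uintptr_t stride = 2048 * 8;   /* B's column stride: 16 KB       */
        printf("set(a)      = %u\n", set_index(a));
        printf("set(a+16KB) = %u\n", set_index(a + stride)); /* same set */
        return 0;
}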

(Actually, things are a little more complicated. But the above is a good first understanding. The addresses of the entries of B mentioned above are virtual addresses, so the physical addresses might be slightly different. Moreover, Bulldozer has a way-predictive cache, probably using virtual address bits so that it doesn't have to wait for a virtual-to-physical address translation. But, in any case: your code has a "resonance" of 16K. The L1 data cache has a resonance of 16K. Not good.)

If you change the dimension just a little bit, e.g. to 2048+1, then the addresses of array B will be spread across all of the sets of the cache. And you will get significantly fewer cache misses.

It is a fairly common optimization to pad your arrays, e.g. to change 2048 to 2049, to avoid this sort of resonance. But cache blocking is an even more important optimization: http://suif.stanford.edu/papers/lam-asplos91.pdf
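
For reference, a minimal cache-blocking sketch (the tile size BS and the function below are my own illustration, not from this answer; something around 32-64 keeps a BS x BS tile of each matrix comfortably inside L1):

#define BS 64   /* tile size: a tuning guess, not a measured optimum */

void matmul_blocked(const double *A, const double *B, double *C, int dimension)
{
        int ii, kk, jj, i, j, k, imax, kmax, jmax;
        double a;

        for (ii = 0; ii < dimension; ii += BS)
          for (kk = 0; kk < dimension; kk += BS)
            for (jj = 0; jj < dimension; jj += BS)
            {
                imax = ii + BS < dimension ? ii + BS : dimension;
                kmax = kk + BS < dimension ? kk + BS : dimension;
                jmax = jj + BS < dimension ? jj + BS : dimension;
                for (i = ii; i < imax; i++)
                    for (k = kk; k < kmax; k++)
                    {
                        a = A[dimension*i + k];   /* reused across the j loop */
                        for (j = jj; j < jmax; j++)
                            C[dimension*i + j] += a * B[dimension*k + j];
                    }
            }
}

As in the original program, C is assumed to be zero-initialized before the call; the blocked loops compute exactly the same sums, just in a tile-friendly order.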


In addition to the cache line resonance, there are other things going on here. For example, the L1 cache has 16 banks, each 16 bytes wide. With dimension = 2048, successive B accesses in the inner loop will always go to the same bank. So they can't go in parallel - and if the A access happens to go to the same bank, you will lose.

I don't think, looking at it, that this is as big as the cache resonance.

And, yes, possibly, there may be aliasing going on. E.g. the STLF (Store To Load Forwarding) buffers may be comparing only a small bitfield, and getting false matches.

(Actually, if you think about it, resonance in the cache is like aliasing, related to the use of bitfields. Resonance is caused by multiple cache lines mapping to the same set rather than being spread around. Aliasing is caused by matching based on incomplete address bits.)


Overall, my recommendation for tuning:

  1. Try cache blocking without any further analysis. I say this because cache blocking is easy, and it is very likely that this is all you would need to do.

  2. After that, use VTune or OProf. Or Cachegrind. Or ...

  3. Better yet, use a well tuned library routine to do matrix multiply.

朕就是辣么酷 2024-12-18 18:52:42

There are several possible explanations. One likely explanation is what Mysticial suggests: exhaustion of a limited resource (either cache or TLB). Another likely possibility is a false aliasing stall, which can occur when consecutive memory accesses are separated by a multiple of some power-of-two (often 4KB).

You can start to narrow down what's at work by plotting time/dimension^3 for a range of values. If you have blown a cache or exhausted TLB reach, you will see a more-or-less flat section followed by a sharp rise between 2000 and 2048, followed by another flat section. If you are seeing aliasing-related stalls, you will see a more-or-less flat graph with a narrow spike upward at 2048.

Of course, this has diagnostic power, but it is not conclusive. If you want to conclusively know what the source of the slowdown is, you will want to learn about performance counters, which can answer this sort of question definitively.
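
One rough way to generate the data for such a plot, reusing the structure of the program in the question (the particular set of dimensions is arbitrary, and this is an illustration rather than a careful benchmark):

#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>

static double now(void)
{
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1000000.0;
}

/* Time one naive multiply of dim x dim matrices and return the seconds. */
static double run_once(int dim)
{
        double *A = (double*)malloc((size_t)dim * dim * sizeof(double));
        double *B = (double*)malloc((size_t)dim * dim * sizeof(double));
        double *C = (double*)calloc((size_t)dim * dim, sizeof(double));
        double t;
        int i, j, k;

        for (i = 0; i < dim * dim; i++) { A[i] = 1.0; B[i] = 2.0; }

        t = now();
        for (i = 0; i < dim; i++)
                for (j = 0; j < dim; j++)
                        for (k = 0; k < dim; k++)
                                C[dim*i+j] += A[dim*i+k] * B[dim*k+j];
        t = now() - t;

        free(A); free(B); free(C);
        return t;
}

int main(void)
{
        int dims[] = { 1900, 1950, 2000, 2040, 2048, 2056, 2100 };
        int i, n = sizeof(dims) / sizeof(dims[0]);
        double t;

        for (i = 0; i < n; i++)
        {
                t = run_once(dims[i]);
                /* print dimension, seconds, and seconds / dimension^3 */
                printf("%d %.2f %.3e\n", dims[i], t,
                       t / ((double)dims[i] * dims[i] * dims[i]));
        }
        return 0;
}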

心病无药医 2024-12-18 18:52:42

I know this is waaaay too old, but I'll bite. It's (as has been said) a cache issue that causes the slowdown around powers of two. But there is another problem with this code: it's too slow. Look at your compute loop:

for(i = 0; i < dimension; i++)
    for(j = 0; j < dimension; j++)
        for(k = 0; k < dimension; k++)
            C[dimension*i+j] += A[dimension*i+k] * B[dimension*k+j];

The inner-most loop changes k by 1 each iteration, meaning you access the element just 1 double away from the last element of A you used, but a whole 'dimension' doubles away from the last element of B you used. This doesn't take any advantage of the caching of the elements of B.

If you change this to:

for(i = 0; i < dimension; i++)
    for(j = 0; j < dimension; j++)
        for(k = 0; k < dimension; k++)
            C[dimension*i+k] += A[dimension*i+j] * B[dimension*j+k];

You get the exact same results (modulo double addition associativity errors), but it's a whole lot more cache-friendly (local). I tried it and it gives substantial improvements. This can be summarized as

Don't multiply matrices by definition, but rather, by rows


Example of the speed-up (I changed your code to take the dimension as an argument)

$ diff a.c b.c
42c42
<               C[dimension*i+j] += A[dimension*i+k] * B[dimension*k+j];
---
>               C[dimension*i+k] += A[dimension*i+j] * B[dimension*j+k];
$ make a
cc     a.c   -o a
$ make b
cc     b.c   -o b
$ ./a 1024

secs:88.732918
$ ./b 1024

secs:12.116630

As a bonus (and what makes this related to this question), this loop doesn't suffer from the previous problem.

If you already knew all of this, then I apologize!

洒一地阳光 2024-12-18 18:52:42

A couple of answers mentioned L2 Cache problems.

You can actually verify this with a cache simulation.
Valgrind's cachegrind tool can do that.

valgrind --tool=cachegrind --cache-sim=yes your_executable

Set the command line parameters so they match your CPU's L2 parameters.

Test it with different matrix sizes, and you'll probably see a sudden increase in the L2 miss ratio.
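
For example, something along these lines (the cache figures here are placeholders, not the poster's actual Opteron parameters; newer Valgrind versions name the last-level cache LL rather than L2):

valgrind --tool=cachegrind --cache-sim=yes \
         --D1=65536,2,64 --LL=1048576,16,64 ./a.out

cg_annotate on the resulting cachegrind.out file then breaks the misses down per source line.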
