优化“for 循环”在 C99 中具有不同索引的数组上

发布于 2024-09-13 12:42:33 字数 555 浏览 8 评论 0 原文

我想加速 C99 中的数组乘法。

这是原来的 for 循环:

for(int i=0;i<n;i++) {
        for(int j=0;j<m;j++) {
            total[j]+= w[j][i] * x[i];
        }
    }

我的老板让我尝试这个,但它并没有提高速度:

for(int i=0;i<n;i++) {
        float value = x[i];
        for(int j=0;j<m;j++) {
            total[j]+= w[j][i] * value;
        }
    }

你对如何加速这些 for 循环有其他想法(除了我已经使用的 openmp )吗? 我正在使用:

gcc -DMNIST=1 -O3 -fno-strict-aliasing -std=c99 -lm -D_GNU_SOURCE -Wall -pedantic -fopenmp

谢谢!

I want to speed up an array multiplication in C99.

This is the original for loops:

for(int i=0;i<n;i++) {
        for(int j=0;j<m;j++) {
            total[j]+= w[j][i] * x[i];
        }
    }

My boss asked my to try this, but it did not improve the speed:

for(int i=0;i<n;i++) {
        float value = x[i];
        for(int j=0;j<m;j++) {
            total[j]+= w[j][i] * value;
        }
    }

Have you other ideas (except for openmp, which I already use) on how I could speed up these for-loops?
I am using:

gcc -DMNIST=1 -O3 -fno-strict-aliasing -std=c99 -lm -D_GNU_SOURCE -Wall -pedantic -fopenmp

Thanks!

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

夜清冷一曲。 2024-09-20 12:42:42

如果您知道 xtotalw 不会互相别名,那么通过重新排列循环索引并避免每次通过循环写入 total[j]

for(int j=0;j<m;j++) {
    const float * const w_j = w[j];      
    float total_j = 0;
    for(int i=0;i<n;i++)
        total_j += w_j[i] * x[i];
    total[j] += total_j;
}

但是,BLAS 是正确的答案,大多数情况下对于此类事情。最佳解决方案将取决于 nm、预取时间、管道深度、循环展开、缓存行的大小等。您可能不想这样做其他人在幕后所做的优化水平。

If you know that x, total and w do not alias each other you can get a fairly measurable boost by rearranging the loop indices and avoiding the write to total[j] each time through the loop:

for(int j=0;j<m;j++) {
    const float * const w_j = w[j];      
    float total_j = 0;
    for(int i=0;i<n;i++)
        total_j += w_j[i] * x[i];
    total[j] += total_j;
}

However, BLAS is the right answer, most of the time for this sort of thing. The best solution will depend on n, m, prefetch times, pipeline depths, loop unrolling, the size of your cache lines, etc. You probably don't want to do the level of optimization that it other people have done under the covers.

本王不退位尔等都是臣 2024-09-20 12:42:41

如果这真的很重要:

  1. 链接到已调整的 CBLAS 库。有很多可供选择,有些是免费的,有些是商业的。一些平台的系统上已经有一个。
  2. 将您的代码替换为对 cblas_dgemv 的调用。

这是一个非常容易理解的问题,许多聪明人都为此编写了高度调整的库。使用其中之一。

If this really matters:

  1. Link against a tuned CBLAS library. There are lots to choose from, some free and some commercial. Some platforms already have one on the system.
  2. Replace your code with a call to cblas_dgemv.

This is an extraordinarily well understood problem, and many smart people have written highly-tuned libraries for it. Use one of them.

萌化 2024-09-20 12:42:40

现在,每两个连续的内部操作(即total[j]+= w[j][i] * x[i])写入不同的位置并从远处的位置读取。您可以通过本地化读取和写入(从而更多地访问内部缓存)来获得一些性能 - 例如,通过切换 j 循环和 i 循环,以便j 循环是外部循环,i 循环是内部循环。

这样您就可以本地化读取和写入:

  • 所有i的内存写入将位于同一位置。
  • w[j][i]x[i] 的内存读取将按顺序进行。

总结:

for(int j=0;j<m;j++) {
    for(int i=0;i<n;i++) {
        total[j]+= w[j][i] * x[i];
    }
}

Right now, each two consecutive internal operations (i.e. total[j]+= w[j][i] * x[i]) write to different locations and read from distant locations. You can possibly gain some performance by localizing reads and writes (thus, hitting more the internal cache) - for example, by switching the j loop and the i loop, so that the j loop is the external and the i loop is the internal.

This way you'll be localizing both the reads and the writes:

  • Memory writes will be to the same place for all is.
  • Memory reads will be sequential for w[j][i] and x[i].

To sum up:

for(int j=0;j<m;j++) {
    for(int i=0;i<n;i++) {
        total[j]+= w[j][i] * x[i];
    }
}
独孤求败 2024-09-20 12:42:39

其中一种理论是,测试 0 比测试 j 更快。因此,通过从 j=m while j>0 循环,理论上每个循环可以节省一些纳秒。然而,根据最近的经验,这对我来说没有任何区别,所以我认为这不适用于当前的CPU。

另一个问题是内存布局:如果您的内部循环访问的内存块不是分散的,而是连续的,那么您很可能会从 CPU 中可用的最低缓存中获得更多好处。

在当前示例中,将 w 的布局从 w[j][i] 切换为 w[i][j] 可能会有所帮助。在 4 或 8 字节边界上对齐您的值也会有所帮助(但您会发现您的数组已经是这种情况)

另一个是循环展开,这意味着您以 4 字节为单位进行内部循环。因此,如果循环完成,则评估次数必须减少 4 次。最佳值必须根据经验确定,并且还可能取决于当前的问题(例如,如果您知道循环次数是 5 次的倍数,则使用 5)

One of the theories is that testing for zero is faster than testing for j<m. So by looping from j=m while j>0, in theory you could save some nanoseconds per loop. However in recent experience this has made not a single difference to me, so I think this doesn't hold for current cpu's.

Another issue is memory layout: if your inner loop accesses a chunk of memory that isn't spread out, but continuous, chances are you have more benefit of the lowest cache available in your CPU.

In your current example, switching the layout of w from w[j][i] to w[i][j] may therefore help. Aligning your values on 4 or 8 bytes boundaries will help as well (but you will find that this is already the case for your arrays)

Another one is loop-unrolling, meaning that you do your inner loop in chunks of, say, 4. So the evaluation if the loop is done, has to be done 4 times less. The optimum value must be determined emperically, and may also depend on the problem at hand (e.g. if you know you're looping a multiple of 5 times, use 5)

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文