当前位置：文江博客话题详情

优化“for 循环”在 C99 中具有不同索引的数组上

发布于 2024-09-13 12:42:33 字数 555 浏览 8 评论 0 原文

我想加速 C99 中的数组乘法。

这是原来的 for 循环：

for(int i=0;i<n;i++) {
        for(int j=0;j<m;j++) {
            total[j]+= w[j][i] * x[i];
        }
    }

我的老板让我尝试这个，但它并没有提高速度：

for(int i=0;i<n;i++) {
        float value = x[i];
        for(int j=0;j<m;j++) {
            total[j]+= w[j][i] * value;
        }
    }

你对如何加速这些 for 循环有其他想法（除了我已经使用的 openmp ）吗？我正在使用：

gcc -DMNIST=1 -O3 -fno-strict-aliasing -std=c99 -lm -D_GNU_SOURCE -Wall -pedantic -fopenmp

谢谢！

原文

I want to speed up an array multiplication in C99.

This is the original for loops:

for(int i=0;i<n;i++) {
        for(int j=0;j<m;j++) {
            total[j]+= w[j][i] * x[i];
        }
    }

My boss asked my to try this, but it did not improve the speed:

for(int i=0;i<n;i++) {
        float value = x[i];
        for(int j=0;j<m;j++) {
            total[j]+= w[j][i] * value;
        }
    }

Have you other ideas (except for openmp, which I already use) on how I could speed up these for-loops?
I am using:

gcc -DMNIST=1 -O3 -fno-strict-aliasing -std=c99 -lm -D_GNU_SOURCE -Wall -pedantic -fopenmp

Thanks!

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

夜清冷一曲。 2024-09-20 12:42:42

如果您知道 x、total 和 w 不会互相别名，那么通过重新排列循环索引并避免每次通过循环写入 total[j]：

for(int j=0;j<m;j++) {
    const float * const w_j = w[j];      
    float total_j = 0;
    for(int i=0;i<n;i++)
        total_j += w_j[i] * x[i];
    total[j] += total_j;
}

但是，BLAS 是正确的答案，大多数情况下对于此类事情。最佳解决方案将取决于 n、m、预取时间、管道深度、循环展开、缓存行的大小等。您可能不想这样做其他人在幕后所做的优化水平。

If you know that x, total and w do not alias each other you can get a fairly measurable boost by rearranging the loop indices and avoiding the write to total[j] each time through the loop:

for(int j=0;j<m;j++) {
    const float * const w_j = w[j];      
    float total_j = 0;
    for(int i=0;i<n;i++)
        total_j += w_j[i] * x[i];
    total[j] += total_j;
}

However, BLAS is the right answer, most of the time for this sort of thing. The best solution will depend on n, m, prefetch times, pipeline depths, loop unrolling, the size of your cache lines, etc. You probably don't want to do the level of optimization that it other people have done under the covers.

回复收藏 0 原文

本王不退位尔等都是臣 2024-09-20 12:42:41

如果这真的很重要：

链接到已调整的 CBLAS 库。有很多可供选择，有些是免费的，有些是商业的。一些平台的系统上已经有一个。
将您的代码替换为对 cblas_dgemv 的调用。

这是一个非常容易理解的问题，许多聪明人都为此编写了高度调整的库。使用其中之一。

回复收藏 0 原文

萌化 2024-09-20 12:42:40

现在，每两个连续的内部操作（即total[j]+= w[j][i] * x[i]）写入不同的位置并从远处的位置读取。您可以通过本地化读取和写入（从而更多地访问内部缓存）来获得一些性能 - 例如，通过切换 j 循环和 i 循环，以便j 循环是外部循环，i 循环是内部循环。

这样您就可以本地化读取和写入：

所有i的内存写入将位于同一位置。
w[j][i] 和 x[i] 的内存读取将按顺序进行。

总结：

for(int j=0;j<m;j++) {
    for(int i=0;i<n;i++) {
        total[j]+= w[j][i] * x[i];
    }
}

Right now, each two consecutive internal operations (i.e. total[j]+= w[j][i] * x[i]) write to different locations and read from distant locations. You can possibly gain some performance by localizing reads and writes (thus, hitting more the internal cache) - for example, by switching the j loop and the i loop, so that the j loop is the external and the i loop is the internal.

This way you'll be localizing both the reads and the writes:

Memory writes will be to the same place for all is.
Memory reads will be sequential for w[j][i] and x[i].

To sum up:

for(int j=0;j<m;j++) {
    for(int i=0;i<n;i++) {
        total[j]+= w[j][i] * x[i];
    }
}

回复收藏 0 原文

独孤求败 2024-09-20 12:42:39

其中一种理论是，测试 0 比测试 j 更快。因此，通过从 j=m while j>0 循环，理论上每个循环可以节省一些纳秒。然而，根据最近的经验，这对我来说没有任何区别，所以我认为这不适用于当前的CPU。

另一个问题是内存布局：如果您的内部循环访问的内存块不是分散的，而是连续的，那么您很可能会从 CPU 中可用的最低缓存中获得更多好处。

在当前示例中，将 w 的布局从 w[j][i] 切换为 w[i][j] 可能会有所帮助。在 4 或 8 字节边界上对齐您的值也会有所帮助（但您会发现您的数组已经是这种情况）

另一个是循环展开，这意味着您以 4 字节为单位进行内部循环。因此，如果循环完成，则评估次数必须减少 4 次。最佳值必须根据经验确定，并且还可能取决于当前的问题（例如，如果您知道循环次数是 5 次的倍数，则使用 5）

回复收藏 0 原文

~没有更多了~

关于作者

可爱暴击

暂无简介

0 文章

0 评论

25 人气

关注发私信

友情链接

文江博客

优化“for 循环”在 C99 中具有不同索引的数组上

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（4）

关于作者

相关话题

热门标签

推荐作者

胡图图

zt006

z祗昰~

冰葑

野の

天空

友情链接

优化“for 循环”在 C99 中具有不同索引的数组上

如果你对这篇内容有疑问，欢迎到本站社区发帖提问 参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（4）

关于作者

相关话题

热门标签

推荐作者

胡图图

zt006

z祗昰~

冰葑

野の

天空

友情链接

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。