Optimizing a "for loop" over arrays with different indexing in C99
I want to speed up an array multiplication in C99.
This is the original for loop:
for (int i = 0; i < n; i++) {
    for (int j = 0; j < m; j++) {
        total[j] += w[j][i] * x[i];
    }
}
My boss asked me to try this, but it did not improve the speed:
for (int i = 0; i < n; i++) {
    float value = x[i];
    for (int j = 0; j < m; j++) {
        total[j] += w[j][i] * value;
    }
}
Do you have other ideas (besides OpenMP, which I already use) on how to speed up these for loops? I am using:
gcc -DMNIST=1 -O3 -fno-strict-aliasing -std=c99 -lm -D_GNU_SOURCE -Wall -pedantic -fopenmp
Thanks!
Comments (4)
If you know that x, total and w do not alias each other, you can get a fairly measurable boost by rearranging the loop indices and avoiding the write to total[j] each time through the loop.
However, BLAS is the right answer most of the time for this sort of thing. The best solution will depend on n, m, prefetch times, pipeline depths, loop unrolling, the size of your cache lines, etc. You probably don't want to do the level of optimization that other people have already done under the covers.
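The rearrangement described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name matvec_accumulate is made up, w is taken to be a row-major m-by-n array of float as in the question, and the restrict qualifiers encode the no-aliasing assumption that the answer relies on.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Sketch: total[j] += sum over i of w[j][i] * x[i].
 * The name and signature are hypothetical. restrict is a promise to the
 * compiler that w, x and total do not overlap; if they do, the behavior
 * is undefined.
 * w is left non-const because C99 does not implicitly convert
 * float (*)[n] to const float (*)[n].
 */
void matvec_accumulate(size_t m, size_t n,
                       float w[restrict m][n],
                       const float x[restrict n],
                       float total[restrict m])
{
    assert(m > 0 && n > 0);          /* array dimensions must be nonzero */
    for (size_t j = 0; j < m; j++) {
        float sum = 0.0f;            /* local accumulator instead of total[j] */
        for (size_t i = 0; i < n; i++) {
            sum += w[j][i] * x[i];   /* w[j][i] and x[i] are read sequentially */
        }
        total[j] += sum;             /* one write per row */
    }
}
```

Because each row is summed locally before being added to total[j], the floating-point rounding can differ slightly from the original loop order, so compare results with a tolerance when validating.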
If this really matters, use a call to cblas_dgemv (or cblas_sgemv if the data is float, as in the question's second loop). This is an extraordinarily well understood problem, and many smart people have written highly-tuned libraries for it. Use one of them.
Right now, each two consecutive inner operations (i.e. total[j] += w[j][i] * x[i]) write to different locations and read from distant locations. You can possibly gain some performance by localizing the reads and writes (and thus hitting the cache more often), for example by switching the j loop and the i loop, so that the j loop is the outer one and the i loop is the inner one.
This way you'll be localizing both the reads and the writes:
Memory writes will go to the same location for all i's.
Memory reads of w[j][i] and x[i] will be sequential.
To sum up: make j the outer loop and i the inner loop.
One of the theories is that testing for zero is faster than testing for j<m. So by looping from j=m while j>0, in theory you could save some nanoseconds per loop. However, in my recent experience this has not made any difference, so I think this doesn't hold for current CPUs.
Another issue is memory layout: if your inner loop accesses a chunk of memory that is contiguous rather than spread out, chances are you will benefit more from the lowest-level cache available in your CPU. In your current example, switching the layout of w from w[j][i] to w[i][j] may therefore help. Aligning your values on 4- or 8-byte boundaries will help as well (but you will find that this is already the case for your arrays).
Another one is loop unrolling, meaning that you do your inner loop in chunks of, say, 4, so the check whether the loop is done has to be evaluated 4 times less often. The optimum value must be determined empirically, and may also depend on the problem at hand (e.g. if you know you're looping a multiple of 5 times, use 5).
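The layout change and the unrolling described above might look something like this. It is a sketch, not a tuned implementation: the function name and the transposed array wt (where wt[i][j] holds what the question calls w[j][i]) are assumptions made for illustration, and the unroll factor of 4 is just the example value from the answer.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Sketch: same product as the question's loop, but with the matrix stored
 * transposed (hypothetical name wt, n rows of m floats) so that the inner
 * loop over j touches contiguous memory in both wt and total.
 * The inner loop is unrolled by hand in chunks of 4.
 */
void matvec_transposed_unrolled(size_t n, size_t m,
                                float wt[restrict n][m],
                                const float x[restrict n],
                                float total[restrict m])
{
    assert(n > 0 && m > 0);          /* array dimensions must be nonzero */
    for (size_t i = 0; i < n; i++) {
        const float value = x[i];    /* hoisted, as in the question's second loop */
        const float *row = wt[i];    /* row i is contiguous in memory */
        size_t j = 0;
        /* Main body: four elements per loop test. */
        for (; j + 4 <= m; j += 4) {
            total[j]     += row[j]     * value;
            total[j + 1] += row[j + 1] * value;
            total[j + 2] += row[j + 2] * value;
            total[j + 3] += row[j + 3] * value;
        }
        /* Remainder: the last m % 4 elements. */
        for (; j < m; j++) {
            total[j] += row[j] * value;
        }
    }
}
```

The additions to each total[j] happen in the same order as in the question's loop. An optimizing compiler may already unroll or vectorize the plain inner loop on its own, so it is worth timing both versions before keeping the manual unrolling.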