Improve locality and decrease cache pollution in a medical image reconstruction implementation
I'm doing research for my university related to an image reconstruction algorithm for medical use.
I've been stuck on something for three weeks now; I need to improve the performance of the following code:
for (lor = lor0[mypid]; lor <= lor1[mypid]; lor++)
{
    LOR_X  = P.symmLOR[lor].x;
    LOR_Y  = P.symmLOR[lor].y;
    LOR_XY = P.symmLOR[lor].xy;
    lor_z  = P.symmLOR[lor].z;
    LOR_Z_X  = P.symmLOR[lor_z].x;
    LOR_Z_Y  = P.symmLOR[lor_z].y;
    LOR_Z_XY = P.symmLOR[lor_z].xy;
    s0 = P.a2r[lor];
    s1 = P.a2r[lor+1];
    for (s = s0; s < s1; s++)
    {
        pixel = P.a2b[s];
        v = P.a2p[s];
        b[lor] += v * x[pixel];
        p = P.symm_Xpixel[pixel];
        b[LOR_X] += v * x[p];
        p = P.symm_Ypixel[pixel];
        b[LOR_Y] += v * x[p];
        p = P.symm_XYpixel[pixel];
        b[LOR_XY] += v * x[p];
        // do Z symmetry.
        pixel_z = P.symm_Zpixel[pixel];
        b[lor_z] += v * x[pixel_z];
        p = P.symm_Xpixel[pixel_z];
        b[LOR_Z_X] += v * x[p];
        p = P.symm_Ypixel[pixel_z];
        b[LOR_Z_Y] += v * x[p];
        p = P.symm_XYpixel[pixel_z];
        b[LOR_Z_XY] += v * x[p];
    }
}
For anyone who wants to know, the code implements the MLEM forward function, and all the variables are float.
After several tests, I noticed that the big delay was in this part of the code (you know, the 90-10 rule).
Later, I used PAPI (http://cl.cs.utk.edu/papi/) to measure L1D cache misses. As I suspected, PAPI confirmed that performance drops because of the high number of misses, particularly for the random accesses to the b vector (which is huge).
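For reference, a minimal sketch of how such a measurement could be set up with PAPI's classic high-level counter API (the forward_project() call is a hypothetical stand-in for the loop above):
#include <stdio.h>
#include <papi.h>

void forward_project(void);   /* hypothetical wrapper around the hot loop */

int main(void)
{
    int events[1] = { PAPI_L1_DCM };   /* L1 data cache misses */
    long long counts[1];

    if (PAPI_start_counters(events, 1) != PAPI_OK)
        return 1;

    forward_project();                 /* run the code being measured */

    if (PAPI_stop_counters(counts, 1) != PAPI_OK)
        return 1;

    printf("L1D misses: %lld\n", counts[0]);
    return 0;
}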
From reading around on the Internet, so far I know of only two options to improve performance: improve data locality or decrease cache pollution.
For the first improvement, I'll try to make the code cache aware, as proposed by Ulrich Drepper in What Every Programmer Should Know About Memory (www.akkadia.org/drepper/cpumemory.pdf), section A.1, Matrix Multiplication.
I believe that blocking the SpMV (sparse matrix-vector multiplication) will improve performance.
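To illustrate what I mean by blocking, here is a rough sketch of a cache-blocked CRS SpMV kernel. It is a generic y += A*x, not the symmetric code above; the names row_ptr/col_idx/val are placeholders, and it assumes the column indices within each row are sorted. Entries are processed one column block at a time so that the part of x being read stays resident in cache.
#include <stdlib.h>

/* Cache-blocked y += A*x for a CRS matrix whose column indices are sorted
 * within each row.  next[i] remembers how far row i has been consumed. */
void spmv_colblocked(int nrows, int ncols, int block,
                     const int *row_ptr, const int *col_idx, const float *val,
                     const float *x, float *y)
{
    int *next = malloc(nrows * sizeof *next);
    for (int i = 0; i < nrows; i++)
        next[i] = row_ptr[i];

    for (int c0 = 0; c0 < ncols; c0 += block) {
        int c1 = c0 + block;                      /* column block [c0, c1) */
        for (int i = 0; i < nrows; i++) {
            float sum = 0.0f;
            int s = next[i];
            while (s < row_ptr[i + 1] && col_idx[s] < c1) {
                sum += val[s] * x[col_idx[s]];    /* x accesses stay in the block */
                s++;
            }
            next[i] = s;
            y[i] += sum;
        }
    }
    free(next);
}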
On the other hand, every time the program accesses the b vector, we get what is known as cache pollution.
Is there a way to load a value from the b vector with SIMD instructions without going through the cache?
Also, is it possible to use a function like void _mm_stream_ps(float *p, __m128 a) to store ONE float value into vector b without polluting the cache?
I can't use _mm_stream_ps as-is because it always stores 4 floats, and the accesses to the b vector are clearly random.
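For what it's worth, SSE2 does provide a single-element non-temporal store, _mm_stream_si32 (the MOVNTI instruction), which writes one 32-bit value with a cache-bypass hint; a float can go through it by reinterpreting its bits. A minimal sketch follows; note that it only covers the store side, since the b[...] += ... updates in the loop still have to load the old value first.
#include <emmintrin.h>   /* SSE2: _mm_stream_si32 (MOVNTI) */
#include <string.h>

/* Store one float with a non-temporal (cache-bypassing) hint.  The value is
 * reinterpreted as a 32-bit integer via memcpy to avoid aliasing problems. */
static inline void stream_store_float(float *dst, float value)
{
    int bits;
    memcpy(&bits, &value, sizeof bits);
    _mm_stream_si32((int *)dst, bits);
}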
I hope my dilemma is clear.
More info: v is the value for the current column of a sparse matrix stored in CRS format. I realize that other optimizations could be done if I changed the CRS format to something else; however, as I said before, I have run many tests over several months, and I know the performance drop is related to the random accesses on vector b. I go from 400,000,000 L1D misses down to roughly 100 misses when I don't store into vector b.
Thanks.
Comments (4)
I'd say, at first try to help your compiler a bit.
- Declare the values that stay fixed inside the inner loop as const, before the loop (all the LOR_..), as local variables, something like: float LOR_X = P.symmLOR[lor].x; or size_t s0 = P.a2r[lor];
- Do the same for the loop variables if you happen to have a modern, C99 compliant, compiler: for (size_t s = s0; s < s1; s++)
- Separate loads and stores for the b vector. The locations of the items that you access, there, do not depend on s. So create a local variable to hold the actual value for all the distinct cases that you handle before the inner loop, update these local variables inside the inner loop, and store the results after the inner loop.
- Perhaps split your loop into several. The index computations are relatively cheap, and then your system might better recognize the streaming access to some of your vectors.
- Check the assembler that your compiler produces and identify the code for the inner loop. Experiment a bit to move as many of the "far" loads and stores out of that loop.
Edit: after re-reading gravitron's answer and your comment, the important thing here is really to declare the variables as local as possible and to check the assembler that the compiler succeeds in keeping the cache-missing loads and stores outside the inner loop.
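As a sketch of the last two points (splitting the loop and keeping the cache-missing store outside it), here is what one of the split passes might look like, assuming the LOR_* fields are integer indices; the other seven symmetry cases would get analogous passes.
const size_t s0 = P.a2r[lor];
const size_t s1 = P.a2r[lor + 1];
const int LOR_X = P.symmLOR[lor].x;       /* per-lor constant, hoisted */

/* One pass per symmetry case: this pass streams through a2p/a2b and
 * accumulates the X-symmetry contribution into a single scalar. */
float acc_x = 0.0f;
for (size_t s = s0; s < s1; s++) {
    const float v = P.a2p[s];
    const int   p = P.symm_Xpixel[P.a2b[s]];
    acc_x += v * x[p];
}
b[LOR_X] += acc_x;                        /* single store, outside the loop */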
A simple optimization to reduce the random accesses on vector b would be to never write into vector b within the inner for loop.
Instead, load all the values needed from vector b into temporary variables, run the entire inner for loop while updating these temporaries, then write the temporary variables back into vector b.
The temporary variables will at worst be located on the same cache lines; depending on your compiler and environment, you may also hint to the compiler to use registers for these variables.
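A minimal sketch of that restructuring for the body of the outer loop, written in an accumulate-then-add form so the result matches the original up to floating-point reordering (names follow the code in the question):
/* Accumulate the eight contributions in scalars and touch b only once per
 * outer iteration, after the inner loop has finished. */
float acc     = 0.0f, acc_x  = 0.0f, acc_y  = 0.0f, acc_xy  = 0.0f;
float acc_z   = 0.0f, acc_zx = 0.0f, acc_zy = 0.0f, acc_zxy = 0.0f;

for (s = s0; s < s1; s++) {
    const int   pixel   = P.a2b[s];
    const int   pixel_z = P.symm_Zpixel[pixel];
    const float v       = P.a2p[s];

    acc     += v * x[pixel];
    acc_x   += v * x[P.symm_Xpixel[pixel]];
    acc_y   += v * x[P.symm_Ypixel[pixel]];
    acc_xy  += v * x[P.symm_XYpixel[pixel]];
    acc_z   += v * x[pixel_z];
    acc_zx  += v * x[P.symm_Xpixel[pixel_z]];
    acc_zy  += v * x[P.symm_Ypixel[pixel_z]];
    acc_zxy += v * x[P.symm_XYpixel[pixel_z]];
}

/* Eight stores per lor instead of eight per matrix entry. */
b[lor]     += acc;      b[LOR_X]    += acc_x;
b[LOR_Y]   += acc_y;    b[LOR_XY]   += acc_xy;
b[lor_z]   += acc_z;    b[LOR_Z_X]  += acc_zx;
b[LOR_Z_Y] += acc_zy;   b[LOR_Z_XY] += acc_zxy;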
I won't even pretend that I know what the code is doing :) But one possible cause for some extra memory accesses is aliasing: if the compiler can't be sure that b, x, and the various P.symm arrays don't overlap, then writes to b will affect how reads from x and the P.symm's can be scheduled. If the compiler is particularly pessimistic, it might even force the fields of P to be re-fetched from memory. All of this will contribute to the cache misses you're seeing. Two simple ways to improve this are:
1. Use __restrict on b. This guarantees that b doesn't overlap the other arrays, so writes to it won't affect reads (or writes) from other arrays.
2. Reorder things so that all the reads from the P.symm's come first, then the reads from x, then finally all the writes to b. This should break up some of the dependencies in the reads, and the compiler can schedule the reads from the P.symm's in parallel, then the reads from x in parallel, and hopefully do the writes to b sensibly.
One other stylistic thing (which will help with point #2) is not to reuse the variable p so much. There's no reason you can't have e.g. p_x, p_y, p_xy, etc., and it will make reordering the code easier.
Once that's all in place, you can start sprinkling prefetch instructions (i.e. __builtin_prefetch on gcc) ahead of known cache misses.
Hope that helps.
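A sketch of how the inner-loop body might look with those suggestions combined: __restrict on b, distinct p_* indices, reads grouped before the writes, and a prefetch of the next entry's x location (the prefetch distance of one entry ahead is only a guess and would need tuning):
/* Elsewhere, b is declared as e.g.  float *__restrict b;  so the compiler
 * knows stores through b cannot alias x or the P.symm_* arrays. */
for (s = s0; s < s1; s++) {
    const int   pixel   = P.a2b[s];
    const int   pixel_z = P.symm_Zpixel[pixel];
    const float v       = P.a2p[s];

    /* All index reads first... */
    const int p_x   = P.symm_Xpixel[pixel];
    const int p_y   = P.symm_Ypixel[pixel];
    const int p_xy  = P.symm_XYpixel[pixel];
    const int p_zx  = P.symm_Xpixel[pixel_z];
    const int p_zy  = P.symm_Ypixel[pixel_z];
    const int p_zxy = P.symm_XYpixel[pixel_z];

    /* ...then the reads from x... */
    const float x0 = x[pixel],   x1 = x[p_x],  x2 = x[p_y],  x3 = x[p_xy];
    const float x4 = x[pixel_z], x5 = x[p_zx], x6 = x[p_zy], x7 = x[p_zxy];

    /* ...then all the writes to b. */
    b[lor]     += v * x0;   b[LOR_X]    += v * x1;
    b[LOR_Y]   += v * x2;   b[LOR_XY]   += v * x3;
    b[lor_z]   += v * x4;   b[LOR_Z_X]  += v * x5;
    b[LOR_Z_Y] += v * x6;   b[LOR_Z_XY] += v * x7;

    /* Prefetch the next entry's x value (read-only, low temporal locality). */
    if (s + 1 < s1)
        __builtin_prefetch(&x[P.a2b[s + 1]], 0, 0);
}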
These are good answers, and I would ask: why so much indexing, by index values that don't change much locally?
Also, it wouldn't kill you to do a few random pauses to see where it typically is.