改善医学图像重建实施中的局部性并减少缓存污染

发布于 2024-10-13 03:37:42 字数 1937 浏览 9 评论 0原文

我正在为我的大学进行一项与医疗用途图像重建算法相关的研究。

我陷入了长达 3 周的困境,我需要提高以下代码的性能:

for (lor=lor0[mypid]; lor <= lor1[mypid]; lor++)
{
  LOR_X = P.symmLOR[lor].x;
  LOR_Y = P.symmLOR[lor].y;
  LOR_XY = P.symmLOR[lor].xy;
  lor_z = P.symmLOR[lor].z;
  LOR_Z_X = P.symmLOR[lor_z].x;
  LOR_Z_Y = P.symmLOR[lor_z].y;
  LOR_Z_XY = P.symmLOR[lor_z].xy;  

  s0 = P.a2r[lor];
  s1 = P.a2r[lor+1];

  for (s=s0; s < s1; s++)
  {
    pixel     = P.a2b[s];
    v         = P.a2p[s]; 

    b[lor]    += v * x[pixel];

    p          = P.symm_Xpixel[pixel];
    b[LOR_X]  += v * x[p];

    p          = P.symm_Ypixel[pixel];
    b[LOR_Y]  += v * x[p];

    p          = P.symm_XYpixel[pixel];
    b[LOR_XY] += v * x[p];


    // do Z symmetry.
    pixel_z    = P.symm_Zpixel[pixel];
    b[lor_z]  += v * x[pixel_z];


    p          = P.symm_Xpixel[pixel_z];
    b[LOR_Z_X]  += v * x[p];


    p          = P.symm_Ypixel[pixel_z];
    b[LOR_Z_Y]  += v * x[p];

    p          = P.symm_XYpixel[pixel_z];
    b[LOR_Z_XY] += v * x[p];

   }

}

对于任何想知道的人,代码实现了 MLEM 前向函数并且所有变量都是 FLOAT

经过几次测试,我注意到这部分代码有很大的延迟。 (你知道,90 - 10 规则)。

后来,我使用Papi(http://cl.cs.utk.edu/papi/)来测量L1D缓存未命中。正如我所想,Papi 确认由于未命中次数较多,性能会下降,特别是对于 b 向量(大小很大)的随机访问。

在互联网上阅读信息到目前为止我只知道两个提高性能的选择:提高数据局部性或减少数据污染。

为了进行第一个改进,我将尝试将代码更改为缓存感知,就像 Ulrich Drepper 在每个程序员应该了解内存 (www.akkadia.org/drepper /cpumemory.pdf) A.1 矩阵乘法。

我相信阻止 SpMV(稀疏矩阵向量乘法)会提高性能。

另一方面,每次程序尝试访问 b 向量时,我们都会遇到所谓的缓存污染

有没有办法在不使用缓存的情况下使用 SIMD 指令从 b 向量加载值?

另外,是否可以使用像 void _mm_stream_ps(float * p , __m128 a ) 这样的函数在向量 b 上存储一个浮点值而不污染缓存?

我不能使用 _mm_stream_ps 因为总是存储 4 个浮点数,但对 b 向量的访问显然是随机的。

我希望能清楚地了解我的困境。

更多信息:v 是具有 CRS 格式的稀疏矩阵存储的列值。我意识到,如果我尝试将 CRS 格式更改为其他格式,则可以进行其他优化,但是,正如我之前所说,我已经进行了数月的多次测试,并且我知道性能下降与向量 b 上的随机访问有关。当我不存储在向量 b 中时,从 400.000.000 L1D 未命中我可以转到 100~ 未命中。

谢谢。

I'm doing a research for my University related to an Image reconstruction algorithm for medical usage.

I'm stuck in something up to 3 weeks, I need to improve the performance of the following code:

for (lor=lor0[mypid]; lor <= lor1[mypid]; lor++)
{
  LOR_X = P.symmLOR[lor].x;
  LOR_Y = P.symmLOR[lor].y;
  LOR_XY = P.symmLOR[lor].xy;
  lor_z = P.symmLOR[lor].z;
  LOR_Z_X = P.symmLOR[lor_z].x;
  LOR_Z_Y = P.symmLOR[lor_z].y;
  LOR_Z_XY = P.symmLOR[lor_z].xy;  

  s0 = P.a2r[lor];
  s1 = P.a2r[lor+1];

  for (s=s0; s < s1; s++)
  {
    pixel     = P.a2b[s];
    v         = P.a2p[s]; 

    b[lor]    += v * x[pixel];

    p          = P.symm_Xpixel[pixel];
    b[LOR_X]  += v * x[p];

    p          = P.symm_Ypixel[pixel];
    b[LOR_Y]  += v * x[p];

    p          = P.symm_XYpixel[pixel];
    b[LOR_XY] += v * x[p];


    // do Z symmetry.
    pixel_z    = P.symm_Zpixel[pixel];
    b[lor_z]  += v * x[pixel_z];


    p          = P.symm_Xpixel[pixel_z];
    b[LOR_Z_X]  += v * x[p];


    p          = P.symm_Ypixel[pixel_z];
    b[LOR_Z_Y]  += v * x[p];

    p          = P.symm_XYpixel[pixel_z];
    b[LOR_Z_XY] += v * x[p];

   }

}

for anyone who wants to know, The code implements the MLEM forward function and all the variables are FLOAT.

After several tests, I had notice that the big delay was on this part of the code. (you know, the 90 - 10 rule).

Later, I used Papi (http://cl.cs.utk.edu/papi/) to measure L1D cache misses. As I thought, Papi confirm that the performance decreases due to higher amount of misses, particularly for the random access to b vector (huge in size).

Reading information on the Internet I just know two options to improve performance so far: improve data locality or decrease data pollution.

To do the first improvement, I'll try to change the code to be cache aware, just like was propossed by Ulrich Drepper on What every programmer should know about the memory (www.akkadia.org/drepper/cpumemory.pdf) A.1 Matrix multiplication.

I believe that blocking the SpMV (sparse matrix-vector Multiplication) will improve the performance.

On the other hand, every time the program tried to access to b vector, we had what is known as cache pollution.

Is there's a way to do a load a value from the b vector with SIMD instruction without using the Cache?

Also, it is possible to use a function like void _mm_stream_ps(float * p , __m128 a ) to store ONE float value on the vector b without polluting the Cache?

I can't use _mm_stream_ps because always store 4 floats but the access to the b vector is clearly random.

I'd hope to be clear in my dilemma.

More info: v is the column value of an Sparse Matrix store with CRS format. I realize that other optimization could be done if I tried to change CRS format to other, however, like I said before, I'd made several test for months and I know that the performance decrease is related to random access on vector b. from 400.000.000 L1D Misses I can go to 100~ Misses when I don't store in vector b.

Thanks.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

帅气尐潴 2024-10-20 03:37:42

我想说,首先尝试为你的编译器提供一些帮助。

  • 声明外循环的边界
    as const 在循环之前。
  • 声明所有可能的变量
    (所有LOR_..)作为局部变量,
    像这样的东西:
    float LOR_X = P.symmLOR[lor].x;
    size_t s0 = P.a2r[lor];
  • 这也特别适用于 for 循环
    如果你碰巧有变量
    现代,C99 兼容,编译器:for
    (size_t s=s0; s < s1; s++)
  • b 单独加载和存储
    向量。物品的位置
    你访问的,在那里,不依赖
    s 上。所以创建一个局部变量
    保留所有的实际值
    您处理的不同案件
    在内层循环之前,更新内部循环中的这些局部变量
    循环,并将结果存储在
    内循环。
  • 也许将你的内循环分开
    一些。指数计算
    相对便宜,然后你的
    系统可能会更好地识别
    流式访问您的一些
    向量。
  • 看看你的汇编器
    编译器产生并识别
    内循环的代码。实验一
    位移动尽可能多的“远”负载
    并将其存储在该循环之外。

编辑:在重新阅读 gravitron 的答案和您的评论之后,这里重要的事情实际上是将变量声明为尽可能本地,以检查编译器是否成功加载了缓存缺失的汇编器以及存储在内循环之外。

I'd say, at first try to help your compiler a bit.

  • Declare the bounds for the outer loop
    as const before the loop.
  • Declare all variables that you may
    (all the LOR_..) as local variables,
    something like:
    float LOR_X = P.symmLOR[lor].x; or
    size_t s0 = P.a2r[lor];
  • This also in particular for loop
    variables if you happen to have
    modern, C99 compliant, compiler: for
    (size_t s=s0; s < s1; s++)
  • Separate load and store for your b
    vector. The locations of the items
    that you access, there, do not depend
    on s. So create a local variable to
    hold the actual value for all the
    distinct cases that you handle
    before the inner loop, update these local variables inside the inner
    loop, and store the results after the
    inner loop.
  • Perhaps separate your inner loop in
    several. The index computations
    are relatively cheap, and then your
    system might better recognize the
    streaming access to some of your
    vectors.
  • Look at the assembler that your
    compiler produces and identify the
    code for the inner loop. Experiment a
    bit to move as many of the "far" load
    and stores out of that loop.

Edit: after re-reading gravitron's answer and your comment, the important thing here is really to declare the variables as local as possible and to check the assembler that the compiler succeeds in having the cache-missing loads and stores outside the inner loop.

叫思念不要吵 2024-10-20 03:37:42

减少向量 b 上的随机访问的一个简单优化是永远不要在内部 for 循环中写入向量 b。

相反,将向量 B 所需的所有值加载到临时变量中,在更新这些临时变量时执行整个内部 for 循环,然后将临时变量写回到向量 B 中。

临时变量最坏的情况是位于相同的缓存行上,具体取决于您的编译器和环境您还可以提示编译器对这些变量使用寄存器。

A simple optimization to reduce the random access on vector b would be to never write into vector b within the inner for loop.

Instead load all values needed from vector B into temporary variables, do the entire inner for loop while updating these temporary variables, then write the temporary variables back into vector B.

The temporary variables will at worst be located on the same cache lines, depending on your compiler and environment you may also hint for the compiler to use registers for these variables.

棒棒糖 2024-10-20 03:37:42

我什至不会假装我知道代码在做什么:) 但是造成额外内存访问的一个可能原因是别名:如果编译器无法确定 b, x,并且各个 P.symm 数组不重叠,那么写入 b 将影响从 x 和可以安排P.symm。如果编译器特别悲观,它甚至可能强制从内存中重新获取 P 的字段。所有这些都会导致您看到的缓存未命中。改进此问题的两个简单方法是:

  1. b 上使用 __restrict。这保证了 b 不会与其他数组重叠,因此对其进行写入不会影响其他数组的读取(或写入)。

  2. 重新排序,以便首先读取 P.symm 的所有内容,然后读取 x 的内容,最后写入 b< /代码>。这应该会打破读取中的一些依赖关系,编译器会并行安排从 P.symm 读取,然后并行从 x 读取,希望如此明智地写入 b

另一件风格上的事情(这将有助于第 2 点)是不要过多地重用变量。您没有理由不能使用例如 p_xp_yp_xy 等,这将使代码重新排序变得更加容易。

一旦一切就绪,您就可以在已知的缓存未命中之前开始撒入预取指令(即 gcc 上的 __builtin_prefetch)。

希望有帮助。

I won't even pretend that I know what the code is doing :) But one possible cause for some extra memory access is aliasing: if the compiler can't be sure that b, x, and the various P.symm arrays don't overlap, then writes to b will affect how reads from x and the P.symm's can be scheduled. If the compiler is particularly pessimistic, it might even force the fields of P to be re-fetched from memory. All of this will contribute to the cache misses you're seeing. Two simple ways to improve this are:

  1. Use __restrict on b. This guarantees that b doesn't overlap the other arrays, so writes to it won't affect reads (or writes) from other arrays.

  2. Reorder things so that all the reads from P.symm's are first, then the reads from x, then finally all the writes to b. This should break up some of the dependencies in the reads and the compiler schedule the reads from P.symm's in parallel, then the reads from x in parallel, and hopefully do the writes to b sensibly.

One other stylistic thing (which will help with point #2) is to not reuse variables so p so much. There's no reason you can't have e.g. p_x, p_y, p_xy, etc. and it will make reordering the code easier.

Once that's all in place, you can start sprinkling prefetch instructions (i.e. __builtin_prefetch on gcc) ahead of known cache misses.

Hope that helps.

雨后咖啡店 2024-10-20 03:37:42

这些都是很好的答案,我想问为什么要建立这么多索引?通过本地变化不大的索引值?

而且,随机停顿几次看看它通常在哪里也不会害死你。

These are good answers, and I would ask why so much indexing? by index values that don't change much locally?

Also, it wouldn't kill you to do a few random pauses to see where it typically is.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文