SSE SIMD Optimization of a For Loop

Posted 2024-09-02 18:27:19 · 876 characters · 1 view · 0 comments


I have some code in a loop

for(int i = 0; i < n; i++)
{
  u[i] = c * u[i] + s * b[i];
}

So, u and b are vectors of the same length, and c and s are scalars. Is this loop a good candidate for SSE vectorization to get a speedup?

UPDATE

I learned about vectorization (it turns out it's not so hard if you use intrinsics) and implemented my loop in SSE. However, when I set the SSE2 flag in the VC++ compiler, I got about the same performance as with my own SSE code. The Intel compiler, on the other hand, was much faster than either my SSE code or the VC++ output.

Here is the code I wrote, for reference:

double *u = (double*) _aligned_malloc(n * sizeof(double), 16);
for(int i = 0; i < n; i++)
{
   u[i] = 0;
}

int j = 0;
__m128d *uSSE = (__m128d*) u;
__m128d cStore = _mm_set1_pd(c);
__m128d sStore = _mm_set1_pd(s);
for (j = 0; j <= n - 2; j += 2)
{
  // _mm_set_pd takes the high element first, so u[j] lands in the
  // low lane and u[j+1] in the high lane.
  __m128d uStore = _mm_set_pd(u[j+1], u[j]);
  __m128d bStore = _mm_set_pd(b[j+1], b[j]);

  __m128d cu = _mm_mul_pd(cStore, uStore);
  __m128d sb = _mm_mul_pd(sStore, bStore);

  uSSE[j/2] = _mm_add_pd(cu, sb);
}
// Scalar cleanup when n is odd.
for(; j < n; ++j)
{
  u[j] = c * u[j] + s * b[j];
}


世界如花海般美丽 2024-09-09 18:27:19


Yes, this is an excellent candidate for vectorization. But, before you do so, make sure you've profiled your code to be sure that this is actually worth optimizing. That said, the vectorization would go something like this:

int i;
for(i = 0; i < n - 3; i += 4)
{
  load elements u[i,i+1,i+2,i+3]
  load elements b[i,i+1,i+2,i+3]
  vector multiply u * c
  vector multiply s * b
  add partial results
  store back to u[i,i+1,i+2,i+3]
}

// Finish up the uneven edge cases (or skip if you know n is a multiple of 4)
for( ; i < n; i++)
  u[i] = c * u[i] + s * b[i];

For even more performance, you can consider prefetching further array elements, and/or unrolling the loop and using software pipelining to interleave the computation in one loop with the memory accesses from a different iteration.

高速公鹿 2024-09-09 18:27:19


_mm_set_pd is not vectorized. If taken literally, it reads the two doubles using scalar operations, then combines the two scalar doubles and copies them into the SSE register. Use _mm_load_pd instead.

彼岸花似海 2024-09-09 18:27:19


Probably yes, but you have to give the compiler some hints.
__restrict__ placed on the pointers tells the compiler that the two pointers do not alias.
If you know the alignment of your vectors, communicate that to the compiler too (Visual C++ may have some facility for this).

I am not familiar with Visual C++ myself, but I have heard it is not good at vectorization.
Consider using the Intel compiler instead.
Intel allows pretty fine-grained control over the generated assembly: http://www.intel.com/software/products/compilers/docs/clin/main_cls/cref_cls/common/cppref_pragma_vector.htm

少年亿悲伤 2024-09-09 18:27:19

Yes, this is a great candidate for vectorization, assuming there is no overlap between the u and b arrays. But the code is bound by memory access (loads/stores). Vectorization helps reduce cycles per loop iteration, but the instructions will stall on cache misses for the u and b arrays. The Intel C/C++ compiler generates the following code with default flags for a Xeon X5500 processor. The compiler unrolls the loop by 8 and employs SIMD ADD (addpd) and MULTIPLY (mulpd) instructions on the xmm[0-15] SIMD registers. In each cycle the processor can issue 2 SIMD instructions, yielding 4-way scalar ILP, assuming the data is ready in the registers.

Here u, b, c, and s are double precision (8 bytes).

    ..B1.14:                        # Preds ..B1.12 ..B1.10
    movaps    %xmm1, %xmm3                                  #5.1
    unpcklpd  %xmm3, %xmm3                                  #5.1
    movaps    %xmm0, %xmm2                                  #6.12
    unpcklpd  %xmm2, %xmm2                                  #6.12
      # LOE rax rcx rbx rbp rsi rdi r8 r12 r13 r14 r15 xmm0 xmm1 xmm2 xmm3
    ..B1.15:     # Preds ..B1.15 ..B1.14
    movsd     (%rsi,%rcx,8), %xmm4                          #6.21
    movhpd    8(%rsi,%rcx,8), %xmm4                         #6.21
    mulpd     %xmm2, %xmm4                                  #6.21
    movaps    (%rdi,%rcx,8), %xmm5                          #6.12
    mulpd     %xmm3, %xmm5                                  #6.12
    addpd     %xmm4, %xmm5                                  #6.21
    movaps    16(%rdi,%rcx,8), %xmm7                        #6.12
    movaps    32(%rdi,%rcx,8), %xmm9                        #6.12
    movaps    48(%rdi,%rcx,8), %xmm11                       #6.12
    movaps    %xmm5, (%rdi,%rcx,8)                          #6.3
    mulpd     %xmm3, %xmm7                                  #6.12
    mulpd     %xmm3, %xmm9                                  #6.12
    mulpd     %xmm3, %xmm11                                 #6.12
    movsd     16(%rsi,%rcx,8), %xmm6                        #6.21
    movhpd    24(%rsi,%rcx,8), %xmm6                        #6.21
    mulpd     %xmm2, %xmm6                                  #6.21
    addpd     %xmm6, %xmm7                                  #6.21
    movaps    %xmm7, 16(%rdi,%rcx,8)                        #6.3
    movsd     32(%rsi,%rcx,8), %xmm8                        #6.21
    movhpd    40(%rsi,%rcx,8), %xmm8                        #6.21
    mulpd     %xmm2, %xmm8                                  #6.21
    addpd     %xmm8, %xmm9                                  #6.21
    movaps    %xmm9, 32(%rdi,%rcx,8)                        #6.3
    movsd     48(%rsi,%rcx,8), %xmm10                       #6.21
    movhpd    56(%rsi,%rcx,8), %xmm10                       #6.21
    mulpd     %xmm2, %xmm10                                 #6.21
    addpd     %xmm10, %xmm11                                #6.21
    movaps    %xmm11, 48(%rdi,%rcx,8)                       #6.3
    addq      $8, %rcx                                      #5.1
    cmpq      %r8, %rcx                                     #5.1
    jl        ..B1.15       # Prob 99%                      #5.1
べ繥欢鉨o。 2024-09-09 18:27:19


It depends on how you placed u and b in memory.
If the two memory blocks are far from each other, SSE won't give much of a boost in this scenario.

It is suggested to store u and b as AOS (array of structures) instead of SOA (structure of arrays), because then you can load both of them into a register with a single instruction.
