GCC SSE代码优化

发布于 2024-12-12 03:57:34 字数 1776 浏览 6 评论 0原文

这篇文章与我几天前发布的另一篇文章密切相关。这次，我编写了一个简单的代码，仅添加一对元素数组，将结果乘以另一个数组中的值并将其存储在第四个数组中，所有变量都是浮点双精度类型。

我制作了该代码的两个版本：一个带有 SSE 指令，使用调用，另一个没有它们，然后我使用 gcc 和 -O0 优化级别编译它们。我将它们写在下面：

// SSE VERSION

#define N 10000
#define NTIMES 100000
#include <time.h>
#include <stdio.h>
#include <xmmintrin.h>
#include <pmmintrin.h>

double a[N] __attribute__((aligned(16)));
double b[N] __attribute__((aligned(16)));
double c[N] __attribute__((aligned(16)));
double r[N] __attribute__((aligned(16)));

int main(void){
  int i, times;
  for( times = 0; times < NTIMES; times++ ){
     for( i = 0; i <N; i+= 2){ 
        __m128d mm_a = _mm_load_pd( &a[i] );  
        _mm_prefetch( &a[i+4], _MM_HINT_T0 );
        __m128d mm_b = _mm_load_pd( &b[i] );  
        _mm_prefetch( &b[i+4] , _MM_HINT_T0 );
        __m128d mm_c = _mm_load_pd( &c[i] );
        _mm_prefetch( &c[i+4] , _MM_HINT_T0 );
        __m128d mm_r;
        mm_r = _mm_add_pd( mm_a, mm_b );
        mm_a = _mm_mul_pd( mm_r , mm_c );
        _mm_store_pd( &r[i], mm_a );
      }   
   }
 }

//NO SSE VERSION
//same definitions as before
int main(void){
  int i, times;
   for( times = 0; times < NTIMES; times++ ){
     for( i = 0; i < N; i++ ){
      r[i] = (a[i]+b[i])*c[i];
    }   
  }
}

当使用 -O0 编译它们时，如果没有特别给出 -mno-sse （和其他）选项，gcc 会使用 XMM/MMX 寄存器和 SSE 指令。我检查了为第二个代码生成的汇编代码，我注意到它使用了 movsd、addsd 和 mulsd 指令。因此，如果我没记错的话，它会使用 SSE 指令，但仅使用那些使用寄存器最低部分的指令。正如预期的那样，为第一个 C 代码生成的汇编代码使用了 addp 和 mulpd 指令，尽管生成了相当大的汇编代码。

无论如何，据我所知，SIMD 范例的第一个代码应该会获得更好的收益，因为每次迭代都会计算两个结果值。尽管如此，第二个代码的执行速度比第一个代码快 25%。我还用单精度值进行了测试并得到了类似的结果。这是什么原因呢？

原文

This post is closely related to another one I posted some days ago. This time, I wrote a simple code that just adds a pair of arrays of elements, multiplies the result by the values in another array and stores it in a forth array, all variables floating point double precision typed.

I made two versions of that code: one with SSE instructions, using calls to and another one without them I then compiled them with gcc and -O0 optimization level. I write them below:

// SSE VERSION

#define N 10000
#define NTIMES 100000
#include <time.h>
#include <stdio.h>
#include <xmmintrin.h>
#include <pmmintrin.h>

double a[N] __attribute__((aligned(16)));
double b[N] __attribute__((aligned(16)));
double c[N] __attribute__((aligned(16)));
double r[N] __attribute__((aligned(16)));

int main(void){
  int i, times;
  for( times = 0; times < NTIMES; times++ ){
     for( i = 0; i <N; i+= 2){ 
        __m128d mm_a = _mm_load_pd( &a[i] );  
        _mm_prefetch( &a[i+4], _MM_HINT_T0 );
        __m128d mm_b = _mm_load_pd( &b[i] );  
        _mm_prefetch( &b[i+4] , _MM_HINT_T0 );
        __m128d mm_c = _mm_load_pd( &c[i] );
        _mm_prefetch( &c[i+4] , _MM_HINT_T0 );
        __m128d mm_r;
        mm_r = _mm_add_pd( mm_a, mm_b );
        mm_a = _mm_mul_pd( mm_r , mm_c );
        _mm_store_pd( &r[i], mm_a );
      }   
   }
 }

//NO SSE VERSION
//same definitions as before
int main(void){
  int i, times;
   for( times = 0; times < NTIMES; times++ ){
     for( i = 0; i < N; i++ ){
      r[i] = (a[i]+b[i])*c[i];
    }   
  }
}

When compiling them with -O0, gcc makes use of XMM/MMX registers and SSE intstructions, if not specifically given the -mno-sse (and others) options. I inspected the assembly code generated for the second code and I noticed that it makes use of movsd, addsd and mulsd instructions. So it makes use of SSE instructions but only of those that use the lowest part of the registers, if I am not wrong. The assembly code generated for the first C code made use, as expected, of the addp and mulpd instructions, though a pretty larger assembly code was generated.

Anyway, the first code should get better profit, as far as I know, of SIMD paradigm, since every iteration two result values are computed. Still that, the second code performs something such as a 25 per cent faster than the first one. I also made a test with single precision values and get similar results. What's the reason for that?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

北凤男飞 2024-12-19 03:57:34

GCC 中的矢量化在 -O3 处启用。这就是为什么在 -O0 处，您只能看到普通标量 SSE2 指令（movsd、addsd 等）。使用 GCC 4.6.1 和您的第二个示例：

#define N 10000
#define NTIMES 100000

double a[N] __attribute__ ((aligned (16)));
double b[N] __attribute__ ((aligned (16)));
double c[N] __attribute__ ((aligned (16)));
double r[N] __attribute__ ((aligned (16)));

int
main (void)
{
  int i, times;
  for (times = 0; times < NTIMES; times++)
    {
      for (i = 0; i < N; ++i)
        r[i] = (a[i] + b[i]) * c[i];
    }

  return 0;
}

并使用 gcc -S -O3 -msse2 sse.c 进行编译，为内部循环生成以下指令，这非常好：

.L3:
    movapd  a(%eax), %xmm0
    addpd   b(%eax), %xmm0
    mulpd   c(%eax), %xmm0
    movapd  %xmm0, r(%eax)
    addl    $16, %eax
    cmpl    $80000, %eax
    jne .L3

如您所见，支持矢量化的 GCC 发出代码来并行执行两个循环迭代。不过，它还可以改进 - 该代码使用 SSE 寄存器的低 128 位，但它可以通过启用 SSE 指令的 AVX 编码（如果机器上可用）来使用完整的 256 位 YMM 寄存器。因此，使用 gcc -S -O3 -msse2 -mavx sse.c 编译相同的程序会得到内部循环：

.L3:
    vmovapd a(%eax), %ymm0
    vaddpd  b(%eax), %ymm0, %ymm0
    vmulpd  c(%eax), %ymm0, %ymm0
    vmovapd %ymm0, r(%eax)
    addl    $32, %eax
    cmpl    $80000, %eax
    jne .L3

请注意每个指令前面的 v 以及该指令使用 256 位 YMM 寄存器，并行执行原始循环的四次迭代。

Vectorization in GCC is enabled at -O3. That's why at -O0, you see only the ordinary scalar SSE2 instructions (movsd, addsd, etc). Using GCC 4.6.1 and your second example:

#define N 10000
#define NTIMES 100000

double a[N] __attribute__ ((aligned (16)));
double b[N] __attribute__ ((aligned (16)));
double c[N] __attribute__ ((aligned (16)));
double r[N] __attribute__ ((aligned (16)));

int
main (void)
{
  int i, times;
  for (times = 0; times < NTIMES; times++)
    {
      for (i = 0; i < N; ++i)
        r[i] = (a[i] + b[i]) * c[i];
    }

  return 0;
}

and compiling with gcc -S -O3 -msse2 sse.c produces for the inner loop the following instructions, which is pretty good:

.L3:
    movapd  a(%eax), %xmm0
    addpd   b(%eax), %xmm0
    mulpd   c(%eax), %xmm0
    movapd  %xmm0, r(%eax)
    addl    $16, %eax
    cmpl    $80000, %eax
    jne .L3

As you can see, with the vectorization enabled GCC emits code to perform two loop iterations in parallel. It can be improved, though - this code uses the lower 128 bits of the SSE registers, but it can use the full the 256-bit YMM registers, by enabling the AVX encoding of SSE instructions (if available on the machine). So, compiling the same program with gcc -S -O3 -msse2 -mavx sse.c gives for the inner loop:

.L3:
    vmovapd a(%eax), %ymm0
    vaddpd  b(%eax), %ymm0, %ymm0
    vmulpd  c(%eax), %ymm0, %ymm0
    vmovapd %ymm0, r(%eax)
    addl    $32, %eax
    cmpl    $80000, %eax
    jne .L3

Note that v in front of each instruction and that instructions use the 256-bit YMM registers, four iterations of the original loop are executed in parallel.

回复收藏 0 原文

你又不是我 2024-12-19 03:57:34

我想扩展 chill 的答案并提请您注意 GCC 似乎无法做到同样聪明的事实向后迭代时使用 AVX 指令。

只需将 chill 示例代码中的内部循环替换为：

for (i = N-1; i >= 0; --i)
    r[i] = (a[i] + b[i]) * c[i];

带有选项 -S -O3 -mavx 的 GCC (4.8.4) 就会生成：

.L5:
    vmovsd  a+79992(%rax), %xmm0
    subq    $8, %rax
    vaddsd  b+80000(%rax), %xmm0, %xmm0
    vmulsd  c+80000(%rax), %xmm0, %xmm0
    vmovsd  %xmm0, r+80000(%rax)
    cmpq    $-80000, %rax
    jne     .L5

I would like to extend chill's answer and draw your attention on the fact that GCC seems not to be able to do the same smart use of the AVX instructions when iterating backwards.

Just replace the inner loop in chill's sample code with:

for (i = N-1; i >= 0; --i)
    r[i] = (a[i] + b[i]) * c[i];

GCC (4.8.4) with options -S -O3 -mavx produces:

.L5:
    vmovsd  a+79992(%rax), %xmm0
    subq    $8, %rax
    vaddsd  b+80000(%rax), %xmm0, %xmm0
    vmulsd  c+80000(%rax), %xmm0, %xmm0
    vmovsd  %xmm0, r+80000(%rax)
    cmpq    $-80000, %rax
    jne     .L5

回复收藏 0 原文

~没有更多了~