Analysing SIMD code

Posted 2024-11-03 17:30:41


UPDATED - Check Below

Will keep this as short as possible. Happy to add any more details if required.

I have some sse code for normalising a vector. I'm using QueryPerformanceCounter() (wrapped in a helper struct) to measure performance.

If I measure like this

for( int j = 0; j < NUM_VECTORS; ++j )
{
  Timer t(norm_sse);
  NormaliseSSE( vectors_sse+j);
}

The results I get are often slower than just doing a standard normalise with 4 doubles representing a vector (testing in the same configuration).

for( int j = 0; j < NUM_VECTORS; ++j )
{
  Timer t(norm_dbl);
  NormaliseDBL( vectors_dbl+j);
}

However, timing just the entirety of the loop like this

{
  Timer t(norm_sse);
  for( int j = 0; j < NUM_VECTORS; ++j ){
    NormaliseSSE( vectors_sse+j );
  }    
}

shows the SSE code to be an order of magnitude faster, but doesn't really affect the measurements for the double version.
I've done a fair bit of experimentation and searching, and can't seem to find a reasonable answer as to why.

For example, I know there can be penalties when casting the results to float, but none of that is going on here.

Can anyone offer any insight? What is it about calling QueryPerformanceCounter between each normalise that slows the SIMD code down so much?

Thanks for reading :)

More details below:

  • Both normalise methods are inlined (verified in disassembly)
  • Running in release
  • 32 bit compilation

Simple Vector struct

__declspec(align(16)) struct FVECTOR{
    typedef float REAL;
  union{
    struct { REAL x, y, z, w; };
    __m128 Vec;
  };
};

Code to Normalise SSE:

  __m128 Vec = _v->Vec;
  __m128 sqr = _mm_mul_ps( Vec, Vec ); // Vec * Vec
  __m128 yxwz = _mm_shuffle_ps( sqr, sqr , 0x4e ); 
  __m128 addOne = _mm_add_ps( sqr, yxwz ); 
  __m128 swapPairs = _mm_shuffle_ps( addOne, addOne , 0x11 );
  __m128 addTwo = _mm_add_ps( addOne, swapPairs ); 
  __m128 invSqrOne = _mm_rsqrt_ps( addTwo ); 
  _v->Vec = _mm_mul_ps( invSqrOne, Vec );   

Code to normalise doubles

double len_recip = 1./sqrt(v->x*v->x + v->y*v->y + v->z*v->z);
v->x *= len_recip;
v->y *= len_recip;
v->z *= len_recip;

Helper struct

struct Timer{
  Timer( LARGE_INTEGER & a_Storage ): Storage( a_Storage ){
      QueryPerformanceCounter( &PStart );
  }

  ~Timer(){
    LARGE_INTEGER PEnd;
    QueryPerformanceCounter( &PEnd );
    Storage.QuadPart += ( PEnd.QuadPart - PStart.QuadPart );
  }

  LARGE_INTEGER& Storage;
  LARGE_INTEGER PStart;
};

Update
So thanks to John's comments, I think I've managed to confirm that it is QueryPerformanceCounter that's doing bad things to my SIMD code.

I added a new timer struct that uses RDTSC directly, and it seems to give results consistent with what I would expect. The result is still far slower than timing the entire loop rather than each iteration separately, but I expect that's because getting the RDTSC value involves flushing the instruction pipeline (see http://www.strchr.com/performance_measurements_with_rdtsc for more info).

struct PreciseTimer{

    PreciseTimer( LARGE_INTEGER& a_Storage ) : Storage(a_Storage){
        StartVal.QuadPart = GetRDTSC();
    }

    ~PreciseTimer(){
        Storage.QuadPart += ( GetRDTSC() - StartVal.QuadPart );
    }

    unsigned __int64 inline GetRDTSC() {
        unsigned int lo, hi;
        __asm {
             ; Flush the pipeline
             xor eax, eax
             CPUID
             ; Get RDTSC counter in edx:eax
             RDTSC
             mov DWORD PTR [hi], edx
             mov DWORD PTR [lo], eax
        }

        // Widen hi before shifting: hi << 32 on a 32-bit int is undefined
        return ((unsigned __int64)hi << 32) | lo;

    }

    LARGE_INTEGER StartVal;
    LARGE_INTEGER& Storage;
};


雨后彩虹 2024-11-10 17:30:41


When it's only the SSE code running the loop, the processor should be able to keep its pipelines full and executing a huge number of SIMD instructions per unit time. When you add the timer code within the loop, now there's a whole bunch of non-SIMD instructions, possibly less predictable, between each of the easy-to-optimize operations. It's likely that the QueryPerformanceCounter call is either expensive enough to make the data manipulation part insignificant, or the nature of the code it executes wreaks havoc with the processor's ability to keep executing instructions at the maximum rate (possibly due to cache evictions or branches that are not well-predicted).

You might try commenting out the actual calls to QPC in your Timer class and see how it performs; this may help you discover whether the problem is the construction and destruction of the Timer objects, or the QPC calls. Likewise, try just calling QPC directly in the loop instead of making a Timer and see how that compares.

流星番茄 2024-11-10 17:30:41

QPC is a kernel function, and calling it causes a context switch, which is inherently far more expensive and disruptive than any equivalent user-mode function call, and will definitely annihilate the processor's ability to process at its normal speed. In addition to that, remember that QPC/QPF are abstractions and require their own processing, which likely involves the use of SSE itself.
