Analysing SIMD code

Posted 2024-11-03 17:30:41


UPDATED - Check Below

Will keep this as short as possible. Happy to add any more details if required.

I have some sse code for normalising a vector. I'm using QueryPerformanceCounter() (wrapped in a helper struct) to measure performance.

If I measure like this

for( int j = 0; j < NUM_VECTORS; ++j )
{
  Timer t(norm_sse);
  NormaliseSSE( vectors_sse+j);
}

The results I get are often slower than just doing a standard normalise with 4 doubles representing a vector (testing in the same configuration).

for( int j = 0; j < NUM_VECTORS; ++j )
{
  Timer t(norm_dbl);
  NormaliseDBL( vectors_dbl+j);
}

However, timing just the entirety of the loop like this

{
  Timer t(norm_sse);
  for( int j = 0; j < NUM_VECTORS; ++j ){
    NormaliseSSE( vectors_sse+j );
  }    
}

shows the SSE code to be an order of magnitude faster, but doesn't really affect the measurements for the double version.
I've done a fair bit of experimentation and searching, and can't seem to find a reasonable answer as to why.

For example, I know there can be penalties when casting the results to float, but none of that is going on here.

Can anyone offer any insight? What is it about calling QueryPerformanceCounter between each normalise that slows the SIMD code down so much?

Thanks for reading :)

More details below:

  • Both normalise methods are inlined (verified in disassembly)
  • Running in release
  • 32 bit compilation

Simple Vector struct

__declspec(align(16)) struct FVECTOR{
    typedef float REAL;
  union{
    struct { REAL x, y, z, w; };
    __m128 Vec;
  };
};

Code to Normalise SSE:

  __m128 Vec = _v->Vec;
  __m128 sqr = _mm_mul_ps( Vec, Vec ); // Vec * Vec
  __m128 yxwz = _mm_shuffle_ps( sqr, sqr , 0x4e ); 
  __m128 addOne = _mm_add_ps( sqr, yxwz ); 
  __m128 swapPairs = _mm_shuffle_ps( addOne, addOne , 0x11 );
  __m128 addTwo = _mm_add_ps( addOne, swapPairs ); 
  __m128 invSqrOne = _mm_rsqrt_ps( addTwo ); 
  _v->Vec = _mm_mul_ps( invSqrOne, Vec );   

Code to normalise doubles

double len_recip = 1./sqrt(v->x*v->x + v->y*v->y + v->z*v->z);
v->x *= len_recip;
v->y *= len_recip;
v->z *= len_recip;

Helper struct

struct Timer{
  Timer( LARGE_INTEGER & a_Storage ): Storage( a_Storage ){
      QueryPerformanceCounter( &PStart );
  }

  ~Timer(){
    LARGE_INTEGER PEnd;
    QueryPerformanceCounter( &PEnd );
    Storage.QuadPart += ( PEnd.QuadPart - PStart.QuadPart );
  }

  LARGE_INTEGER& Storage;
  LARGE_INTEGER PStart;
};

Update
So thanks to John's comments, I think I've managed to confirm that it is QueryPerformanceCounter that's doing bad things to my SIMD code.

I added a new timer struct that uses RDTSC directly, and it seems to give results consistent with what I would expect. The result is still far slower than timing the entire loop rather than each iteration separately, but I expect that's because getting the RDTSC value involves flushing the instruction pipeline (see http://www.strchr.com/performance_measurements_with_rdtsc for more info).

struct PreciseTimer{

    PreciseTimer( LARGE_INTEGER& a_Storage ) : Storage(a_Storage){
        StartVal.QuadPart = GetRDTSC();
    }

    ~PreciseTimer(){
        Storage.QuadPart += ( GetRDTSC() - StartVal.QuadPart );
    }

    unsigned __int64 inline GetRDTSC() {
        unsigned int lo, hi;
        __asm {
             ; Flush the pipeline
             xor eax, eax
             CPUID
             ; Get RDTSC counter in edx:eax
             RDTSC
             mov DWORD PTR [hi], edx
             mov DWORD PTR [lo], eax
        }

        // Widen hi before shifting: hi << 32 on a 32-bit int is undefined
        return ((unsigned __int64)hi << 32) | lo;

    }

    LARGE_INTEGER StartVal;
    LARGE_INTEGER& Storage;
};


雨后彩虹 2024-11-10 17:30:41


When it's only the SSE code running the loop, the processor should be able to keep its pipelines full and executing a huge number of SIMD instructions per unit time. When you add the timer code within the loop, now there's a whole bunch of non-SIMD instructions, possibly less predictable, between each of the easy-to-optimize operations. It's likely that the QueryPerformanceCounter call is either expensive enough to make the data manipulation part insignificant, or the nature of the code it executes wreaks havoc with the processor's ability to keep executing instructions at the maximum rate (possibly due to cache evictions or branches that are not well-predicted).

You might try commenting out the actual calls to QPC in your Timer class and see how it performs; this may help you discover whether the problem is the construction and destruction of the Timer objects, or the QPC calls. Likewise, try just calling QPC directly in the loop instead of making a Timer and see how that compares.

流星番茄 2024-11-10 17:30:41

QPC is a kernel function, and calling it causes a context switch, which is inherently far more expensive and disruptive than any equivalent user-mode function call, and will definitely annihilate the processor's ability to process at its normal speed. In addition to that, remember that QPC/QPF are abstractions and require their own processing, which likely involves the use of SSE itself.
