SSE2 double multiplication slower than the standard multiplication
I'm wondering why the following code with SSE2 instructions performs the multiplication slower than the standard C++ implementation.
Here is the code:
m_win = (double*)_aligned_malloc(size*sizeof(double), 16);
__m128d* pData = (__m128d*)input().data;
__m128d* pWin = (__m128d*)m_win;
__m128d* pOut = (__m128d*)m_output.data;
__m128d tmp;
int i=0;
for(; i<m_size/2;i++)
pOut[i] = _mm_mul_pd(pData[i], pWin[i]);
The memory for m_output.data and input().data has been allocated with _aligned_malloc.
However, for a 2^25-element array, the time to execute this code is identical to the time for this code (350ms):
for(int i=0;i<m_size;i++)
m_output.data[i] = input().data[i] * m_win[i];
How is that possible? It should theoretically take only 50% of the time, right? Or is the overhead for the memory transfer from SIMD registers to the m_output.data array so expensive?
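One way to probe whether the store traffic is the bottleneck is to bypass the cache with non-temporal stores. A minimal sketch (the function name and setup are illustrative; it assumes the pointers are 16-byte aligned and n is even):

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Multiply two aligned double arrays element-wise, writing the result with
// non-temporal stores so the output lines are not first pulled into the cache.
void mul_nt(double* out, const double* a, const double* b, int n)
{
    for (int i = 0; i < n; i += 2) {
        __m128d prod = _mm_mul_pd(_mm_load_pd(a + i), _mm_load_pd(b + i));
        _mm_stream_pd(out + i, prod);  // store straight to memory
    }
    _mm_sfence();  // order the streamed stores before any later reads
}
```

If this version is measurably faster than the plain-store loop, the write traffic on the output array was a significant part of the cost.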
If I replace the line
pOut[i] = _mm_mul_pd(pData[i], pWin[i]);
from the first snippet by
tmp = _mm_mul_pd(pData[i], pWin[i]);
(where tmp is declared as __m128d tmp;), then the code executes blazingly fast, below the resolution of my timer function.
Is that because everything is just stored in the registers and not the memory?
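Most likely, yes, plus dead-code elimination: since tmp is never read afterwards, the optimizer is free to delete the multiplication, and with it the whole loop. A sketch of how to keep the work observable while still avoiding the per-iteration stores (names illustrative):

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Accumulate the products instead of discarding them; returning the sum
// makes the result observable, so the loop cannot be optimized away.
double mul_sum(const double* a, const double* b, int n)
{
    __m128d acc = _mm_setzero_pd();
    for (int i = 0; i < n; i += 2)
        acc = _mm_add_pd(acc, _mm_mul_pd(_mm_load_pd(a + i),
                                         _mm_load_pd(b + i)));
    double lanes[2];
    _mm_storeu_pd(lanes, acc);  // spill the two lanes and combine them
    return lanes[0] + lanes[1];
}
```

Timing this variant (and printing or otherwise using its return value) would show whether the tmp version was really fast or simply empty.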
Even more surprisingly, if I compile in debug mode, the SSE code takes only 93ms while the standard multiplication takes 309ms.
- DEBUG: 93ms (SSE2) / 309ms (standard multiplication)
- RELEASE: 350ms (SSE2) / 350ms (standard multiplication)
What's going on here???
I'm using MSVC2008 with QtCreator 2.2.1 in release mode.
Here are my compiler switches for RELEASE:
cl -c -nologo -Zm200 -Zc:wchar_t- -O2 -MD -GR -EHsc -W3 -w34100 -w34189
and these are for DEBUG:
cl -c -nologo -Zm200 -Zc:wchar_t- -Zi -MDd -GR -EHsc -W3 -w34100 -w34189
EDIT
Regarding the RELEASE vs DEBUG issue:
I just wanted to note that I profiled the code, and the SSE code is in fact slower in release mode!
That somewhat confirms the hypothesis that VS2008 can't handle intrinsics properly when the optimizer is on.
Intel VTune gives me 289ms for the SSE loop in DEBUG and 504ms in RELEASE mode.
Wow... just wow...
First of all, VS 2008 is a bad choice for intrinsics, as it tends to add many more register moves than necessary and in general does not optimize very well (for instance, it has issues with loop induction variable analysis when SSE instructions are present.)
So, my wild guess is that for the scalar loop the compiler generates mulsd instructions, which the CPU can trivially reorder and execute in parallel (there are no dependencies between iterations), while the intrinsics result in lots of register moves and complex SSE code -- it might even blow the trace cache on modern CPUs. VS2008 is notorious for doing all its calculations in registers, and I guess there will be some hazards the CPU cannot skip (e.g. xor reg, mov mem->reg, xor, mov mem->reg, mul, mov reg->mem is a dependency chain, while the scalar code might be mov mem->reg, mul with a memory operand, mov). You should definitely look at the generated assembly, or try VS 2010, which has much better support for intrinsics.

Finally, and most importantly: your code is not compute bound at all, so no amount of SSE will make it significantly faster. On each iteration you read four double values and write two, which means FLOPs are not your problem. In that case you are at the mercy of the cache/memory subsystem, and that probably explains the variance you see.

The debug multiplication shouldn't be faster than release; if you see it being faster, you should do more runs and check what else is going on (be careful if your CPU supports a turbo mode, which adds another 20% of variation). A context switch that empties the cache might be enough in this case.
So, overall, the test you made is pretty much meaningless and just shows that for memory-bound cases there is no difference whether you use SSE or not. You should use SSE where the code is actually compute-dense and parallel, and even then I would spend a lot of time with a profiler to nail down the exact spot to optimize. A simple element-wise product is not suitable for seeing any performance improvement from SSE.
Several points: