SSE2 assembly overhead when using intrinsics
I am new to SSE and SSE2, and I wrote a small C sample (allocating two counters, one increasing and the other decreasing, then adding the two), which works as expected. I used intrinsics and Microsoft Visual Studio 10 C++ Express. As a second step I wanted to understand what's going on under the hood, but now I'm puzzled.
For example, the assignment in the for loop compiles to:
__m128i a_ptr = _mm_load_si128((__m128i*)&(a_aligned[i]));
    mov         eax,dword ptr [i]
    mov         ecx,dword ptr [a_aligned]
    movdqa      xmm0,xmmword ptr [ecx+eax*2]
    movdqa      xmmword ptr [ebp-1C0h],xmm0
    movdqa      xmm0,xmmword ptr [ebp-1C0h]
    movdqa      xmmword ptr [a_ptr],xmm0
I understand that the first two lines fetch the components of a_aligned's address, and the third line loads the data into the xmm0 register. But I don't understand why it is copied back to memory, then into xmm0 again, and then into a_ptr. I thought the _mm_load_si128 intrinsic should copy the 128 bits at a_aligned[i] into xmm0 and nothing more. Why does this happen? Is my understanding wrong? If not, how should I hint the compiler? Is my sample code correct (in the sense that it contains nothing unnecessary)?
Here is my full sample code:
#include <xmmintrin.h>
#include <emmintrin.h>
#include <iostream>

int main(int argc, char *argv[]) {
    unsigned __int16 *a_aligned = (unsigned __int16 *)_mm_malloc(32 * sizeof(unsigned __int16), 16);
    unsigned __int16 *b_aligned = (unsigned __int16 *)_mm_malloc(32 * sizeof(unsigned __int16), 16);
    unsigned __int16 *c_aligned = (unsigned __int16 *)_mm_malloc(32 * sizeof(unsigned __int16), 16);

    for (int i = 0; i < 32; i++) {
        a_aligned[i] = i;
        b_aligned[i] = i;
        c_aligned[i] = 0;
    }

    for (int i = 0; i < 32; i += 8) {
        __m128i a_ptr = _mm_load_si128((__m128i*)&(a_aligned[i]));
        __m128i b_ptr = _mm_load_si128((__m128i*)&(b_aligned[i]));
        __m128i res = _mm_add_epi16(a_ptr, b_ptr);
        _mm_store_si128((__m128i*)&(c_aligned[i]), res);
    }

    for (int i = 1; i < 32; i++) {
        std::cout << c_aligned[i] << " ";
    }

    _mm_free(a_aligned);
    _mm_free(b_aligned);
    _mm_free(c_aligned);
    return 0;
}
Comments (2)
Intrinsics were explicitly designed to help the compiler code generator do a better job optimizing the code. You are looking at the assembly code generated by the Debug configuration. That is not optimized code. Look at the code in the Release build:
Looks better, doesn't it?
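To check this on your own machine, one option is to move the loop into a small function and ask the compiler for an assembly listing with optimization enabled. The following is only a minimal sketch: the function name add_u16_sse2 and its parameters are hypothetical and not part of the question's code, and it assumes that n is a multiple of 8 and that all three pointers are 16-byte aligned.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (_mm_load_si128, _mm_add_epi16, ...)
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not from the question: c[i] = a[i] + b[i] for n
// unsigned 16-bit elements, eight elements per iteration.
// Assumptions: n is a multiple of 8, and a, b, c are 16-byte aligned,
// which the aligned load/store intrinsics require.
void add_u16_sse2(const std::uint16_t* a, const std::uint16_t* b,
                  std::uint16_t* c, std::size_t n) {
    assert(n % 8 == 0);
    assert(reinterpret_cast<std::uintptr_t>(a) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(b) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(c) % 16 == 0);

    for (std::size_t i = 0; i < n; i += 8) {
        // Load eight 16-bit lanes from each input.
        __m128i va = _mm_load_si128(reinterpret_cast<const __m128i*>(a + i));
        __m128i vb = _mm_load_si128(reinterpret_cast<const __m128i*>(b + i));
        // Lane-wise 16-bit add (wraps around on overflow), then store.
        _mm_store_si128(reinterpret_cast<__m128i*>(c + i),
                        _mm_add_epi16(va, vb));
    }
}
```

With MSVC, `cl /c /O2 /FAs file.cpp` writes a listing that interleaves source and assembly; with GCC or Clang, `g++ -O2 -S file.cpp` produces a comparable `.s` file. In an optimized build the compiler is free to keep va and vb in registers, so the extra stores and reloads through stack slots seen in the Debug listing would typically not appear.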
Turn on optimization in your compiler settings (use the Release configuration instead of Debug).