How do I use 128-bit C variables with 128-bit xmm asm?
In gcc, I want to do a 128-bit XOR of two C variables via asm code. How?
asm (
"movdqa %1, %%xmm1;"
"movdqa %0, %%xmm0;"
"pxor %%xmm1,%%xmm0;"
"movdqa %%xmm0, %0;"
:"=x"(buff) /* output operand */
:"x"(bu), "x"(buff)
:"%xmm0","%xmm1"
);
But I get a segmentation fault. This is the objdump output:
movq -0x80(%rbp),%xmm2
movq -0x88(%rbp),%xmm3
movdqa %xmm2,%xmm1
movdqa %xmm2,%xmm0
pxor %xmm1,%xmm0
movdqa %xmm0,%xmm2
movq %xmm2,-0x78(%rbp)
You would see segfaults if the variables aren't 16-byte aligned. The CPU can't MOVDQA to or from unaligned memory addresses; it raises a processor-level general-protection (#GP) exception, which prompts the OS to segfault your app.
C variables you declare (stack, global) or allocate on the heap aren't generally aligned to a 16-byte boundary, though occasionally you may get an aligned one by chance. You can direct the compiler to ensure proper alignment by using the __m128 or __m128i data types. Each of those declares a properly aligned 128-bit value.
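To make the alignment point concrete, here is a minimal sketch. The helper names is_aligned16 and xor16_unaligned are made up for illustration and do not come from the original answer. The first helper checks whether an address sits on a 16-byte boundary; the second XORs two 16-byte buffers using unaligned loads and stores (the intrinsic counterpart of MOVDQU), which is one way to handle data whose alignment cannot be guaranteed.

```c
#include <emmintrin.h>  /* SSE2: __m128i, _mm_loadu_si128, _mm_xor_si128 */
#include <stdint.h>

/* Hypothetical helper: nonzero if p sits on a 16-byte boundary. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical helper: dst = a ^ b for 16-byte buffers of any alignment.
 * _mm_loadu_si128 and _mm_storeu_si128 are the unaligned accesses (MOVDQU);
 * the aligned forms (_mm_load_si128, _mm_store_si128, i.e. MOVDQA) would
 * fault on a misaligned pointer. */
static void xor16_unaligned(uint8_t *dst, const uint8_t *a, const uint8_t *b)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)dst, _mm_xor_si128(va, vb));
}
```

A local variable of type __m128i should pass the is_aligned16 check, while a pointer one byte into a char buffer generally will not.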
Further, reading the objdump, it looks like the compiler wrapped the asm sequence with code that copies the operands from the stack to the xmm2 and xmm3 registers using the MOVQ instruction, only to have your asm code then copy the values to xmm0 and xmm1. After XOR-ing into xmm0, the wrapper copies the result to xmm2, only to then copy it back to the stack. Overall, not terribly efficient. MOVQ copies 8 bytes at a time and, under some circumstances (for example when alignment checking is enabled), expects an 8-byte-aligned address. Given an unaligned address, it could fail just like MOVDQA. The wrapper code, however, adds offsets that are multiples of 8 (-0x80, -0x88, and later -0x78) to the BP register, which may or may not contain an aligned value. Overall, there's no guarantee of alignment in the generated code.
The following ensures the arguments and result are stored in correctly aligned memory locations, and seems to work fine:
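A minimal sketch of such a version, written as a function rather than a full program. The name xor128_asm is illustrative, and this is a reconstruction under assumptions, not the original listing. The operands are __m128i values, so the compiler keeps them properly aligned. Note that the asm reads %1 and %2 (the two inputs) and only writes %0. In the question's listing both movdqa instructions read xmm2 and xmm3 is never used, which is consistent with %0 naming the write-only output rather than an input.

```c
#include <emmintrin.h>  /* SSE2: __m128i */

/* Hypothetical helper: returns a ^ b using explicit register loads. */
static __m128i xor128_asm(__m128i a, __m128i b)
{
    __m128i x;
    __asm__ (
        "movdqa %1, %%xmm0\n\t"      /* xmm0 <- a */
        "movdqa %2, %%xmm1\n\t"      /* xmm1 <- b */
        "pxor   %%xmm1, %%xmm0\n\t"  /* xmm0 <- xmm0 ^ xmm1 */
        "movdqa %%xmm0, %0"          /* x    <- xmm0 */
        : "=x"(x)                    /* %0: output operand */
        : "x"(a), "x"(b)             /* %1, %2: input operands */
        : "xmm0", "xmm1"             /* clobbered registers */
    );
    return x;
}
```

On a 32-bit x86 target SSE2 may have to be enabled explicitly, for example with gcc's -msse2 option; x86-64 targets enable it by default.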
This was compiled with gcc on 32-bit Ubuntu.
_mm_setr_epi32 is used to initialize a and b with 128-bit values, as the compiler may not support 128-bit integer literals.
print128 writes out the hexadecimal representation of a 128-bit integer, as printf may not be able to do so.
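As a rough sketch of these two pieces, assuming a formatting helper named format128 (a made-up name) underneath print128: _mm_setr_epi32 takes its four 32-bit arguments lowest dword first, and the value is copied out with memcpy into two 64-bit halves before printing. PRIx64 from inttypes.h is used so the format string matches uint64_t on both 32-bit and 64-bit targets.

```c
#include <emmintrin.h>  /* SSE2: __m128i, _mm_setr_epi32 */
#include <inttypes.h>   /* PRIx64 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>     /* memcpy */

/* Hypothetical helper: write v as two 16-digit hex halves, high half first.
 * out needs room for 34 bytes (16 + space + 16 + terminating NUL). */
static void format128(__m128i v, char out[34])
{
    uint64_t half[2];
    memcpy(half, &v, sizeof half);          /* half[0] = low 64 bits */
    snprintf(out, 34, "%016" PRIx64 " %016" PRIx64, half[1], half[0]);
}

/* Sketch of a print128 built on top of format128. */
static void print128(__m128i v)
{
    char line[34];
    format128(v, line);
    puts(line);
}
```

For example, _mm_setr_epi32(1, 2, 3, 4) is formatted as 0000000400000003 0000000200000001, since the first argument ends up in the lowest 32 bits.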
The following is shorter and avoids some of the duplicate copying. The compiler adds its own hidden wrapping movdqa instructions so that pxor %2,%0 works without you having to load the registers yourself:
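One way this shorter form might look, as a sketch: the function name xor128_short is illustrative, and the use of the matching constraint "0" is an assumption about how the first input gets tied to the output register. No registers are named explicitly, so there is nothing to list as clobbered.

```c
#include <emmintrin.h>  /* SSE2: __m128i */

/* Hypothetical helper: returns a ^ b with a single PXOR. */
static __m128i xor128_short(__m128i a, __m128i b)
{
    __m128i x;
    __asm__ (
        "pxor %2, %0"        /* %0 <- %0 ^ %2 (AT&T order: source, destination) */
        : "=x"(x)            /* %0: output operand */
        : "0"(a), "x"(b)     /* "0" puts a in the same register as %0; b is %2 */
    );
    return x;
}
```

The compiler chooses the xmm registers and emits whatever moves are needed to get a and b into them.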
It compiles the same way as before.
Alternatively, if you'd like to avoid the inline assembly, you could use the SSE intrinsics instead. Those are inlined functions/macros that encapsulate MMX/SSE instructions with a C-like syntax. _mm_xor_si128 reduces your task to a single call:
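A minimal sketch of the intrinsic version. The wrapper names xor128_intrin and equal128 are illustrative only; _mm_xor_si128, _mm_cmpeq_epi8 and _mm_movemask_epi8 are standard SSE2 intrinsics from emmintrin.h.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical wrapper: 128-bit XOR without inline assembly. */
static __m128i xor128_intrin(__m128i a, __m128i b)
{
    return _mm_xor_si128(a, b);   /* typically compiles to one PXOR */
}

/* Hypothetical helper: nonzero if all 16 bytes of a and b are equal. */
static int equal128(__m128i a, __m128i b)
{
    return _mm_movemask_epi8(_mm_cmpeq_epi8(a, b)) == 0xFFFF;
}
```

Because the compiler understands the intrinsic, it can schedule and allocate registers around it freely, which is harder for it to do with an opaque asm block.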
Umm, why not use the __builtin_ia32_pxor intrinsic?
Under recent gcc (mine is 4.5.5), the option -O2 or above implies -fstrict-aliasing, which causes the code given above to produce a strict-aliasing warning. This can be remedied by supplying additional type attributes as follows:
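A sketch of what such a fix might look like. The assumption here, not confirmed by the text, is that the attribute in question is GCC's may_alias, applied through a typedef; the names aliasing_u64 and half128 are made up for illustration. PRIx64 is used in the format string so that no manual switch between %llx and %lx is needed.

```c
#include <emmintrin.h>  /* SSE2: __m128i */
#include <inttypes.h>   /* PRIx64 */
#include <stdint.h>
#include <stdio.h>

/* Assumption: may_alias is the attribute meant; it tells GCC that this type
 * may alias objects of other types, like char does. */
typedef uint64_t __attribute__((__may_alias__)) aliasing_u64;

/* Hypothetical helper: 64-bit half of value (idx 0 = low, idx 1 = high). */
static uint64_t half128(__m128i value, int idx)
{
    const aliasing_u64 *v64 = (const aliasing_u64 *)&value;
    return v64[idx];
}

/* Sketch of print128 reading the value through the may_alias typedef. */
static void print128(__m128i value)
{
    printf("%016" PRIx64 " %016" PRIx64 "\n",
           half128(value, 1), half128(value, 0));
}
```

Copying the value into a uint64_t array with memcpy would be another way to read the halves without type punning.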
I first tried the attribute directly without the typedef. It was accepted, but I still got the warning. The typedef seems to be a necessary piece of the magic.
BTW, this is my second answer here and I still hate the fact that I can't yet tell where I'm permitted to edit, so I wasn't able to post this where it belonged.
And one more thing, under AMD64, the %llx format specifier needs to be changed to %lx.