How to use 128-bit C variables with xmm 128-bit asm?

Posted on 2024-08-16 11:56:16

In gcc, I want to do a 128-bit XOR of two C variables using asm code. How?

asm (
    "movdqa %1, %%xmm1;"
    "movdqa %0, %%xmm0;"
    "pxor %%xmm1,%%xmm0;"
    "movdqa %%xmm0, %0;"

    :"=x"(buff) /* output operand */
    :"x"(bu), "x"(buff)
    :"%xmm0","%xmm1"
    );

But I get a segmentation fault. This is the objdump output:

movq   -0x80(%rbp),%xmm2
movq   -0x88(%rbp),%xmm3
movdqa %xmm2,%xmm1
movdqa %xmm2,%xmm0
pxor   %xmm1,%xmm0
movdqa %xmm0,%xmm2
movq   %xmm2,-0x78(%rbp)

寂寞美少年 2024-08-23 11:56:16

You would see segfaults if MOVDQA touches a variable in memory that isn't 16-byte aligned. The CPU can't execute MOVDQA to or from an unaligned memory address; it raises a processor-level general-protection (#GP) exception, which prompts the OS to segfault your app.

C variables you declare (stack, global) or allocate on the heap aren't guaranteed to sit on a 16-byte boundary, though you may get an aligned one by chance. You can direct the compiler to ensure proper alignment by using the __m128 or __m128i data types. Each of those declares a properly aligned 128-bit value.
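As a rough illustration of the alignment point (a sketch, not part of the original answer): the helper names is_aligned16 and xor128_unaligned are made up for this example. _mm_loadu_si128 and _mm_storeu_si128 are the unaligned load/store intrinsics from <emmintrin.h>, so they accept any address, whereas MOVDQA (and the aligned intrinsics) need a 16-byte boundary.

```c
#include <emmintrin.h>  /* SSE2 intrinsics: __m128i, _mm_loadu_si128, ... */
#include <stdint.h>     /* uintptr_t */

/* Hypothetical helper: returns 1 if p sits on a 16-byte boundary, else 0. */
static int is_aligned16(const void *p) {
    return ((uintptr_t)p % 16) == 0;
}

/* Hypothetical helper: XOR two 16-byte buffers of any alignment into out. */
static void xor128_unaligned(const void *a, const void *b, void *out) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);  /* unaligned load */
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_xor_si128(va, vb));  /* unaligned store */
}
```

A variable declared as __m128i should pass is_aligned16, while a pointer into the middle of a char buffer generally will not; xor128_unaligned works for both.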

Further, reading the objdump, it looks like the compiler wrapped the asm sequence with code that copies the operands from the stack into the xmm2 and xmm3 registers using MOVQ, only to have your asm code then copy the values to xmm0 and xmm1. After xor-ing into xmm0, the wrapper copies the result to xmm2 and then back to the stack. Overall, not terribly efficient. The listing also shows both of your loads reading xmm2: the template uses %0, which is the output operand, where the second input (%2) was presumably meant. MOVQ copies only 8 bytes, so just the low 64 bits of each operand reach the registers. Unlike MOVDQA, MOVQ does not normally require an aligned address (it faults on a misaligned one only when alignment checking is enabled). The wrapper code adds offsets that are multiples of 8 (-0x80, -0x88, and later -0x78) to the BP register, which may or may not contain a 16-byte-aligned value. Overall, there's no guarantee of 16-byte alignment in the generated code.

The following ensures the arguments and result are stored in correctly aligned memory locations, and seems to work fine:

#include <stdio.h>
#include <stdint.h>     /* int64_t */
#include <emmintrin.h>  /* SSE2: __m128i, _mm_setr_epi32 */

/* print a 128-bit value as two 64-bit hex halves, high half first */
void print128(__m128i value) {
    int64_t *v64 = (int64_t*) &value;
    printf("%.16llx %.16llx\n", (unsigned long long) v64[1], (unsigned long long) v64[0]);
}

int main(void) {
    __m128i a = _mm_setr_epi32(0x00ffff00, 0x00ffff00, 0x00ffff00, 0x10ffff00), /* low dword first! */
            b = _mm_setr_epi32(0x0000ffff, 0x0000ffff, 0x0000ffff, 0x0000ffff),
            x;

    asm (
        "movdqa %1, %%xmm0;"      /* xmm0 <- a */
        "movdqa %2, %%xmm1;"      /* xmm1 <- b */
        "pxor %%xmm1, %%xmm0;"    /* xmm0 <- xmm0 xor xmm1 */
        "movdqa %%xmm0, %0;"      /* x <- xmm0 */

        :"=x"(x)          /* output operand, %0 */
        :"x"(a), "x"(b)   /* input operands, %1, %2 */
        :"%xmm0","%xmm1"  /* clobbered registers */
    );

    /* printf the arguments and result as 2 64-bit hex values */
    print128(a);
    print128(b);
    print128(x);
    return 0;
}

compile with (gcc, Ubuntu 32-bit):

gcc -msse2 -o app app.c

output:

10ffff0000ffff00 00ffff0000ffff00
0000ffff0000ffff 0000ffff0000ffff
10ff00ff00ff00ff 00ff00ff00ff00ff

In the code above, _mm_setr_epi32 is used to initialize a and b with 128-bit values, as the compiler may not support 128-bit integer literals.
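To make the "low dword first" ordering concrete, here is a small sketch (the function names split128, make_setr, make_set and make_set64 are invented for this illustration). It builds the value of a three ways: _mm_setr_epi32 takes the lowest dword first, _mm_set_epi32 takes the highest dword first, and _mm_set_epi64x takes the high qword and then the low qword.

```c
#include <emmintrin.h>  /* SSE2: _mm_setr_epi32, _mm_set_epi32, _mm_set_epi64x */
#include <stdint.h>

/* Hypothetical helper: copy v into two 64-bit halves, out[0] = low, out[1] = high. */
static void split128(__m128i v, uint64_t out[2]) {
    _mm_storeu_si128((__m128i *)out, v);
}

/* Lowest dword first ("reversed" order). */
static __m128i make_setr(void) {
    return _mm_setr_epi32(0x00ffff00, 0x00ffff00, 0x00ffff00, 0x10ffff00);
}

/* Highest dword first: same value as make_setr. */
static __m128i make_set(void) {
    return _mm_set_epi32(0x10ffff00, 0x00ffff00, 0x00ffff00, 0x00ffff00);
}

/* High qword, then low qword: same value again. */
static __m128i make_set64(void) {
    return _mm_set_epi64x(0x10ffff0000ffff00LL, 0x00ffff0000ffff00LL);
}
```

All three should split into a low half of 0x00ffff0000ffff00 and a high half of 0x10ffff0000ffff00, matching the first line of the output above.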

print128 writes out the hexadecimal representation of a 128-bit integer as two 64-bit halves, as printf may not be able to print a 128-bit value directly.

The following is shorter and avoids some of the duplicate copying. The compiler adds whatever register moves are needed around the asm so that pxor %2,%0 works without you having to load the registers yourself:

#include <stdio.h>
#include <stdint.h>     /* int64_t */
#include <emmintrin.h>

void print128(__m128i value) {
    int64_t *px = (int64_t*) &value;
    printf("%.16llx %.16llx\n", (unsigned long long) px[1], (unsigned long long) px[0]);
}

int main(void) {
    __m128i a = _mm_setr_epi32(0x00ffff00, 0x00ffff00, 0x00ffff00, 0x10ffff00),
            b = _mm_setr_epi32(0x0000ffff, 0x0000ffff, 0x0000ffff, 0x0000ffff);

    asm (
        "pxor %2, %0;"    /* a <- b xor a  */

        :"=x"(a)          /* output operand, %0 */
        :"0"(a), "x"(b)   /* input operands: %1 is tied to the register of %0, %2 */
        );                /* without the "0" tie, %0 could start out holding garbage */

    print128(a);
    return 0;
}

compile as before:

gcc -msse2 -o app app.c

output:

10ff00ff00ff00ff 00ff00ff00ff00ff
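The single-instruction form can also be wrapped in a small function. This is only a sketch, and the name xor128_asm is made up here; it uses the "+x" constraint, which GCC documents as a read-write operand in an SSE register, so no separate input entry for a is needed.

```c
#include <emmintrin.h>  /* __m128i */

/* Hypothetical wrapper: returns a XOR b using one PXOR in GCC extended asm.
   "+x"(a): a is both read and written, in an SSE register picked by the compiler.
   "x"(b) : b is read-only, also in an SSE register.
   No clobber list is needed because no register is named explicitly. */
static __m128i xor128_asm(__m128i a, __m128i b) {
    __asm__ ("pxor %1, %0" : "+x"(a) : "x"(b));
    return a;
}
```

Because the operands are constrained to registers, the asm itself never touches memory, so alignment only matters for however the caller loads and stores the values.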

Alternatively, if you'd like to avoid the inline assembly, you could use the SSE intrinsics instead. Those are inlined functions/macros that encapsulate MMX/SSE instructions with a C-like syntax. _mm_xor_si128 reduces your task to a single call:

#include <stdio.h>
#include <stdint.h>     /* int64_t */
#include <emmintrin.h>  /* SSE2: _mm_xor_si128 */

void print128(__m128i value) {
    int64_t *v64 = (int64_t*) &value;
    printf("%.16llx %.16llx\n", (unsigned long long) v64[1], (unsigned long long) v64[0]);
}

int main(void)
{
    __m128i x = _mm_xor_si128(
        _mm_setr_epi32(0x00ffff00, 0x00ffff00, 0x00ffff00, 0x10ffff00), /* low dword first! */
        _mm_setr_epi32(0x0000ffff, 0x0000ffff, 0x0000ffff, 0x0000ffff));

    print128(x);
    return 0;
}

compile:

gcc -msse2 -o app app.c

output:

10ff00ff00ff00ff 00ff00ff00ff00ff

秋意浓 2024-08-23 11:56:16

Um, why not use the __builtin_ia32_pxor intrinsic?
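In the same spirit, and without assuming any particular builtin name, GCC and Clang also accept ordinary C operators on their vector types, and __m128i is such a type there. The sketch below relies on that compiler extension; the function name xor128_op is invented for the example, and this is not portable to compilers that lack vector extensions.

```c
#include <emmintrin.h>  /* __m128i (a GCC/Clang vector type) */

/* Hypothetical helper: 128-bit XOR written with the plain ^ operator.
   This depends on the GCC/Clang vector extension, not on standard C. */
static __m128i xor128_op(__m128i a, __m128i b) {
    return a ^ b;
}
```

The compiler is then free to pick the instruction itself, typically a PXOR when SSE2 is enabled.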

兔小萌 2024-08-23 11:56:16

With recent versions of gcc (mine is 4.5.5), the option -O2 or above implies -fstrict-aliasing, which makes the compiler warn about the code given above:

supersuds.cpp:31: warning: dereferencing pointer ‘v64’ does break strict-aliasing rules
supersuds.cpp:30: note: initialized from here

This can be remedied by supplying additional type attributes as follows:

typedef int64_t __attribute__((__may_alias__)) alias_int64_t;
void print128(__m128i value) {
    alias_int64_t *v64 = (alias_int64_t*) &value;  /* cast to the may_alias type */
    printf("%.16lx %.16lx\n", v64[1], v64[0]);
}

I first tried the attribute directly without the typedef. It was accepted, but I still got the warning. The typedef seems to be a necessary piece of the magic.

BTW, this is my second answer here and I still hate the fact that I can't yet tell where I'm permitted to edit, so I wasn't able to post this where it belonged.

And one more thing: under AMD64 (where int64_t is a long), the %llx format specifier needs to be changed to %lx, as in the snippet above.
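Another way to sidestep both the aliasing warning and the %lx versus %llx question is to copy the bytes out with memcpy and print with the PRIx64 macro from <inttypes.h>. This is only a sketch of that alternative; the function name format128 and its buffer-based interface are assumptions made for the example, not something from the answers above.

```c
#include <emmintrin.h>  /* __m128i */
#include <inttypes.h>   /* uint64_t, PRIx64 */
#include <stdio.h>      /* snprintf */
#include <string.h>     /* memcpy */

/* Hypothetical helper: writes "<high 64 bits> <low 64 bits>" in hex into buf.
   buf should hold at least 34 bytes (16 + 1 + 16 digits plus the terminator). */
static void format128(__m128i value, char *buf, size_t size) {
    uint64_t half[2];
    memcpy(half, &value, sizeof half);  /* byte copy: no pointer cast to alias through */
    snprintf(buf, size, "%016" PRIx64 " %016" PRIx64, half[1], half[0]);
}
```

PRIx64 expands to the right length modifier for uint64_t on the target, so the same source should build unchanged on 32-bit and 64-bit systems.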
