Loading constant floats into SSE registers

Posted 2024-10-17 22:34:29

I'm trying to figure out an efficient way to load compile-time constant floats into SSE(2/3) registers. I've tried simple code like this:

const __m128 x = { 1.0f, 2.0f, 3.0f, 4.0f };

but that generates 4 movss instructions from memory!

movss       xmm0,dword ptr [__real@3f800000 (14048E534h)] 
movss       xmm1,dword ptr [__real@40000000 (14048E530h)] 
movaps      xmm6,xmm12 
shufps      xmm6,xmm12,0C6h 
movss       dword ptr [rsp],xmm0 
movss       xmm0,dword ptr [__real@40400000 (14048E52Ch)] 
movss       dword ptr [rsp+4],xmm1 
movss       xmm1,dword ptr [__real@40a00000 (14048E528h)]

which load the scalars in and out of memory... (?!?!)

Doing this though..

float Align(16) myfloat4[4] = { 1.0f, 2.0f, 3.0f, 4.0f, }; // out in global scope

generates.

movaps      xmm5,xmmword ptr [::myarray4 (140512050h)]

Ideally, since these are constants, there would be a way to avoid touching memory at all and do it with immediate-style instructions (i.e. with the constants compiled into the instruction itself).

Thanks

平生欢 2024-10-24 22:34:29

If you want to force it to a single load, you could try (gcc):

__attribute__((aligned(16))) float vec[4] = { 1.0f, 1.1f, 1.2f, 1.3f };
__m128 v = _mm_load_ps(vec); // vec already decays to a pointer, so no "&" is needed

If you have Visual C++, use __declspec(align(16)) to request the proper constraint.

On my system, this (compiled with gcc -m32 -msse -O2; compiling with no optimization at all clutters the code but still ends with a single movaps) produces the following assembly code (gcc / AT&T syntax):

    andl    $-16, %esp
    subl    $16, %esp
    movl    $0x3f800000, (%esp)
    movl    $0x3f8ccccd, 4(%esp)
    movl    $0x3f99999a, 8(%esp)
    movl    $0x3fa66666, 12(%esp)
    movaps  (%esp), %xmm0

Note that it aligns the stack pointer before allocating stack space and putting the constants there. Leaving out __attribute__((aligned)) may, depending on your compiler, produce incorrect code that doesn't do this, so beware and check the disassembly.

Additionally:
Since you asked how to put constants into the code, simply try the above with a static qualifier on the float array. That produces the following assembly:

    movaps  vec.7330, %xmm0
    ...
vec.7330:
    .long   1065353216
    .long   1066192077
    .long   1067030938
    .long   1067869798
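
A minimal sketch of the static, aligned-array approach that should build with gcc/clang as well as Visual C++. The `ALIGN16` macro and the names `kVec` and `load_const4` are made up for this example. Whether the compiler really emits a single `movaps` still depends on the compiler and flags, so check the disassembly.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_load_ps */

/* Hypothetical helper macro: pick the compiler's alignment syntax. */
#if defined(_MSC_VER)
#  define ALIGN16 __declspec(align(16))
#else
#  define ALIGN16 __attribute__((aligned(16)))
#endif

/* static const: the data is emitted once into a data section instead of
   being rebuilt on the stack each time the function runs. */
ALIGN16 static const float kVec[4] = { 1.0f, 1.1f, 1.2f, 1.3f };

/* _mm_load_ps requires a 16-byte-aligned address; kVec is declared that way. */
static __m128 load_const4(void)
{
    return _mm_load_ps(kVec);
}
```

Note that the constants still live in memory here; they are only moved out of the function body, not into the instruction encoding.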

独夜无伴 2024-10-24 22:34:29

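First, at what optimization level are you compiling? It's not uncommon to see that kind of codegen at -O0 or -O1, but I would be quite surprised to see it at -O2 or higher in most compilers.

Second, SSE has no load-immediate instructions. You can load an immediate into a GPR and then move that value to an SSE register, but you cannot construct other values without an actual load (ignoring certain special values such as 0 or (int)-1, which can be produced with logical operations).

Finally, if bad code is being generated with optimizations turned on and in a performance-critical location, file a bug against your compiler.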
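
To illustrate the second point, here is a small sketch (the function names are invented for this example). An immediate can reach an SSE register only by way of a general-purpose register, while all-zero can be produced with a register-only idiom.

```c
#include <emmintrin.h>  /* SSE2: _mm_cvtsi32_si128, _mm_shuffle_epi32 */

/* Hypothetical helper: broadcast the float whose IEEE-754 bit pattern is
   `bits` to all four lanes without a memory constant. This typically maps
   to mov r32, imm32 / movd xmm, r32 / pshufd, but the compiler is free to
   turn it back into a load. */
static __m128 splat_from_bits(int bits)
{
    __m128i v = _mm_cvtsi32_si128(bits);  /* lane 0 = bits, other lanes 0 */
    v = _mm_shuffle_epi32(v, 0);          /* copy lane 0 into every lane */
    return _mm_castsi128_ps(v);           /* reinterpret bits, no conversion */
}

/* All-zero needs neither memory nor an immediate (usually xorps reg, reg). */
static __m128 all_zero(void)
{
    return _mm_setzero_ps();
}
```

For example, `splat_from_bits(0x3f800000)` gives four lanes of 1.0f.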

够运 2024-10-24 22:34:29

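Normally constants like these would be loaded before any loops or "hot" parts of the code, so performance should not matter that much. But if you can't avoid doing this kind of thing inside a loop, I would try _mm_set_ps first and see what it generates. Also try ICC rather than gcc, as it tends to generate better code.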
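
A short sketch of the `_mm_set_ps` suggestion (the function names are made up here). The one thing to watch is argument order: `_mm_set_ps` lists lanes from highest to lowest, while `_mm_setr_ps` takes them in memory order.

```c
#include <xmmintrin.h>  /* SSE: _mm_set_ps, _mm_setr_ps */

/* _mm_set_ps takes its arguments from the highest lane down to lane 0,
   so this vector is { 1, 2, 3, 4 } when stored to memory. */
static __m128 make_1234(void)
{
    return _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
}

/* _mm_setr_ps builds the same vector with lane 0 given first. */
static __m128 make_1234_r(void)
{
    return _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);
}
```

With optimization enabled, compilers commonly turn either form into a single aligned load from a constant in a data section, but that is worth confirming in the disassembly.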

韶华倾负 2024-10-24 22:34:29

Generating constants is much simpler (and quicker) if the four float constants are the same. For example, the bit pattern for 1.f is 0x3f800000. One way to generate it using SSE2 is:

        register __m128i onef;
        __asm__ ( "pcmpeqb %0, %0" : "=x" ( onef ) );
        onef = _mm_slli_epi32( onef, 25 );
        onef = _mm_srli_epi32( onef, 2 );

Another approach, with SSE4.1, is:

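        register uint32_t t = 0x3f800000;
        register __m128i onef;  /* _mm_shuffle_epi32 operates on __m128i */
        /* gcc's default AT&T syntax: immediate first, destination last */
        __asm__ ( "pinsrd $0, %1, %0" : "=x" ( onef ) : "r" ( t ) );
        onef = _mm_shuffle_epi32( onef, 0 );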

Note that I'm not positive this version is any faster than the SSE2 one; I have not profiled it, only tested that the result was correct.

If the values of each of the four floats must be different, then each of the constants can be generated and shuffled or blended together.

Whether or not this is useful depends on whether a cache miss is likely; otherwise, loading the constant from memory is quicker. Tricks like this are very helpful in VMX/AltiVec, but the large caches on most PCs may make them less useful for SSE.

There is a good discussion of this in Agner Fog's Optimization Manual, book 2, section 13.4, http://www.agner.org/optimize/.

A final note: the inline assembler above is gcc-specific. It is there so that uninitialized variables can be used without generating a compiler warning. With VC, you may or may not need to first initialize the variables with _mm_setzero_ps(), then hope that the optimizer can remove this.
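
The SSE2 sequence can also be written with intrinsics only, which avoids the gcc-specific inline assembly and should build with Visual C++ as well. This is only a sketch, and `ones_ps` is a name made up here. An optimizing compiler may constant-fold the whole sequence back into a load from memory, so inspect the generated code before relying on it.

```c
#include <emmintrin.h>  /* SSE2: integer compare and shift intrinsics */

/* Build four lanes of 1.0f (bit pattern 0x3f800000) from register-only
   operations: all-ones, then two shifts. */
static __m128 ones_ps(void)
{
    __m128i v = _mm_setzero_si128();
    v = _mm_cmpeq_epi32(v, v);   /* each lane equals itself: all bits set */
    v = _mm_slli_epi32(v, 25);   /* 0xfe000000 in every lane */
    v = _mm_srli_epi32(v, 2);    /* 0x3f800000 in every lane */
    return _mm_castsi128_ps(v);  /* reinterpret as four floats */
}
```

The explicit `_mm_setzero_si128()` is the initialization mentioned above; compilers usually drop it because comparing a register with itself does not depend on its previous contents.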
