Loading constant floats into SSE registers
I'm trying to figure out an efficient way to load compile-time constant floats into SSE(2/3) registers. I've tried simple code like this:
const __m128 x = { 1.0f, 2.0f, 3.0f, 4.0f };
but that generates 4 movss instructions from memory!
movss xmm0,dword ptr [__real@3f800000 (14048E534h)]
movss xmm1,dword ptr [__real@40000000 (14048E530h)]
movaps xmm6,xmm12
shufps xmm6,xmm12,0C6h
movss dword ptr [rsp],xmm0
movss xmm0,dword ptr [__real@40400000 (14048E52Ch)]
movss dword ptr [rsp+4],xmm1
movss xmm1,dword ptr [__real@40a00000 (14048E528h)]
which load the scalars in and out of memory... (?!?!)
Doing this though..
float Align(16) myfloat4[4] = { 1.0f, 2.0f, 3.0f, 4.0f, }; // out in global scope
generates.
movaps xmm5,xmmword ptr [::myarray4 (140512050h)]
Ideally, if I have constants, it would be nice if there were a way to avoid touching memory at all and do it with immediate-style instructions (i.e. the constants compiled into the instruction itself).
Thanks
Comments (4)
If you want to force it to a single load, you could try (gcc):
If you have Visual C++, use __declspec(align(16)) to request the proper constraint.
On my system, this (compiled with gcc -m32 -msse -O2; with no optimization at all the code is more cluttered but still retains the single movaps in the end) creates the following assembly code (gcc / AT&T syntax):
Note that it aligns the stack pointer before allocating stack space and putting the constants in there. Leaving out __attribute__((aligned)) may, depending on your compiler, create incorrect code that doesn't do this, so beware, and check the disassembly.
Additionally:
Since you've been asking how to put constants into the code, simply try the above with a static qualifier for the float array. That creates the following assembly:
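As a minimal sketch of the kind of code the first answer describes (not the answerer's original snippet), assuming GCC, Clang or Visual C++ on x86: a 16-byte-aligned static float array loaded with a single aligned load. The names kConstants, load_constants and lane_at are hypothetical and exist only for illustration.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_load_ps, _mm_storeu_ps */

/* Hypothetical constant table. The 16-byte alignment is what allows an
   aligned load (movaps) instead of four scalar loads. */
#if defined(_MSC_VER)
__declspec(align(16)) static const float kConstants[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
#else
static const float kConstants[4] __attribute__((aligned(16))) = { 1.0f, 2.0f, 3.0f, 4.0f };
#endif

/* Loads all four constants with one aligned vector load. */
static __m128 load_constants(void)
{
    return _mm_load_ps(kConstants);
}

/* Inspection helper: copies lane i (0..3) of v out to a scalar. */
static float lane_at(__m128 v, int i)
{
    float out[4];
    _mm_storeu_ps(out, v);
    return out[i];
}
```

Whether the compiler really emits one movaps still depends on the compiler and flags, so checking the disassembly, as the answer suggests, remains worthwhile.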
First off, what optimization level are you compiling at? It's not uncommon to see that sort of codegen at -O0 or -O1, but I would be quite surprised to see it with -O2 or higher in most compilers.
Second, there are no immediate loads in SSE. You can do a load immediate to a GPR, then move that value to SSE, but you cannot conjure other values without an actual load (ignoring certain special values like 0 or (int)-1, which can be produced via logical operations).
Finally, if the bad code is being generated with optimizations turned on and in a performance-critical location, please file a bug against your compiler.
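The special cases mentioned in the second answer can be sketched with SSE2 intrinsics. This is an illustration under the assumption of an x86 compiler that provides <emmintrin.h>; the function names are hypothetical, and an optimizing compiler is free to pick different instructions than the comments suggest.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (also pulls in the SSE ones) */
#include <stdint.h>

/* All bits zero: usually compiled to xorps/pxor of a register with itself. */
static __m128 all_zero(void)
{
    return _mm_setzero_ps();
}

/* (int)-1 in every lane: comparing a register with itself for equality
   sets every bit, with no memory operand involved. */
static __m128i all_ones(void)
{
    __m128i z = _mm_setzero_si128();
    return _mm_cmpeq_epi32(z, z);
}

/* Immediate to GPR, then GPR to SSE: places one 32-bit pattern in lane 0
   and zeros the other lanes. */
static __m128 from_bits(uint32_t bits)
{
    return _mm_castsi128_ps(_mm_cvtsi32_si128((int)bits));
}

/* Inspection helper: returns the raw bits of lane 0. */
static uint32_t lane0_bits(__m128 v)
{
    return (uint32_t)_mm_cvtsi128_si32(_mm_castps_si128(v));
}
```

For example, from_bits(0x3f800000u) yields 1.0f in the lowest lane, since 0x3f800000 is the bit pattern of 1.0f.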
Normally constants such as this would be loaded prior to any loops or "hot" parts of the code, so performance should not be that important. But if you can't avoid doing this kind of thing inside a loop, then I would try _mm_set_ps first and see what that generates. Also try ICC rather than gcc, as it tends to generate better code.
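A small sketch of the _mm_set_ps suggestion from the third answer, assuming an x86 compiler with <xmmintrin.h>. One detail that is easy to get wrong: _mm_set_ps takes its arguments from the highest lane down to the lowest, while _mm_setr_ps takes them in memory order. The function names are hypothetical.

```c
#include <xmmintrin.h>  /* SSE: _mm_set_ps, _mm_setr_ps */

/* Arguments run from lane 3 down to lane 0, so lane 0 holds 1.0f. */
static __m128 constants_set(void)
{
    return _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
}

/* "Reversed" variant: arguments are given in lane order 0..3. */
static __m128 constants_setr(void)
{
    return _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);
}

/* Inspection helper: copies lane i (0..3) of v out to a scalar. */
static float get_lane(__m128 v, int i)
{
    float out[4];
    _mm_storeu_ps(out, v);
    return out[i];
}
```

Both functions produce the same vector. With optimization enabled, compilers commonly turn either one into a single load from a constant in read-only data, but that is worth confirming in the generated assembly.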
Generating constants is much simpler (and quicker) if the four float constants are the same. For example, the bit pattern for 1.f is 0x3f800000. One way this can be generated using SSE2 is:
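A minimal sketch of one well-known SSE2 sequence for this, written with intrinsics rather than the inline assembler the answer refers to, so it is not the answerer's original code: set all bits, shift left by 25, then shift right logically by 2, which leaves 0x3f800000 in every lane. The name splat_one is hypothetical, and the register is zero-initialized first to avoid reading an uninitialized variable.

```c
#include <emmintrin.h>  /* SSE2 integer intrinsics */

/* Builds {1.0f, 1.0f, 1.0f, 1.0f} without a memory operand in the source. */
static __m128 splat_one(void)
{
    __m128i z    = _mm_setzero_si128();
    __m128i ones = _mm_cmpeq_epi32(z, z);    /* 0xFFFFFFFF in each lane */
    __m128i t    = _mm_slli_epi32(ones, 25); /* 0xFE000000 */
    t            = _mm_srli_epi32(t, 2);     /* 0x3F800000 == 1.0f */
    return _mm_castsi128_ps(t);
}
```

An optimizing compiler may constant-fold the whole sequence and emit a load from memory anyway, so the disassembly is the only way to know what was actually generated.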
Another approach with SSE4.1 is,
Note that I'm not positive whether this version is any faster than the SSE2 one; I have not profiled it, only tested that the result was correct.
If the values of each of the four floats must be different, then each of the constants can be generated and shuffled or blended together.
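To illustrate combining generated constants, here is a sketch that builds 1.0f and 2.0f in registers and merges them with an unpack or a shuffle. It assumes SSE2 only, so it uses shuffle-style instructions rather than the SSE4.1 blend instructions; all function names are hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* 0xFFFFFFFF in every lane. */
static __m128i ones_mask(void)
{
    __m128i z = _mm_setzero_si128();
    return _mm_cmpeq_epi32(z, z);
}

/* 1.0f in every lane: (all-ones << 25) >> 2 == 0x3F800000. */
static __m128 gen_one(void)
{
    return _mm_castsi128_ps(_mm_srli_epi32(_mm_slli_epi32(ones_mask(), 25), 2));
}

/* 2.0f in every lane: (all-ones << 31) >> 1 == 0x40000000. */
static __m128 gen_two(void)
{
    return _mm_castsi128_ps(_mm_srli_epi32(_mm_slli_epi32(ones_mask(), 31), 1));
}

/* Interleaves the low lanes of both inputs: {1, 2, 1, 2}. */
static __m128 one_two_one_two(void)
{
    return _mm_unpacklo_ps(gen_one(), gen_two());
}

/* Low two lanes from the first input, high two from the second: {1, 1, 2, 2}. */
static __m128 one_one_two_two(void)
{
    return _mm_shuffle_ps(gen_one(), gen_two(), _MM_SHUFFLE(0, 0, 0, 0));
}
```

Each extra distinct value costs a few more instructions, which is part of why a plain load from memory often ends up being the cheaper choice.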
Whether or not this is useful depends on whether a cache miss is likely; otherwise, loading the constant from memory is quicker. Tricks like this are very helpful in VMX/AltiVec, but the large caches on most PCs may make this less useful for SSE.
There is a good discussion of this in Agner Fog's Optimization Manual, book 2, section 13.4, http://www.agner.org/optimize/.
Final note: the use of inline assembler above is gcc-specific; the reason for it is to allow the use of uninitialized variables without generating a compiler warning. With VC, you may or may not need to first initialize the variables with _mm_setzero_ps(), then hope that the optimizer can remove this.