How to force gcc to use all SSE (or AVX) registers?
I'm trying to write some computationally intensive code for a Windows x64 target, using SSE or the new AVX instructions, compiling with GCC 4.5.2 and 4.6.1 under MinGW64 (the TDM GCC build and some custom builds). My compiler options are -O3 -mavx (-m64 is implied).
In short, I want to perform a lengthy computation on four 3D vectors of packed floats. That requires 4x3=12 xmm or ymm registers for storage, plus 2 or 3 registers for temporary results. IMHO this should fit snugly in the 16 SSE (or AVX) registers available on 64-bit targets. However, GCC produces very suboptimal code with register spilling, using only registers xmm0-xmm10 and shuffling data to and from the stack. My question is:
Is there a way to convince GCC to use all the registers xmm0-xmm15?
To fix ideas, consider the following SSE code (for illustration only):
    void example(vect<__m128> q1, vect<__m128> q2, vect<__m128>& a1, vect<__m128>& a2) {
        for (int i = 0; i < 10; i++) {
            vect<__m128> v = q2 - q1;
            a1 += v;
            // a2 -= v;
            q2 *= _mm_set1_ps(2.);
        }
    }
Here vect<__m128> is simply a struct of 3 __m128, with the natural addition and multiplication by a scalar. When the line a2 -= v is commented out, i.e. we need only 3x3 registers for storage since we are ignoring a2, the generated code is indeed straightforward: there are no moves, and everything is performed in registers xmm0-xmm10. When I uncomment a2 -= v, the code is pretty awful, with a lot of shuffling between registers and the stack, even though the compiler could just use registers xmm11-xmm13 or something.
I actually haven't seen GCC use any of the registers xmm11-xmm15 anywhere in all my code yet. What am I doing wrong? I understand that they are callee-saved registers, but this overhead is completely justified by simplifying the loop code.
Comments (2)
Two points:
So, if you want better register allocation, you basically have two options:
Actually, what you are seeing is not spilling: gcc is operating on a1 and a2 in memory because it cannot know whether they alias each other. If you declare the last two parameters as
    vect<__m128>& __restrict__
GCC can and will register allocate a1 and a2.
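To make the suggestion concrete, here is a minimal sketch of the question's function with the two reference parameters marked __restrict__. The vect template and its operators are a hypothetical reconstruction from the question's description ("a struct of 3 __m128, with natural addition and multiplication by scalar"), not the asker's actual code, and the literal 2. is written as 2.0f here. Note that __restrict__ is a promise made by the caller: passing the same object for a1 and a2 would be undefined behavior.

```cpp
#include <xmmintrin.h>

// Hypothetical stand-in for the question's vect<T>: three packed values.
// (Newer GCC versions may warn that attributes on the template argument
// __m128 are ignored; the sketch assumes that warning is acceptable.)
template <typename T>
struct vect {
    T x, y, z;
};

// Component-wise difference of two packed 3D vectors.
inline vect<__m128> operator-(const vect<__m128>& a, const vect<__m128>& b) {
    return { _mm_sub_ps(a.x, b.x), _mm_sub_ps(a.y, b.y), _mm_sub_ps(a.z, b.z) };
}

// Component-wise in-place addition.
inline vect<__m128>& operator+=(vect<__m128>& a, const vect<__m128>& b) {
    a.x = _mm_add_ps(a.x, b.x);
    a.y = _mm_add_ps(a.y, b.y);
    a.z = _mm_add_ps(a.z, b.z);
    return a;
}

// Component-wise in-place subtraction.
inline vect<__m128>& operator-=(vect<__m128>& a, const vect<__m128>& b) {
    a.x = _mm_sub_ps(a.x, b.x);
    a.y = _mm_sub_ps(a.y, b.y);
    a.z = _mm_sub_ps(a.z, b.z);
    return a;
}

// In-place multiplication of every component by a packed scalar.
inline vect<__m128>& operator*=(vect<__m128>& a, __m128 s) {
    a.x = _mm_mul_ps(a.x, s);
    a.y = _mm_mul_ps(a.y, s);
    a.z = _mm_mul_ps(a.z, s);
    return a;
}

// Same loop as in the question, with a2 -= v enabled. __restrict__ tells
// the compiler that a1 and a2 never refer to the same object, so it does
// not have to store a1 and reload a2 on every iteration.
void example(vect<__m128> q1, vect<__m128> q2,
             vect<__m128>& __restrict__ a1, vect<__m128>& __restrict__ a2) {
    for (int i = 0; i < 10; i++) {
        vect<__m128> v = q2 - q1;
        a1 += v;
        a2 -= v;
        q2 *= _mm_set1_ps(2.0f);
    }
}
```

Compiling this with the question's options plus -S (for example g++ -O3 -mavx -S) and comparing the loop body with and without __restrict__ is one way to check whether the extra loads and stores disappear on a given GCC version.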