How do I force gcc to use all SSE (or AVX) registers?

Posted on 2024-11-06 22:51:59

I'm trying to write some computationally intensive code for a Windows x64 target, using SSE or the new AVX instructions, compiled with GCC 4.5.2 and 4.6.1 under MinGW64 (the TDM GCC build and some custom builds). My compiler options are -O3 -mavx. (-m64 is implied.)

In short, I want to perform some lengthy computation on four 3D vectors of packed floats. That requires 4x3=12 xmm or ymm registers for storage, plus 2 or 3 registers for temporary results. IMHO this should fit snugly in the 16 SSE (or AVX) registers available on 64-bit targets. However, GCC produces very suboptimal code with register spilling, using only registers xmm0-xmm10 and shuffling data to and from the stack. My question is:

Is there a way to convince GCC to use all the registers xmm0-xmm15?

To fix ideas, consider the following SSE code (for illustration only):

void example(vect<__m128> q1, vect<__m128> q2, vect<__m128>& a1, vect<__m128>& a2) {
    for (int i=0; i < 10; i++) {
        vect<__m128> v = q2 - q1;
        a1 += v;
//      a2 -= v;

        q2 *= _mm_set1_ps(2.);
    }
}

Here vect<__m128> is simply a struct of 3 __m128, with natural addition and multiplication by a scalar. When the line a2 -= v is commented out, i.e. we need only 3x3 registers for storage since we are ignoring a2, the generated code is indeed straightforward: there are no moves, and everything is performed in registers xmm0-xmm10. When I uncomment a2 -= v, the code is pretty awful, with a lot of shuffling between registers and the stack, even though the compiler could just use registers xmm11-xmm13 or something.
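For reference, here is a minimal sketch of what such a vect type might look like. The original definition is not shown, so the member names x, y, z, the exact operator set, and the helpers lane0 and check_example are all assumptions made for illustration. The loop is written with the a2 update enabled, which is the variant that spills.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_add_ps, ...

// Hypothetical reconstruction of the question's vect<T>: three packed
// components, each holding one lane per 3D vector.
template <typename T>
struct vect {
    T x, y, z;
};

// Component-wise difference.
inline vect<__m128> operator-(const vect<__m128>& a, const vect<__m128>& b) {
    return { _mm_sub_ps(a.x, b.x), _mm_sub_ps(a.y, b.y), _mm_sub_ps(a.z, b.z) };
}

// Component-wise accumulate.
inline vect<__m128>& operator+=(vect<__m128>& a, const vect<__m128>& b) {
    a.x = _mm_add_ps(a.x, b.x);
    a.y = _mm_add_ps(a.y, b.y);
    a.z = _mm_add_ps(a.z, b.z);
    return a;
}

// Component-wise subtract-in-place.
inline vect<__m128>& operator-=(vect<__m128>& a, const vect<__m128>& b) {
    a.x = _mm_sub_ps(a.x, b.x);
    a.y = _mm_sub_ps(a.y, b.y);
    a.z = _mm_sub_ps(a.z, b.z);
    return a;
}

// Multiply every component by the same packed scalar.
inline vect<__m128>& operator*=(vect<__m128>& a, __m128 s) {
    a.x = _mm_mul_ps(a.x, s);
    a.y = _mm_mul_ps(a.y, s);
    a.z = _mm_mul_ps(a.z, s);
    return a;
}

// The loop from the question, with the a2 update enabled.
void example(vect<__m128> q1, vect<__m128> q2, vect<__m128>& a1, vect<__m128>& a2) {
    for (int i = 0; i < 10; i++) {
        vect<__m128> v = q2 - q1;
        a1 += v;
        a2 -= v;
        q2 *= _mm_set1_ps(2.0f);
    }
}

// Hypothetical helper: read lane 0 of a packed value.
inline float lane0(__m128 v) { return _mm_cvtss_f32(v); }

// Sanity check under assumed inputs: q1 = 1, q2 = 3, accumulators start at 0.
// Then v_i = 3*2^i - 1, so a1 = 3*(2^10 - 1) - 10 = 3059 and a2 = -3059.
inline void check_example() {
    vect<__m128> q1 = { _mm_set1_ps(1.0f), _mm_set1_ps(1.0f), _mm_set1_ps(1.0f) };
    vect<__m128> q2 = { _mm_set1_ps(3.0f), _mm_set1_ps(3.0f), _mm_set1_ps(3.0f) };
    vect<__m128> a1 = { _mm_setzero_ps(), _mm_setzero_ps(), _mm_setzero_ps() };
    vect<__m128> a2 = { _mm_setzero_ps(), _mm_setzero_ps(), _mm_setzero_ps() };
    example(q1, q2, a1, a2);
    assert(lane0(a1.x) == 3059.0f);
    assert(lane0(a2.x) == -3059.0f);
}
```

With this layout the live state inside the loop is q1, q2, a1 and a2 (12 packed values), plus v and the broadcast constant, which is where the estimate of 12 storage registers and a few temporaries comes from.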

I actually haven't seen GCC use any of the registers xmm11-xmm15 anywhere in all my code yet. What am I doing wrong? I understand that they are callee-saved registers, but this overhead is completely justified by simplifying the loop code.

你的背包 2024-11-13 22:51:59

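Two points:

First, you're making a lot of assumptions. Register spilling is fairly cheap on x86 CPUs (thanks to fast L1 caches, register shadowing, and other tricks), and the registers available only in 64-bit mode are more costly to access (they need larger instruction encodings), so GCC's version may well be as fast as, or faster than, the one you want.
Second, GCC, like any compiler, does the best register allocation it can. There is no "please do better register allocation" option, because if there were, it would always be enabled. The compiler isn't trying to spite you. (As I recall, register allocation is an NP-complete problem, so the compiler will never be able to generate a perfect solution. The best it can do is approximate.)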
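One way to test the first point is to time the code instead of judging it from the assembly listing. The following is only a rough sketch: kernel is a hypothetical stand-in for the real loop, and time_ns is a hypothetical helper, neither of which comes from the question. A serious benchmark would also need warm-up runs and many repetitions.

```cpp
#include <cassert>
#include <chrono>
#include <xmmintrin.h>

// Hypothetical stand-in kernel: one component of the question's loop.
// It accumulates (q2 - q1) ten times while doubling q2 after each step.
// Replace it with the real code under test.
inline float kernel(float a, float b) {
    __m128 q1 = _mm_set1_ps(a);
    __m128 q2 = _mm_set1_ps(b);
    __m128 acc = _mm_setzero_ps();
    for (int i = 0; i < 10; i++) {
        acc = _mm_add_ps(acc, _mm_sub_ps(q2, q1));
        q2 = _mm_mul_ps(q2, _mm_set1_ps(2.0f));
    }
    return _mm_cvtss_f32(acc);
}

// Hypothetical helper: average wall-clock nanoseconds per call of f(r)
// over `reps` repetitions. The running sum is written to `sink` so the
// optimizer cannot simply discard the calls.
template <typename F>
double time_ns(F f, int reps, float& sink) {
    assert(reps > 0);
    float sum = 0.0f;
    auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; r++) {
        sum += f(r);
    }
    auto stop = std::chrono::steady_clock::now();
    sink = sum;
    return std::chrono::duration<double, std::nano>(stop - start).count() / reps;
}
```

A call such as time_ns([](int r) { return kernel(1.0f, 3.0f + r % 2); }, 1000000, sink) gives a per-call figure that can be compared between two builds of the same loop, for example one that spills and one that does not.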

So, if you want better register allocation, you basically have two options: