How to force gcc to use all SSE (or AVX) registers?

Posted 2024-11-06 22:51:59

I'm trying to write some computationally intensive code for a Windows x64 target, with SSE or the new AVX instructions, compiling in GCC 4.5.2 and 4.6.1, MinGW64 (TDM GCC build, and some custom build). My compiler options are -O3 -mavx. (-m64 is implied.)

In short, I want to perform some lengthy computation on 4 3D vectors of packed floats. That requires 4x3=12 xmm or ymm registers for storage, and 2 or 3 registers for temporary results. This should IMHO fit snugly in the 16 SSE (or AVX) registers available on 64-bit targets. However, GCC produces very suboptimal code with register spilling, using only registers xmm0-xmm10 and shuffling data to and from the stack. My question is:

Is there a way to convince GCC to use all the registers xmm0-xmm15?

To fix ideas, consider the following SSE code (for illustration only):

void example(vect<__m128> q1, vect<__m128> q2, vect<__m128>& a1, vect<__m128>& a2) {
    for (int i=0; i < 10; i++) {
        vect<__m128> v = q2 - q1;
        a1 += v;
//      a2 -= v;

        q2 *= _mm_set1_ps(2.);
    }
}

Here vect<__m128> is simply a struct of 3 __m128, with the natural addition and multiplication by a scalar. When the line a2 -= v is commented out, i.e. we need only 3x3 registers for storage since we are ignoring a2, the produced code is indeed straightforward with no moves, and everything is performed in registers xmm0-xmm10. When I uncomment a2 -= v, the code is pretty awful, with a lot of shuffling between registers and the stack, even though the compiler could just use registers xmm11-xmm13 or something.
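For readers who want to reproduce the example, here is a minimal sketch of what such a struct might look like. The question does not show the actual definition, so the member names x, y, z and the exact operator signatures below are assumptions made for illustration only.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_add_ps, _mm_sub_ps, _mm_mul_ps

// Hypothetical reconstruction of vect<T>: three packed components.
// The member names x, y, z are assumed, not taken from the question.
template <typename T>
struct vect {
    T x, y, z;
};

// Component-wise difference of two vectors.
inline vect<__m128> operator-(const vect<__m128>& a, const vect<__m128>& b) {
    return { _mm_sub_ps(a.x, b.x), _mm_sub_ps(a.y, b.y), _mm_sub_ps(a.z, b.z) };
}

// In-place component-wise addition.
inline vect<__m128>& operator+=(vect<__m128>& a, const vect<__m128>& b) {
    a.x = _mm_add_ps(a.x, b.x);
    a.y = _mm_add_ps(a.y, b.y);
    a.z = _mm_add_ps(a.z, b.z);
    return a;
}

// In-place component-wise subtraction.
inline vect<__m128>& operator-=(vect<__m128>& a, const vect<__m128>& b) {
    a.x = _mm_sub_ps(a.x, b.x);
    a.y = _mm_sub_ps(a.y, b.y);
    a.z = _mm_sub_ps(a.z, b.z);
    return a;
}

// In-place multiplication of every component by the same packed scalar.
inline vect<__m128>& operator*=(vect<__m128>& a, __m128 s) {
    a.x = _mm_mul_ps(a.x, s);
    a.y = _mm_mul_ps(a.y, s);
    a.z = _mm_mul_ps(a.z, s);
    return a;
}
```

With these definitions, the example function above compiles as written; recent GCC versions may warn about ignored attributes on the __m128 template argument, which does not affect the generated code.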

I actually haven't seen GCC use any of the registers xmm11-xmm15 anywhere in my code yet. What am I doing wrong? I understand that they are callee-saved registers, but that overhead would be fully justified by the simpler loop code.

Comments (2)

你的背包 2024-11-13 22:51:59

Two points:

  • First, you're making a lot of assumptions. Register spilling is pretty cheap on x86 CPUs (due to fast L1 caches, register shadowing, and other tricks), and the 64-bit-only registers are more costly to access (in terms of larger instructions), so it may just be that GCC's version is as fast as, or faster than, the one you want.
  • Second, GCC, like any compiler, does the best register allocation it can. There's no "please do better register allocation" option, because if there were, it'd always be enabled. The compiler isn't trying to spite you. (Register allocation is an NP-complete problem, as I recall, so the compiler will never be able to generate a perfect solution. The best it can do is approximate.)

So, if you want better register allocation, you basically have two options:

  • write a better register allocator, and patch it into GCC, or
  • bypass GCC and rewrite the function in assembly, so you can control exactly which registers are used when.

2024-11-13 22:51:59

Actually, what you see aren't spills; it is GCC operating on a1 and a2 in memory, because it can't know whether they alias each other. If you declare the last two parameters as vect<__m128>& __restrict__, GCC can and will register-allocate a1 and a2.
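A sketch of the suggested change follows, assuming a hypothetical vect<T> with members x, y, z (the question does not show its definition). Only the __restrict__ qualifiers on a1 and a2 come from this answer; whether a given GCC version then keeps everything in registers should be checked in the generated assembly rather than taken for granted.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical stand-in for the question's vect<T>; member names are assumed.
template <typename T>
struct vect {
    T x, y, z;
    vect& operator+=(const vect& o) {
        x = _mm_add_ps(x, o.x); y = _mm_add_ps(y, o.y); z = _mm_add_ps(z, o.z);
        return *this;
    }
    vect& operator-=(const vect& o) {
        x = _mm_sub_ps(x, o.x); y = _mm_sub_ps(y, o.y); z = _mm_sub_ps(z, o.z);
        return *this;
    }
    vect& operator*=(T s) {
        x = _mm_mul_ps(x, s); y = _mm_mul_ps(y, s); z = _mm_mul_ps(z, s);
        return *this;
    }
    vect operator-(const vect& o) const { vect r = *this; r -= o; return r; }
};

// __restrict__ (a GCC extension, also accepted on references) promises the
// compiler that a1 and a2 do not alias each other, so it is free to keep
// both accumulators in registers and store them once after the loop.
void example(vect<__m128> q1, vect<__m128> q2,
             vect<__m128>& __restrict__ a1, vect<__m128>& __restrict__ a2) {
    const __m128 two = _mm_set1_ps(2.0f);
    for (int i = 0; i < 10; i++) {
        vect<__m128> v = q2 - q1;
        a1 += v;
        a2 -= v;
        q2 *= two;
    }
}
```

The promise is the caller's responsibility: passing the same object as both a1 and a2 would be undefined behavior. An alternative that needs no extension is to copy a1 and a2 into local variables before the loop and write them back afterwards.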
