Weird auto-vectorization in GCC, with different results on Godbolt

Posted on 2025-01-21 11:10:29

I'm confused by an auto-vectorization result. The following code addtest.c

#include <stdio.h>
#include <stdlib.h>

#define ELEMS 1024

int
main()
{
  float data1[ELEMS], data2[ELEMS];
  for (int i = 0; i < ELEMS; i++) {
    data1[i] = drand48();
    data2[i] = drand48();
  }
  for (int i = 0; i < ELEMS; i++)
    data1[i] += data2[i];
  printf("%g\n", data1[ELEMS-1]); 
  return 0;
}

is compiled with gcc 11.1.0 by

gcc-11 -O3 -march=haswell -masm=intel -save-temps -o addtest addtest.c

and the add-to loop is auto-vectorized as

.L3:
    vmovaps ymm1, YMMWORD PTR [r12]
    vaddps  ymm0, ymm1, YMMWORD PTR [rax]
    add r12, 32
    add rax, 32
    vmovaps YMMWORD PTR -32[r12], ymm0
    cmp r12, r13
    jne .L3

This is clear: load from data1, load and add from data2, store to data1, and in between, advance the indices.

If I pass the same code to https://godbolt.org, select x86-64 gcc-11.1 and options -O3 -march=haswell, I get the following assembly code:

.L3:
        vmovaps ymm1, YMMWORD PTR [rbp-4112+rax]
        vaddps  ymm0, ymm1, YMMWORD PTR [rbp-8208+rax]
        vmovaps YMMWORD PTR [rbp-8240], ymm1
        vmovaps YMMWORD PTR [rbp-8208+rax], ymm0
        add     rax, 32
        cmp     rax, 4096
        jne     .L3

One surprising thing is the different address handling, but the thing that confuses me completely is the additional store to [rbp-8240]. This location is never used again, as far as I can see.

If I select gcc 7.5 on godbolt, the superfluous store disappears (but from 8.1 upwards, it is produced).

So my questions are:

Why is there a difference between my compiler and godbolt (different address handling, superfluous store)?
What does the superfluous store do?

Thanks a lot for your help!


Answer by 所有深爱都是秘密, 2025-01-28 11:10:30

The difference-maker is -fpie, which is on by default in most distros but not Godbolt. This doesn't make a lot of sense, but compilers are complex pieces of machinery, not "smart".

It's not specific to -march=haswell or AVX either; the same difference happens with just -O3.

Godbolt configures GCC with simpler options than distros, e.g. without default-pie, and without -fstack-protector-strong. To match Godbolt locally, use at least -fno-pie -no-pie -fno-stack-protector. There might be others I'm forgetting about.

IDK why this would trigger or avoid a missed-optimization, but I can confirm it does on my Arch GNU/Linux system with GCC 11.1.

Locally with gcc -O3 -march=haswell -fno-stack-protector -fno-pie (and -masm=intel -S -o- vec.c | less), it matches Godbolt:

.L3:
        vmovaps ymm1, YMMWORD PTR [rbp-4112+rax]
        vaddps  ymm0, ymm1, YMMWORD PTR [rbp-8208+rax]
        vmovaps YMMWORD PTR [rbp-8240], ymm1
        vmovaps YMMWORD PTR [rbp-8208+rax], ymm0
        add     rax, 32
        cmp     rax, 4096
        jne     .L3

But with the distro-configured GCC defaults, plain -O3 -march=haswell gives:

.L3:
        vmovaps ymm1, YMMWORD PTR [r12]
        vaddps  ymm0, ymm1, YMMWORD PTR [rax]
        add     r12, 32
        add     rax, 32
        vmovaps YMMWORD PTR -32[r12], ymm0
        cmp     r12, r13
        jne     .L3

The same missed optimization happens without -march=haswell: we get a movaps XMMWORD PTR [rsp], xmm1 store to a fixed address inside the loop. (Without AVX there is no 32-byte vector to spill, so GCC doesn't need to over-align the stack and doesn't use RBP as a frame pointer.)

For no apparent reason, using -fpie on the Godbolt compiler explorer gets GCC to use two pointer increments instead of indexed addressing modes, which also avoids the redundant store (producing the same asm you get locally). For arrays in static storage, -fpie forces GCC to do that anyway, because [arr + rax] would require the symbol address as a 32-bit absolute; see the related question "32-bit absolute addresses no longer allowed in x86-64 Linux?".
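
To look at the static-storage case on its own, a minimal sketch might help. It assumes file-scope arrays and a helper called add_to, which is a hypothetical name and not part of the original addtest.c. It has no main on purpose, so the compiler output contains little besides the loop:

```c
#include <stddef.h>

#define ELEMS 1024

/* File-scope arrays have static storage duration,
   unlike the local arrays in addtest.c. */
float data1[ELEMS], data2[ELEMS];

/* Hypothetical helper (not in the original post):
   the same add-to loop, applied to the static arrays. */
void add_to(void)
{
    for (size_t i = 0; i < ELEMS; i++)
        data1[i] += data2[i];
}
```

Compiling this once with gcc -O3 -march=haswell -masm=intel -fno-pie -S -o- and once with -fpie in place of -fno-pie should let you compare the addressing modes in the loop side by side. The exact output may vary with the GCC version and how it was configured.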

You can and should report this on GCC's bugzilla with the keyword "missed-optimization".
