Weird auto-vectorization in GCC, with different results on Godbolt
I'm confused by an auto-vectorization result. The following code addtest.c
#include <stdio.h>
#include <stdlib.h>
#define ELEMS 1024
int
main()
{
    float data1[ELEMS], data2[ELEMS];
    for (int i = 0; i < ELEMS; i++) {
        data1[i] = drand48();
        data2[i] = drand48();
    }
    for (int i = 0; i < ELEMS; i++)
        data1[i] += data2[i];
    printf("%g\n", data1[ELEMS-1]);
    return 0;
}
is compiled with gcc 11.1.0 by
gcc-11 -O3 -march=haswell -masm=intel -save-temps -o addtest addtest.c
and the add-to loop is auto-vectorized as
.L3:
vmovaps ymm1, YMMWORD PTR [r12]
vaddps ymm0, ymm1, YMMWORD PTR [rax]
add r12, 32
add rax, 32
vmovaps YMMWORD PTR -32[r12], ymm0
cmp r12, r13
jne .L3
This is clear: load from data1, load and add from data2, store to data1, and in between, advance the indices.
If I pass the same code to https://godbolt.org, select x86-64 gcc-11.1 and options -O3 -march=haswell, I get the following assembly code:
.L3:
vmovaps ymm1, YMMWORD PTR [rbp-4112+rax]
vaddps ymm0, ymm1, YMMWORD PTR [rbp-8208+rax]
vmovaps YMMWORD PTR [rbp-8240], ymm1
vmovaps YMMWORD PTR [rbp-8208+rax], ymm0
add rax, 32
cmp rax, 4096
jne .L3
One surprising thing is the different address handling, but the thing that confuses me completely is the additional store to [rbp-8240]. This location is never used again, as far as I can see.
If I select gcc 7.5 on godbolt, the superfluous store disappears (but from 8.1 upwards, it is produced).
So my questions are:
- Why is there a difference between my compiler and godbolt (different address handling, superfluous store)?
- What does the superfluous store do?
Thanks a lot for your help!
Comments (1)
The difference-maker is -fpie, which is on by default in most distros but not on Godbolt. This doesn't make a lot of sense, but compilers are complex pieces of machinery, not "smart".
It's not specific to -march=haswell or AVX either; the same difference happens with just -O3.
Godbolt configures GCC with simpler options than distros, e.g. without default-pie, and without -fstack-protector-strong. To match Godbolt locally, use at least -fno-pie -no-pie -fno-stack-protector. There might be others I'm forgetting about.
I don't know why this would trigger or avoid a missed optimization, but I can confirm it does on my Arch GNU/Linux system with GCC 11.1.
Locally with gcc -O3 -march=haswell -fno-stack-protector -fno-pie (and -masm=intel -S -o- vec.c | less), the output matches Godbolt. With the distro-configured GCC defaults and just -O3 -march=haswell, it does not.
The same missed-opt happens without -march=haswell; we get a movaps XMMWORD PTR [rsp], xmm1 store to a fixed address inside the loop. (Since GCC doesn't need to over-align the stack to spill a 32-byte vector, it didn't use RBP as a frame pointer.)
For no apparent reason, using -fpie on the Godbolt compiler explorer gets GCC to use two pointer increments instead of indexed addressing modes, also avoiding the redundant store. (Making the same asm you get locally.) -fpie forces GCC to do that for arrays in static storage (because [arr + rax] would require the symbol address as a 32-bit absolute; see "32-bit absolute addresses no longer allowed in x86-64 Linux?").
You can and should report this on GCC's bugzilla with the keyword "missed-optimization".