Do compilers take advantage of the unspecified order of function argument evaluation?

Published 2024-07-14 20:49:27


Okay, I'm aware that the standard dictates that a C++ implementation may choose in which order arguments of a function are evaluated, but are there any implementations that actually 'take advantage' of this in a scenario where it would actually affect the program?

Classic Example:

int i = 0;
foo(i++, i++);

Note: I'm not looking for someone to tell me that the order of evaluation can't be relied on, I'm well aware of that. I'm only interested in whether any compilers actually do evaluate out of a left-to-right order because my guess would be that if they did lots of poorly written code would break (rightly so, but they would still probably complain).


Comments (7)

笔落惊风雨 2024-07-21 20:49:27


It depends on the argument type, the called function's calling convention, the architecture and the compiler. On x86, the Pascal calling convention evaluates arguments left to right, whereas in the C calling convention (__cdecl) it is right to left. Most programs which run on multiple platforms do take the calling conventions into account to avoid surprises.

There is a nice article on Raymond Chen's blog if you are interested. You may also want to take a look at the Stack and Calling section of the GCC manual.

Edit: So long as we are splitting hairs: my answer treats this not as a language question but as a platform one. The language standard does not guarantee or prefer one order over the other and leaves it unspecified. Note the wording. It does not say this is undefined. Unspecified in this sense means something you cannot count on, non-portable behavior. I don't have the C spec/draft handy, but it should be similar to this passage from my n2798 draft (C++):

Certain other aspects and operations of the abstract machine are described in this International Standard as unspecified (for example, order of evaluation of arguments to a function). Where possible, this International Standard defines a set of allowable behaviors. These define the nondeterministic aspects of the abstract machine. An instance of the abstract machine can thus have more than one possible execution sequence for a given program and a given input.

陌上青苔 2024-07-21 20:49:27


I found the answer in the C++ standard.

Paragraph 5.2.2/8:

The order of evaluation of arguments is unspecified. All side effects of argument expression evaluations take effect before the function is entered. The order of evaluation of the postfix expression and the argument expression list is unspecified.

In other words, it depends only on the compiler.

可可 2024-07-21 20:49:27


Read this

It's not an exact copy of your question, but my answer (and a few others) cover your question as well.

There are very good optimization reasons why the compiler might not simply choose right-to-left, but instead interleave the evaluations.

The standard doesn't even guarantee a sequential ordering. It only guarantees that when the function gets called, all arguments have been fully evaluated.

And yes, I have seen a few versions of GCC do exactly this. For your example, foo(0,0) would be called, and i would be 2 afterwards. (I can't give you the exact version number of the compiler. It was a while ago - but I wouldn't be surprised to see this behavior pop up again. It's an efficient way to schedule instructions)

不离久伴 2024-07-21 20:49:27


All arguments are evaluated. The order is not defined (as per the standard). But all implementations of C/C++ (that I know of) evaluate function arguments right to left. EDIT: Clang is an exception (see comment below).

I believe the right-to-left evaluation order is very, very old (since the first C compilers), certainly from well before C++ was invented, and most implementations of C++ kept the same evaluation order because early C++ implementations simply translated into C.

There are some technical reasons for evaluating function arguments right-to-left. In stack architectures, arguments are typically pushed onto the stack. In C/C++, you can call a function with more arguments than actually specified -- the extra arguments are simply ignored. If arguments are evaluated left-to-right, and pushed left-to-right, then the stack slot right under the stack pointer will hold the last argument, and there is no way for the function to get at the offset of any particular argument (because the actual number of arguments pushed depends on the caller).

In a right-to-left push order, the stack slot right under the stack pointer will always hold the first argument, and the next slot holds the second argument etc. Argument offsets will always be deterministic for the function (which may be written and compiled elsewhere into a library, separately from where it is called).

Now, right-to-left push order does not mandate right-to-left evaluation order, but in early compilers memory was scarce. In right-to-left evaluation order, the same stack can be used in place (essentially, after evaluating an argument -- which may be an expression or a function call! -- the return value is already at the right position on the stack). In left-to-right evaluation, the argument values must be stored separately and then pushed back onto the stack in reverse order.

猫九 2024-07-21 20:49:27


Last time I saw differences was between VS2005 and GCC 3.x on x86 hardware in 2007. So it's (was?) a very likely situation, and I never rely on evaluation order anymore. Maybe it's better now.

关于从前 2024-07-21 20:49:27


I expect that most modern compilers would attempt to interleave the instructions computing the arguments, given that they are required by the C++ standard to be independent and thus lack any interdependencies. Doing this should help to keep a deeply-pipelined CPU's execution units full and thereby increase throughput. (At least I would expect that a compiler that claims to be an optimising compiler would do so when optimisation flags are given.)

德意的啸 2024-07-21 20:49:27


TL:DR: no, GCC and clang in practice don't choose an eval order based on what would lead to more efficient asm, in cases of unsequenced or indeterminately-sequenced expressions. (Even for expressions without side-effects, at least in some cases.)


First of all, foo(i++, i++) has undefined behaviour in C++14 and earlier, with two unsequenced modifications of i, no different from i++ + i++ as far as the ISO standard is concerned; real implementations might happen to do something consistent in the case of function args.

C++17 changed function args from unsequenced to indeterminately sequenced, so it has to be either arg1 = i++; arg2 = i++; foo(arg1,arg2) or the reverse, at the compiler's choice. What are the evaluation order guarantees introduced by C++17?


GCC and clang fail hard at optimizing based on choice of evaluation order in unsequenced or indeterminately-sequenced cases, missing even some very basic stuff like saving 3 separate local vars across a call vs. just saving the a+b+c result for return baz(a+b+c, foo(x));; see quux in the Godbolt link that has code for other parts of this question. (And that's not even a correctness issue, just an implementation detail that gives the same result in the abstract machine either way.)

GCC bug#70408 reusing the same call-preserved register would give smaller code in some cases shows a case of unsequenced evaluation that doesn't have UB, where a different choice of evaluation order could let the compiler save/restore fewer registers if it was smart enough. Apparently per Andrew Pinski's reply, teaching GCC to consider a different eval order would be really hard, at least for function calls.

// __attribute__((const)) //  optional: promises no side-effects and not even reading anything except args
int foo(int);  // not inlineable

int bar(int a) {
  return foo(a+2) + 5 * foo (a);
}
# first function args goes in RDI
foo_gcc:  # current GCC13.2 is the same as GCC6 when I reported the missed-optimization
    pushq   %rbp
    pushq   %rbx               # save two call-preserved regs
    movl    %edi, %ebx
    leal    2(%rdi), %edi        # lea is worse than  add $2, %edi; separate missed opt
    subq    $8, %rsp           # and adjust the stack again for alignment
    call    foo                # foo(a+2)
    movl    %ebx, %edi
    movl    %eax, %ebp
    call    foo                # foo(a)
    addq    $8, %rsp
    leal    (%rax,%rax,4), %eax   # eax = 4*eax + eax = 5*eax
    popq    %rbx
    addl    %ebp, %eax
    popq    %rbp
    ret

vs. by hand, calling in the other order and using LEA to add while copying. (lea into the same register is slightly worse than add-immediate on some CPUs, e.g. Intel before Ice Lake, but it's always worth using when it saves one or more instruction, even just a mov.)

foo_handwritten:
    push    %rbx
    lea     2(%rdi), %ebx          # stash ebx=a+2
    call    foo                    # foo(a)
    mov     %ebx, %edi
    lea     (%rax,%rax,4), %ebx    # reuse ebx to stash 5*foo(a)
    call    foo                    # foo(a+2)
    add     %ebx, %eax
    pop     %rbx
    ret

If we look at just function-arg order including possible side-effects, either on different variables to avoid UB, or in C++17 so they're indeterminately sequenced, we can get UB-free code that's meaningful to talk about.

Stuff like foo(++*p, ++*q) is interesting: pre-C++17, a compiler could assume int *p,*q don't alias each other, since that would lead to UB from unsequenced side-effects. So e.g. it could do both loads before either increment-and-store (footnote 1). (Especially useful when scheduling instructions for an in-order machine to better hide load latency.) But it looks like compilers didn't actually do that.

In calling conventions with register args, the outgoing args need to end up in the same registers where we received incoming args. (Also true with stack args, but in that case you'd have to load into registers before use anyway, and then overwrite the incoming args before a tail-call jump as well as storing to the pointed-to object.)

So a function like this will need some register copying instructions, since it can't just load *p into RDI (on x86-64 for example), it needs the pointer around afterwards to store the incremented value back. But it also needs the incremented value to end up in the register that was originally holding the pointer.

int baz(int,int);      // not visible to the compiler for inlining

int pointers(int *p, int *q){   // in RDI, RSI in the x86-64 System V calling convention
    return baz(++*p, ++*q);     // int args in EDI, ESI, the low 32 bits of RDI,RSI
}

This problem is symmetric; neither evaluation order has any advantage. And each dereferenced value has to land in the register that held its pointer, rather than a different arg. baz(++*q, ++*p) to make them opposite would allow saving an instruction, but it's still symmetric so whichever one we evaluate first, the register we want to eval into is occupied.

We can add a dummy argument so incoming and outgoing args only overlap in one register, and order them so neither case needs the integer in the same register that held a pointer to it. (Because that would necessitate either copying the pointer or loading somewhere else and then copying the integer after the store.)

// test cases that allow more efficient asm with one eval order than the other
int pointers2(int dummy1, int *q, int *p){  // q then p
    return baz(++*q, ++*p);                 // q then p - 2nd arg clashes with incoming
}

int pointers3(int *p, int dummy1, int *q){  // p then q
    return baz(++*q, ++*p);                 // q then p - 1st arg clashes with incoming
}

With this setup, one evaluation order can avoid needing any extra register-copy instructions. The other one can't, because it wants to evaluate ++*q or ++*p into a register that's still holding the other pointer. Compilers appear to use a fixed evaluation order, or at least don't use their freedom of choice to their advantage.

GCC on x86-64 goes right-to-left, making pointers2 worse because it wants to compute the outgoing ++*p arg while the incoming q arg is still live in RSI. GCC targeting ARM goes left-to-right for both these functions, handling pointers2 better. So it's not always consistent even within the same compiler, and may be due to some arbitrary internals. None of these involve any stack args, although historically the only x86 calling convention GCC supported was 32-bit x86 with purely stack args, which could perhaps explain right-to-left being baked in to GCC's x86 internals. (I haven't tested other cases to see if it's always consistent, but some older answers here report that it was.)

Clang for x86-64 goes left-to-right (like ARM GCC), so pointers2 avoids any wasted instructions; RSI is no longer needed by the time it's ready to compute ++*p there.

# Actual GCC13 code-gen for the good version
# p in RDI, q in RDX
pointers3(int*, int, int*):
        mov     eax, DWORD PTR [rdi]          # *p
        lea     esi, [rax+1]                  # 1 + *p   But LEA saves the day
        mov     DWORD PTR [rdi], esi          # ++*p

        mov     eax, DWORD PTR [rdx]          # *q 
        lea     edi, [rax+1]                  # eval into EDI, the first arg
        mov     DWORD PTR [rdx], edi

        jmp     baz(int, int)                 # tailcall

It could have loaded into ESI in the first place and used add instead of lea to copy+add (footnote 2). But no wasted mov instructions; GCC's right-to-left evaluation order happened to be good for that one. Unlike the other function:

# Actual GCC13 code-gen for the sub-optimal version
# q in RSI, p in RDX
pointers2(int, int*, int*):
        mov     ecx, DWORD PTR [rdx]     # load *p
        mov     rax, rsi               ##### copy q to RAX, this is the insn we could avoid
        lea     esi, [rcx+1]             # eval ++*p into outgoing 2nd arg reg
        mov     DWORD PTR [rdx], esi     # and store it back to *p

        mov     ecx, DWORD PTR [rax]     # using the copy of q
        lea     edi, [rcx+1]             # eval ++*q into 1st arg
        mov     DWORD PTR [rax], edi

        jmp     baz(int, int)            # tailcall

Picking the other order to evaluate args would have avoided the mov rax, rsi instruction the way pointers3 did, saving code-size and letting out-of-order exec see a little bit farther (since it won't take up an entry in the ROB = Reorder Buffer.)

# Hand-written version that GCC *could* have made
#  with different semantics if p==q.  Or equivalent with __restrict
# q in RSI, p in RDX
pointers2(int, int*, int*):
        mov     edi, [rsi]
        add     edi, 1                # eval ++*q into 1st arg
        mov     [rsi], edi

        mov     esi, [rdx]
        add     esi, 1                # eval ++*p into outgoing 2nd arg
        mov     [rdx], esi

        jmp     baz(int, int)         # tailcall

(DWORD PTR is implied by the other operand being a 32-bit register so I prefer to omit it. That's just a formatting difference between the examples, the key difference is that this is 7 instructions total instead of 8, and 3 bytes smaller in code-size. The instructions I used are all at least as cheap on all CPUs.)

Generally one mov doesn't matter in a whole program, but compilers should still not waste instructions. This example should be enough to conclude that GCC and clang don't look for such optimizations in general, even for cases where there might be more savings. (Like with function calls as args.)


Footnote 1:

In C++17 the p==q case would avoid UB, but the compiler's choice of which increment to do first would affect the function args. So you shouldn't write code like baz(++*p, ++*q) unless you know they're not equal.

You can even make this promise to the compiler with int *__restrict p, int *__restrict q, or if they're different types then with the default -fstrict-aliasing, GCC and clang will infer they don't alias. e.g. int *p, long long *q.

__restrict doesn't help with the missed-optimization test-case, although GCC will then do both loads before either store, always on ARM or just for pointers2 on x86-64. In-order exec is common on ARM cores, so hiding load latency is more useful. Even on x86-64, stores with unknown addresses (because out-of-order exec doesn't have the address ready yet) are better if you can wait until after the loads, so the CPU doesn't have to predict whether it needs to store-forward or not.

GCC's ARM code-gen probably would have been better with the first load as the first instruction, instead of copying a pointer to another register; -mcpu=cortex-a53 vs. -mcpu=cortex-a76 does make that different tuning / instruction-scheduling choice. Also, it ends up spending an extra mov in both versions of the function for the non-aliasing case. That could be worth it to avoid load latency bottlenecks on in-order cores, if it's unavoidable, but I'm not sure it is. e.g.

This part is getting into the weeds of instruction scheduling for ARM, kind of off topic
@ hand-written for the version with  __restrict  pointers
@ GCC13.2  starts with  mov r3, r0  then 2 loads / 2 adds / 2 stores
pointers3(int*, int, int*):
        ldr     r1, [r0]
        ldr     r3, [r2]       @ Eventually want it in R0, but load elsewhere
        adds    r1, r1, #1     @ first use is 2 insns after the load, same as GCC
        str     r1, [r0]       @ store right after add; GCC left a gap
        adds    r0, r3, #1     @ adds into a different low reg is still a 16-bit instruction with small-enough immediate
        str     r0, [r2]
        b       baz(int, int)

This might still be pretty decent for in-order ARM. The second LDR and first ADDS can go through the pipeline in the same cycle since they're independent. But not the STR in the same cycle as the ADDS that produces its input, unless the memory stage is later in the pipeline than exec and a superscalar CPU supports that forwarding. (P5 Pentium didn't, IIRC.)
Then the next cycle the first STR and second ADDS can go, then the second STR and the B tailcall jump.

This doesn't seem much if any worse than GCC's mov/load/load; then after load results are ready, adds/adds; store/store (or probably separate since most pipelines won't have 2 store units); branch. Or maybe it is, and my instruction-saving version would only be worth it when tuning for OoO exec cores like -mcpu=cortex-a76. Regardless, if we need a mov, I assume it would be better to do it after the 2 loads, before the 2 stores, to get more done before stalling to wait for load-use latency. Or maybe load / pointer-copy / load so we can still dual-issue on the first cycle on a pipeline with only one load unit. Anyway, getting off topic here into ARM instruction scheduling.

Fun fact: this scheduling is the same pattern clang uses for RISC-V: load/load / add / store / add / store. Clang for other ISAs still goes with left-to-right eval order. (At least into registers; I didn't check with so many args that it would need a few stack args. Hopefully it would eval stack args first so it has spare registers to eval other reg args. Or at least schedule the instructions that way, if they're independent.)

If you put if (p==q) return 0; into the functions, even GCC for x86-64 will do both loads first for pointers2. (Still with an extra mov which it could have avoided by doing an lea esi, [rax+1] after the first store, instead of add eax, 1 before and mov esi, eax after.)


Footnote 2: GCC uses LEA where it could have used add

With return baz(*q <<= 7, *p *= 4); (Godbolt), GCC uses lea esi, [0+rax*4] which is 7 bytes (opcode + modrm + SIB + disp32=0), which is 4 bytes longer than if it had loaded into ESI in the first place for shl esi, 2.

But it does load into EDI directly so it can sal edi, 7. And with *q *= 12385, ++*p, it does use a memory-source imul-immediate to load-and-multiply into EDI.

I guess its cost model incorrectly thinks that if LEA is possible at all, it's as cheap as anything else, so register allocation doesn't bother to work harder to get the inputs into the right register. A simple change to that heuristic might make worse code in other cases, though, like not using LEA in a case where it could save instructions, depending on what the actual heuristics are.
