Why is the XOR swap optimized into a normal swap using the MOV instruction?

Posted on 2025-01-12 15:28:57


While experimenting in Compiler Explorer, I tried out the following overflow-free function for calculating the average of two unsigned 32-bit integers:

uint32_t average_1(uint32_t a, uint32_t b)
{
    if(a < b){
        a ^= b;
        b ^= a;
        a ^= b;
    }
    return b + (a - b) / 2;
}

which gets compiled into this (the same with -O1, -O2, or -O3 optimization enabled):

average_1:
        cmp     edi, esi
        jnb     .L2
        mov     eax, edi
        mov     edi, esi
        mov     esi, eax
.L2:
        sub     edi, esi
        shr     edi
        lea     eax, [rdi+rsi]
        ret

in which the swapping part of the code is optimized to use mov instructions with one additional register.

I've read through these questions:

Why don't people use xor swaps?
Cost of swapping variables through mov, xor

and took away that:

  • using an XOR swap is harder to understand and explain, and
  • the mov-instruction method does not hurt execution speed by requiring more memory reads.

But in this case, since no memory is involved and only the eax, esi, and edi registers are used, I thought the compiled assembly code could also have been generated as:

average_1:
        cmp     edi, esi
        jnb     .L2
        xor     edi, esi
        xor     esi, edi
        xor     edi, esi

...

With no memory reads and the same number of instructions, I don't see any downside, and it feels odd that the swap gets rewritten.
Clearly there is something I did not think through, but what is it?


Edit: To be clear, my question here is not "why not use the XOR swap" but rather:
"when an XOR swap is used, and it doesn't seem to affect execution speed in this particular case, why is it still optimized away?"


Answer (by 虐人心, 2025-01-19 15:28:57):


Clang does the same thing. Probably for compiler-construction and CPU architecture reasons:

  • Disentangling that logic into just a swap may allow better optimization in some cases; definitely something it makes sense for a compiler to do early so it can follow values through the swap.

  • Xor-swap is total garbage for swapping registers, the only advantage being that it doesn't need a temporary. But xchg reg,reg already does that better.


I'm not surprised that GCC's optimizer recognizes the xor-swap pattern and disentangles it to follow the original values. In general, this makes constant-propagation and value-range optimizations possible through swaps, especially for cases where the swap wasn't conditional on the values of the vars being swapped. This pattern-recognition probably happens soon after transforming the program logic to GIMPLE (SSA) representation, so at that point it will forget that the original source ever used an xor swap, and not think about emitting asm that way.
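For example, here is a minimal sketch (my own toy function, not from the question) of what that disentangling buys: once the three xors are recognized as a swap, constant propagation sees straight through them.

unsigned swap_then_use(void)
{
    unsigned a = 3, b = 7;
    a ^= b;             // these three lines are recognized
    b ^= a;             // as a plain swap of a and b
    a ^= b;
    return a;           // a now holds the original b
}
// With the pattern disentangled, GCC can fold this whole function to "return 7;".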

Hopefully sometimes that lets it then optimize down to only a single mov, or two movs, depending on register allocation for the surrounding code (e.g. if one of the vars can move to a new register, instead of having to end up back in the original locations). And whether both variables are actually used later, or only one. Or if it can fully disentangle an unconditional swap, maybe no mov instructions.
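A sketch of the unconditional case (again my own example): the swap disappears entirely, the compiler just feeds the inputs into the subtraction in the opposite order, and at most a single mov is needed to place the result in eax.

#include <stdint.h>

uint32_t sub_swapped(uint32_t a, uint32_t b)
{
    a ^= b;  b ^= a;  a ^= b;   // unconditional swap(a, b)
    return a - b;               // same value as the original b - a
}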

But worst case, three mov instructions needing a temporary register is still better, unless it's running out of registers. I'd guess GCC is not smart enough to use xchg reg,reg instead of spilling something else or saving/restoring another tmp reg, so there might be corner cases where this optimization actually hurts.

(Apparently GCC -Os does have a peephole optimization to use xchg reg,reg instead of 3x mov: PR 92549 was fixed for GCC10. It looks for that quite late, during RTL -> assembly. And yes, it works here: turning your xor-swap into an xchg: https://godbolt.org/z/zs969xh47)

xor-swap has worse latency and defeats mov-elimination

With no memory reads and the same number of instructions, I don't see any downside, and it feels odd that the swap gets rewritten. Clearly there is something I did not think through, but what is it?

Instruction count is only a rough proxy for one of three things that are relevant for perf analysis: front-end uops, latency, and back-end execution ports. (And machine-code size in bytes: x86 machine-code instructions are variable-length.)

It's the same size in machine-code bytes and the same number of front-end uops, but the critical-path latency is worse: for the xor-swap, output a is ready only 3 cycles after the inputs, and output b after 2 cycles, for example.

MOV-swap has at worst 1-cycle and 2-cycle latencies from inputs to outputs, or less with mov-elimination. (Which can also avoid using back-end execution ports, especially relevant for CPUs like IvyBridge and Tiger Lake with a front-end wider than the number of integer ALU ports. And Ice Lake, except Intel disabled mov-elimination on it as an erratum workaround; not sure if it's re-enabled for Tiger Lake or not.)



If you're going to branch, just duplicate the averaging code

GCC's real missed optimization here (even with -O3) is that tail-duplication results in about the same static code size, just a couple extra bytes since these are mostly 2-byte instructions. The big win is that the a<b path then becomes the same length as the other, instead of twice as long to first do a swap and then run the same 3 uops for averaging.

Update: GCC will do this for you with -ftracer (https://godbolt.org/z/es7a3bEPv), optimizing away the swap. (That's only enabled manually or as part of -fprofile-use, not at -O3, so it's probably not a good idea to use it all the time without PGO; it could bloat machine code in cold functions / code-paths.)

Doing it manually in the source (Godbolt: https://godbolt.org/z/78cEEsTTW):

uint32_t average_1_taildup(uint32_t a, uint32_t b)
{
    if(a < b){
        return a + (b - a) / 2;
    }else {
        return b + (a - b) / 2;
    }
}
# GCC11.2 -O3
average_1_taildup:
        cmp     edi, esi
        jnb     .L5
        sub     esi, edi
        shr     esi
        lea     eax, [rsi+rdi]
        ret
.L5:
        sub     edi, esi
        shr     edi
        lea     eax, [rdi+rsi]
        ret

Clang compiles both version 1 and 1_taildup into code using cmov (e.g. cmp / mov / cmovb / cmovb, or making a bit of a mess for the tail-duplication version).
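For reference, a hand-written branchless formulation along the lines of clang's cmov output (a sketch; the function name is mine, and how it actually compiles depends on the compiler):

#include <stdint.h>

uint32_t average_1_cmov(uint32_t a, uint32_t b)
{
    uint32_t lo = a < b ? a : b;   // typically a cmov
    uint32_t hi = a < b ? b : a;   // typically another cmov
    return lo + (hi - lo) / 2;     // hi >= lo, so the subtraction can't wrap
}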

But if you're going to go branchless then your average_3 is superior:

uint32_t average_3(uint32_t a, uint32_t b)
{
    return (a & b) + ((a ^ b) >> 1);
}
# clang 13.0 -O3
average_3:
        mov     eax, esi
        and     eax, edi
        xor     esi, edi
        shr     esi
        add     eax, esi
        ret

Both GCC and clang's versions are only 5 instructions (plus ret), but clang arranges it so the critical-path latency is only 3 cycles (3 single-uop instructions) from either input to the output, even without mov-elimination. (GCC has one chain that's 4 instructions long including a mov.)

Efficient non-overflowing unsigned mean

See also Efficient overflow-immune arithmetic mean in C/C++ - widening to uint64_t can be even cheaper, especially when inlining, on a 64-bit machine. (As discussed in comments under the question, e.g. https://godbolt.org/z/sz53eEYh9 shows code from the existing answers at the time I commented.)
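As a sketch of that widening idea for the uint32_t case (the function name is mine): the 64-bit sum can't overflow, so the average is just an add and a shift.

#include <stdint.h>

uint32_t average_widen(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a + b) / 2);   // 64-bit sum can't overflow
}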

Another good option is this, but usually not as good as widening:

  return (a&b) + (a^b)/2;

If compilers recognized any of these idioms, they could use the asm add/rcr trick, though, which is even more efficient than widening to unsigned __int128 for averaging uint64_t.
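For uint64_t, here is a sketch of that add/rcr idea written with GCC/Clang extended inline asm for x86-64 (a hand-written illustration only; the answer's point is that compilers don't generate this from the portable C idioms):

#include <stdint.h>

static inline uint64_t average_u64_addrcr(uint64_t a, uint64_t b)
{
    __asm__ ("add %[b], %[a]\n\t"   // 65-bit sum: the carry-out lands in CF
             "rcr $1, %[a]"         // rotate the carry back in from the top: (CF:a) >> 1
             : [a] "+r" (a)
             : [b] "r" (b)
             : "cc");
    return a;
}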
