Why does adding an xorps instruction make this function using cvtsi2ss and addss ~5x faster?

Posted 2025-01-23 23:19:42 · 1,255 characters · 2 views · 0 comments


I was messing around with optimizing a function using Google Benchmark, and ran into a situation where my code was unexpectedly slowing down in certain situations. I started experimenting with it, looking at the compiled assembly, and eventually came up with a minimal test case that exhibits the issue. Here's the assembly I came up with that exhibits this slowdown:

    .text
test:
    #xorps  %xmm0, %xmm0
    cvtsi2ss    %edi, %xmm0
    addss   %xmm0, %xmm0
    addss   %xmm0, %xmm0
    addss   %xmm0, %xmm0
    addss   %xmm0, %xmm0
    addss   %xmm0, %xmm0
    addss   %xmm0, %xmm0
    addss   %xmm0, %xmm0
    addss   %xmm0, %xmm0
    retq
    .global test

This function follows GCC/Clang's x86-64 calling convention for the function declaration extern "C" float test(int);. Note the commented-out xorps instruction. Uncommenting this instruction dramatically improves the performance of the function. Testing it on my machine with an i7-8700K, Google Benchmark shows the function without the xorps instruction takes 8.54ns (CPU), while the function with the xorps instruction takes 1.48ns. I've tested this on multiple computers with various OSes, processors, processor generations, and different processor manufacturers (Intel and AMD), and they all exhibit a similar performance difference. Repeating the addss instruction makes the slowdown more pronounced (to a point), and this slowdown still occurs using other instructions here (e.g. mulss) or even a mix of instructions, so long as they all depend on the value in %xmm0 in some way. It's worth pointing out that the performance improvement requires executing xorps on each function call. Sampling the performance with a loop (as Google Benchmark does) with the xorps call outside the loop still shows the slower performance.

Since this is a case where exclusively adding instructions improves performance, this appears to be caused by something really low-level in the CPU. Since it occurs across a wide variety of CPU's, it seems like this must be intentional. However, I couldn't find any documentation that explains why this happens. Does anybody have an explanation for what's going on here? The issue seems to be dependent on complicated factors, as the slowdown I saw in my original code only occurred on a specific optimization level (-O2, sometimes -O1, but not -Os), without inlining, and using a specific compiler (Clang, but not GCC).


Comments (1)

墨小沫ゞ 2025-01-30 23:19:42


cvtsi2ss %edi, %xmm0 merges the float into the low element of XMM0 so it has a false dependency on the old value. (Across repeated calls to the same function, creating one long loop-carried dependency chain.)

xor-zeroing breaks the dep chain, allowing out-of-order exec to work its magic. So you bottleneck on addss throughput (0.5 cycles) instead of latency (4 cycles).

Your CPU is a Skylake derivative, so those are the numbers; earlier Intel CPUs have 3-cycle latency and 1-cycle throughput, using a dedicated FP-add execution unit instead of running addss on the FMA units. https://agner.org/optimize/. Function call/ret overhead probably prevents you from seeing the full 8x expected speedup from the latency * bandwidth product of 8 in-flight addss uops in the pipelined FMA units; you should get that speedup if you remove the xorps dep-breaking from a loop within a single function.


GCC tends to be very "careful" about false dependencies, spending extra instructions (front-end bandwidth) to break them just in case. In code that bottlenecks on the front-end (or where total code size / uop-cache footprint is a factor) this costs performance if the register was actually ready in time anyway.

Clang/LLVM is reckless and cavalier about it, typically not bothering to avoid false dependencies on registers not written in the current function. (i.e. assuming / pretending that registers are "cold" on function entry). As you show in comments, clang does avoid creating a loop-carried dep chain by xor-zeroing when looping inside one function, instead of via multiple calls to the same function.

Clang even uses 8-bit GP-integer partial registers for no reason in some cases where that doesn't save any code-size or instructions vs. 32-bit regs. Usually it's probably fine, but there's a risk of coupling into a long dep chain or creating a loop-carried dependency chain if the caller (or a sibling function call) still has a cache-miss load in flight to that reg when we're called, for example.


See "Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths" for more about how OoO exec can overlap short-to-medium-length independent dep chains. Also related: "Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)" is about unrolling a dot product with multiple accumulators to hide FMA latency.

https://www.uops.info/html-instr/CVTSI2SS_XMM_R32.html has performance details for this instruction across various uarches.


You can avoid this if you can use AVX, with vcvtsi2ss %edi, %xmm7, %xmm0 (where xmm7 is any register you haven't written recently, or which is earlier in a dep chain that leads to the current value of EDI).

As I mentioned in "Why does the latency of the sqrtsd instruction change based on the input? Intel processors":

This ISA design wart is thanks to Intel optimizing for the short term with SSE1 on Pentium III. P3 handled 128-bit registers internally as two 64-bit halves. Leaving the upper half unmodified let scalar instructions decode to a single uop. (But that still gives PIII sqrtss a false dependency). AVX finally lets us avoid this with vsqrtsd %src,%src, %dst at least for register sources if not memory, and similarly vcvtsi2sd %eax, %cold_reg, %dst for the similarly near-sightedly designed scalar int->fp conversion instructions.
(GCC missed-optimization reports: 80586, 89071, 80571.)

If cvtsi2ss/sd had zeroed the upper elements of registers, we wouldn't have this stupid problem / wouldn't need to sprinkle xor-zeroing instructions around; thanks, Intel. (Another strategy is to use SSE2 movd %eax, %xmm0, which does zero-extend, then a packed int->fp conversion which operates on the whole 128-bit vector. This can break even for float, where the int->fp scalar conversion is 2 uops and the vector strategy is 1+1. But not for double, where the int->fp packed conversion costs a shuffle + an FP uop.)

This is exactly the problem that AMD64 avoided by making writes to 32-bit integer registers implicitly zero-extend to the full 64-bit register instead of leaving it unmodified (aka merging). See Why do x86-64 instructions on 32-bit registers zero the upper part of the full 64-bit register? (Writes to 8- and 16-bit registers do cause false dependencies on AMD CPUs, and on Intel since Haswell.)
