Adding a redundant assignment speeds up code when compiled without optimization



I find an interesting phenomenon:

#include <stdio.h>
#include <time.h>

int main() {
    int p, q;
    clock_t s, e;
    s = clock();
    for (int i = 1; i < 1000; i++) {
        for (int j = 1; j < 1000; j++) {
            for (int k = 1; k < 1000; k++) {
                p = i + j * k;
                q = p;  // Removing this line can increase running time.
            }
        }
    }
    e = clock();
    double t = (double)(e - s) / CLOCKS_PER_SEC;
    printf("%lf\n", t);
    return 0;
}

I compile the code with GCC 7.3.0 on an i5-5257U (macOS), without any optimization. Here is the average run time over 10 runs:

[image: table of average run times for the two versions, not recoverable from this copy]
Other people have also tested this case on other Intel platforms and gotten the same result.
I post the assembly generated by GCC here. The only difference between the two listings is that, just before addl $1, -12(%rbp), the faster one has two more instructions:

movl    -44(%rbp), %eax
movl    %eax, -48(%rbp)

So why does the program run faster with such an assignment?


Peter's answer is very helpful. Tests on an AMD Phenom II X4 810 and an ARMv7 processor (BCM2835) show the opposite result, which supports the idea that the store-forwarding speedup is specific to some Intel CPUs.
BeeOnRope's comments and advice drove me to rewrite the question. :)
The core of this question is an interesting phenomenon related to processor architecture and assembly, so I think it may be worth discussing.



TL:DR: Sandybridge-family store-forwarding has lower latency if the reload doesn't try to happen "right away". Adding useless code can speed up a debug-mode loop because loop-carried latency bottlenecks in -O0 anti-optimized code almost always involve store/reload of some C variables.
Other examples of this slowdown in action: hyperthreading, calling an empty function, accessing vars through pointers.
And apparently also on low-power Goldmont, unless there's a different cause there for an extra load helping.

None of this is relevant for optimized code. Bottlenecks on store-forwarding latency can occasionally happen, but adding useless complications to your code won't speed it up.


You're benchmarking a debug build, which is basically useless. They have different bottlenecks than optimized code, not a uniform slowdown.


But obviously there is a real reason for the debug build of one version running slower than the debug build of the other version. (Assuming you measured correctly and it wasn't just CPU frequency variation (turbo / power-saving) leading to a difference in wall-clock time.)

If you want to get into the details of x86 performance analysis, we can try to explain why the asm performs the way it does in the first place, and why the asm from an extra C statement (which with -O0 compiles to extra asm instructions) could make it faster overall. This will tell us something about asm performance effects, but nothing useful about optimizing C.

You haven't shown the whole inner loop, only some of the loop body, but gcc -O0 is pretty predictable. Every C statement is compiled separately from all the others, with all C variables spilled / reloaded between the blocks for each statement. This lets you change variables with a debugger while single-stepping, or even jump to a different line in the function, and have the code still work. The performance cost of compiling this way is catastrophic. For example, your loop has no side-effects (none of the results are used) so the entire triple-nested loop can and would compile to zero instructions in a real build, running infinitely faster. Or more realistically, running 1 cycle per iteration instead of ~6 even without optimizing away or doing major transformations.
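
You can check that last point yourself (the file name here is arbitrary): compiling the program with

gcc -O3 -S loop.c

and reading loop.s shows the triple loop is gone entirely; only the clock() calls and the printf survive, because p and q are never used for anything visible.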


The bottleneck is probably the loop-carried dependency on k, with a store/reload and an add to increment. Store-forwarding latency is typically around 5 cycles on most CPUs. And thus your inner loop is limited to running once per ~6 cycles, the latency of memory-destination add.
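
As a rough sanity check (assuming the i5-5257U runs near its 2.7 GHz base clock, and ignoring outer-loop overhead): the inner statement runs 999^3 ≈ 1.0 × 10^9 times, so ~6 cycles per iteration predicts about 6.0 × 10^9 / 2.7 × 10^9 ≈ 2.2 seconds, and ~5 cycles predicts about 1.85 seconds. Measured times in that ballpark, with the longer version faster, would fit this explanation.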

If you're on an Intel CPU, store/reload latency can actually be lower (better) when the reload can't try to execute right away. Having more independent loads/stores in between the dependent pair (here, the extra movl load of p and store to q) may explain it in your case. See Loop with function call faster than an empty loop.

So with more work in the loop, that addl $1, -12(%rbp), which can only sustain a throughput of one per 6 cycles when run back-to-back, might instead only create a bottleneck of one iteration per 4 or 5 cycles.

This effect apparently happens on Sandybridge and Haswell (not just Skylake), according to measurements from a 2013 blog post, so yes, this is the most likely explanation on your Broadwell i5-5257U, too. It appears that this effect happens on all Intel Sandybridge-family CPUs.


Without more info on your test hardware, compiler version (or asm source for the inner loop), and absolute and/or relative performance numbers for both versions, this is my best low-effort guess at an explanation. Benchmarking / profiling gcc -O0 on my Skylake system isn't interesting enough to actually try it myself. Next time, include timing numbers.
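
For example, a minimal repeat-and-report harness along these lines (a sketch; the run count and labels are arbitrary, and it should be compiled with -O0 like the original) makes warm-up and turbo effects show up as run-to-run variation instead of disappearing into one average:

#include <stdio.h>
#include <time.h>

/* Sketch of a repeat-and-report harness: printing every run makes
   turbo / power-saving warm-up visible as run-to-run variation. */
static double run_once(void) {
    int p, q;
    clock_t s = clock();
    for (int i = 1; i < 1000; i++)
        for (int j = 1; j < 1000; j++)
            for (int k = 1; k < 1000; k++) {
                p = i + j * k;
                q = p;  /* the statement under test */
            }
    (void)p; (void)q;   /* silence unused-variable warnings */
    return (double)(clock() - s) / CLOCKS_PER_SEC;
}

int main(void) {
    for (int r = 0; r < 10; r++)
        printf("run %d: %f s\n", r, run_once());
    return 0;
}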


The latency of the stores/reloads for all the work that isn't part of the loop-carried dependency chain doesn't matter, only the throughput. The store queue in modern out-of-order CPUs does effectively provide memory renaming, eliminating write-after-write and write-after-read hazards from reusing the same stack memory for p being written and then read and written somewhere else. (See https://en.wikipedia.org/wiki/Memory_disambiguation#Avoiding_WAR_and_WAW_dependencies for more about memory hazards specifically, and this Q&A for more about latency vs. throughput and reusing the same register / register renaming)

Multiple iterations of the inner loop can be in flight at once, because the memory-order buffer (MOB) keeps track of which store each load needs to take data from, without requiring a previous store to the same location to commit to L1D and get out of the store queue. (See Intel's optimization manual and Agner Fog's microarch PDF for more about CPU microarchitecture internals. The MOB is a combination of the store buffer and load buffer)


Does this mean adding useless statements will speed up real programs? (with optimization enabled)

In general, no, it doesn't. Compilers keep loop variables in registers for the innermost loops. And useless statements will actually optimize away with optimization enabled.
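
For instance, in a hypothetical variant where the result is actually used (so -O3 can't delete the loop entirely), p and q still live in registers and the dead copy is removed, leaving nothing for store-forwarding effects to act on:

#include <stdio.h>

int main(void) {
    unsigned sum = 0;
    for (int i = 1; i < 1000; i++)
        for (int j = 1; j < 1000; j++)
            for (int k = 1; k < 1000; k++) {
                int p = i + j * k;
                int q = p;   /* dead copy: -O3 deletes it; p stays in a register */
                sum += q;
            }
    printf("%u\n", sum);     /* using sum keeps the loops from disappearing */
    return 0;
}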

Tuning your source for gcc -O0 is useless. Measure with -O3, or whatever options the default build scripts for your project use.

Also, this store-forwarding speedup is specific to Intel Sandybridge-family, and you won't see it on other microarchitectures like Ryzen, unless they also have a similar store-forwarding latency effect.


Store-forwarding latency can be a problem in real (optimized) compiler output, especially if you didn't use link-time optimization (LTO) to let tiny functions inline, especially functions that pass or return anything by reference (so it has to go through memory instead of registers). Mitigating the problem may require hacks like volatile if you really want to just work around it on Intel CPUs, possibly making things worse on some other CPUs. See the discussion in comments.
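
As a hedged sketch of that by-reference case (the function and file names are made up for illustration): without LTO, a small function in another translation unit that returns its result through a pointer forces a store in the callee and a reload in the caller, while the by-value version keeps the result in a register the whole way:

/* helpers.c -- a separate translation unit, so no inlining without LTO */
void add_ref(int a, int b, int *out) { *out = a + b; }   /* result stored to memory */
int  add_val(int a, int b) { return a + b; }             /* result returned in a register */

/* caller.c */
void add_ref(int a, int b, int *out);
int  add_val(int a, int b);

int use(int x, int y) {
    int r;
    add_ref(x, y, &r);          /* store in the callee, reload of r here: a trip through memory */
    return r + add_val(x, y);   /* by-value result never touches memory */
}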
