Is it possible to tell the branch predictor how likely it is to follow the branch?

Posted 2024-08-13 08:11:39

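Just to make it clear, I'm not going for any sort of portability here, so any solution that ties me to a certain box is fine.

Basically, I have an if statement that evaluates to true 99% of the time, and I am trying to eke out every last clock of performance. Can I issue some sort of compiler command (using GCC 4.1.2 and the x86 ISA, if it matters) to tell the branch predictor that it should cache for that branch?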


牵强ㄟ 2024-08-20 08:11:39

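Yes, but it will have no effect. The exceptions are older (obsolete) architectures before Netburst, and even there it does nothing measurable.

Intel introduced a "branch hint" opcode with the Netburst architecture, and some older architectures use a default static branch prediction for cold jumps (backward branches predicted taken, forward branches predicted not taken). GCC implements this with __builtin_expect (x, prediction), where prediction is typically 0 or 1.
On all newer processor architectures (>= Core 2), the opcode emitted by the compiler is ignored. The small corner case where it actually does something is a cold jump on the old Netburst architecture. Intel now recommends against using static branch hints, probably because they consider the increase in code size more harmful than the possible marginal speedup.

Apart from the branch hint, which is useless to the predictor, __builtin_expect does have its uses: the compiler may reorder the code to improve cache usage or save memory.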

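There are several reasons why it doesn't work as expected.

The processor can predict small loops (n < 64) perfectly.
The processor can predict small repeating patterns (n ~ 7) perfectly.
The processor itself can estimate the probability of a branch at run time better than the compiler or programmer can at compile time.
The predictability of a branch (= the probability that it will be predicted correctly) is far more important than the probability that the branch is taken. Unfortunately, this is highly architecture-dependent, and predicting the predictability of a branch is notoriously hard.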

Read more about the inner workings of branch prediction in Agner Fog's manuals.
See also the gcc mailing list.
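
To make the difference between "probability taken" and "predictability" concrete, here is a minimal sketch of a toy predictor. It is not how any real CPU works: it assumes a table of 2-bit saturating counters indexed by the last few branch outcomes, and the function name count_mispredictions and its parameters are invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only (an assumption for illustration, not a real CPU design):
 * a table of 2-bit saturating counters, indexed by the last `history_bits`
 * outcomes of the branch. A counter value >= 2 means "predict taken".
 * outcomes[i] != 0 means the branch was taken on its i-th execution.
 * Returns the number of mispredictions over the whole sequence. */
size_t count_mispredictions(const unsigned char *outcomes, size_t n,
                            unsigned history_bits)
{
    unsigned char counters[256];
    unsigned history = 0;
    unsigned mask;
    size_t misses = 0;
    size_t i;

    assert(history_bits <= 8);
    assert(outcomes != NULL || n == 0);
    mask = (1u << history_bits) - 1u;

    for (i = 0; i < 256; i++)
        counters[i] = 2;                 /* start at "weakly taken" */

    for (i = 0; i < n; i++) {
        unsigned idx = history & mask;
        int predicted_taken = counters[idx] >= 2;
        int taken = outcomes[i] != 0;

        if (predicted_taken != taken)
            misses++;

        /* Move the counter toward the actual outcome, saturating at 0 and 3. */
        if (taken) {
            if (counters[idx] < 3)
                counters[idx]++;
        } else {
            if (counters[idx] > 0)
                counters[idx]--;
        }

        /* Shift the newest outcome into the history register. */
        history = (history << 1) | (unsigned)taken;
    }
    return misses;
}
```

Feeding this model a strictly alternating taken/not-taken sequence shows the point: the branch is taken only 50% of the time, yet with history_bits = 3 the model mispredicts only during warm-up, while with history_bits = 0 (a single counter) it misses about every other execution.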


各自安好 2024-08-20 08:11:39

Yes. http://kerneltrap.org/node/4705

__builtin_expect is a method that gcc (versions >= 2.96) offers for programmers to indicate branch prediction information to the compiler. The return value of __builtin_expect is the first argument (which can only be an integer) passed to it.

if (__builtin_expect (x, 0))
        foo ();

     [This] would indicate that we do not expect to call `foo', since we
     expect `x' to be zero.
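
In practice __builtin_expect is usually wrapped in a pair of macros. The sketch below assumes the conventional names likely() and unlikely() (a convention, not something GCC defines), and sum_nonnegative is a hypothetical function invented for this example. For the 99%-true condition in the question, the if would be wrapped in likely() instead.

```c
#include <assert.h>
#include <stddef.h>

/* Conventional wrapper names (an assumption; GCC itself only provides
 * __builtin_expect). !!(x) normalizes any truthy value to exactly 1 so it
 * can be compared against the expected value. */
#if defined(__GNUC__)
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
#else
#define likely(x)   (!!(x))
#define unlikely(x) (!!(x))
#endif

/* Hypothetical example: sum an array, treating a negative entry as a rare
 * error. On error, *error is set to 1 and -1 is returned. */
long sum_nonnegative(const int *values, size_t count, int *error)
{
    long total = 0;
    size_t i;

    assert(error != NULL);
    assert(values != NULL || count == 0);
    *error = 0;

    for (i = 0; i < count; i++) {
        if (unlikely(values[i] < 0)) {   /* expected to be rare */
            *error = 1;
            return -1;
        }
        total += values[i];
    }
    return total;
}
```

The macros do not change what the code computes; they only tell the compiler which side of the branch to treat as the common case.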


伤痕我心 2024-08-20 08:11:39

Pentium 4 (aka the Netburst microarchitecture) had branch-predictor hints as prefixes to the jcc instructions, but only P4 ever did anything with them.
See http://ref.x86asm.net/geek32.html and Section 3.5 of Agner Fog's excellent asm optimization guide, from http://www.agner.org/optimize/. He has a guide to optimizing in C++, too.

Earlier and later x86 CPUs silently ignore those prefix bytes. The question "Are there any performance test results for usage of likely/unlikely hints?" mentions that PowerPC has some jump instructions which have a branch-prediction hint as part of the encoding. It's a pretty rare architectural feature. Statically predicting branches at compile time is very hard to do accurately, so it's usually better to leave it up to the hardware to figure it out.

Not much is officially published about exactly how the branch predictors and branch-target-buffers in the most recent Intel and AMD CPUs behave. The optimization manuals (easy to find on AMD's and Intel's web sites) give some advice, but don't document specific behaviour. Some people have run tests to try to divine the implementation, e.g. how many BTB entries Core2 has... Anyway, the idea of hinting the predictor explicitly has been abandoned (for now).

What is documented is, for example, that Core2 has a branch history buffer that can avoid mispredicting the loop exit if the loop always runs a constant short number of iterations (< 8 or 16, IIRC). But don't be too quick to unroll, because a loop that fits in 64 bytes (or 19 uops on Penryn) won't have instruction-fetch bottlenecks, since it replays from a buffer. Go read Agner Fog's PDFs; they're excellent.

See also "Why did Intel change the static branch prediction mechanism over these years?": as far as we can tell from performance experiments that attempt to reverse-engineer what CPUs do, Intel since Sandybridge doesn't use static prediction at all. (Many older CPUs have static prediction as a fallback when dynamic prediction misses. The normal static prediction is that forward branches are not taken and backward branches are taken, because backward branches are often loop branches.)

The likely()/unlikely() macros built on GNU C's __builtin_expect (as Drakosha's answer mentions) do not directly insert BP hints into the asm. (GCC might possibly do so with -march=pentium4, but not when compiling for anything else.)

The actual effect is to lay out the code so the fast path has fewer taken branches, and perhaps fewer instructions total. This will help branch prediction in cases where static prediction comes into play (e.g. dynamic predictors are cold, on CPUs which do fall back to static prediction instead of just letting branches alias each other in the predictor caches.)

See "What is the advantage of GCC's __builtin_expect in if else statements?" for a specific example of code-gen.

Taken branches cost slightly more than not-taken branches, even when predicted perfectly. When the CPU fetches code in chunks of 16 bytes to decode in parallel, a taken branch means that later instructions in that fetch block aren't part of the instruction stream to be executed. It creates bubbles in the front-end which can become a bottleneck in high-throughput code (which doesn't stall in the back-end on cache-misses, and has high instruction-level parallelism).

Jumping around between different blocks also potentially touches more cache-lines of code, increasing L1i cache footprint and maybe causing more instruction-cache misses if it was cold. (And potentially uop-cache footprint). So that's another advantage to having the fast path be short and linear.
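
A minimal sketch of keeping the fast path short and linear: the rare case is pushed into a separate out-of-line function, so the hot loop body stays small. The names clamp_all, clamp_and_count, and COLD_NOINLINE are hypothetical. The cold function attribute is a GCC extension that, as far as I know, arrived in GCC 4.3, so it is newer than the 4.1.2 in the question; treat its availability as an assumption. Whether the compiler actually lays the code out this way has to be checked in the generated asm.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__GNUC__)
#define unlikely(x)   __builtin_expect(!!(x), 0)
/* Assumption: a GCC new enough to know the cold attribute. */
#define COLD_NOINLINE __attribute__((cold, noinline))
#else
#define unlikely(x)   (!!(x))
#define COLD_NOINLINE
#endif

/* Hypothetical slow path, kept out of line so that its instructions do not
 * sit in the middle of the hot loop. */
static COLD_NOINLINE size_t clamp_and_count(int *value, int limit)
{
    *value = limit;
    return 1;
}

/* Clamp every element to at most `limit`; returns how many were clamped.
 * The common case (no clamping needed) is a short fall-through path. */
size_t clamp_all(int *values, size_t count, int limit)
{
    size_t clamped = 0;
    size_t i;

    assert(values != NULL || count == 0);
    for (i = 0; i < count; i++) {
        if (unlikely(values[i] > limit))
            clamped += clamp_and_count(&values[i], limit);
    }
    return clamped;
}
```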

GCC's profile-guided optimization normally makes likely/unlikely macros unnecessary. The compiler collects run-time data on which way each branch went, uses it for code-layout decisions, and identifies hot vs. cold blocks and functions. (For example, it will unroll loops in hot functions but not in cold functions.) See -fprofile-generate and -fprofile-use in the GCC manual, and the question "How to use profile guided optimizations in g++?".

Otherwise GCC has to guess using various heuristics, if you didn't use likely/unlikely macros and didn't use PGO. -fguess-branch-probability is enabled by default at -O1 and higher.
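
A minimal sketch of the PGO workflow with those two flags. The file name pgo_demo.c, the PGO_DEMO_MAIN guard, and the count_byte function are all made up for this example; the build commands in the leading comment use only -fprofile-generate and -fprofile-use as named above. The training input should resemble real workloads, otherwise the profile misleads the compiler.

```c
/* pgo_demo.c (hypothetical file name)
 *
 * Build steps:
 *   gcc -O2 -DPGO_DEMO_MAIN -fprofile-generate pgo_demo.c -o pgo_demo
 *   ./pgo_demo      (training run; writes a .gcda profile file)
 *   gcc -O2 -DPGO_DEMO_MAIN -fprofile-use pgo_demo.c -o pgo_demo
 */
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Counts bytes equal to `needle`. There is no likely()/unlikely()
 * annotation here: with PGO the compiler uses the measured branch counts
 * from the training run instead of a hand-written hint. */
size_t count_byte(const unsigned char *data, size_t length,
                  unsigned char needle)
{
    size_t hits = 0;
    size_t i;

    assert(data != NULL || length == 0);
    for (i = 0; i < length; i++) {
        if (data[i] == needle)
            hits++;
    }
    return hits;
}

#ifdef PGO_DEMO_MAIN
/* Training driver: the needle matches about 1% of the input bytes. */
int main(void)
{
    static unsigned char buffer[1 << 16];
    size_t i;

    for (i = 0; i < sizeof buffer; i++)
        buffer[i] = (unsigned char)(i % 100 == 0 ? 'x' : 'a');
    printf("%lu\n", (unsigned long)count_byte(buffer, sizeof buffer, 'x'));
    return 0;
}
#endif
```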

https://www.phoronix.com/scan.php?page=article&item=gcc-82-pgo&num=1 has benchmark results for PGO vs. regular with gcc8.2 on a Xeon Scalable Server CPU. (Skylake-AVX512). Every benchmark got at least a small speedup, and some benefited by ~10%. (Most of that is probably from loop unrolling in hot loops, but some of it is presumably from better branch layout and other effects.)
