Is it possible to tell the branch predictor how likely it is to follow the branch?

Posted 2024-08-13 08:11:39 · 173 characters · 10 views · 0 comments

Just to make it clear, I'm not going for any sort of portability here, so any solution that ties me to a certain box is fine.

Basically, I have an if statement that will evaluate to true 99% of the time, and I'm trying to eke out every last clock of performance. Can I issue some sort of compiler directive (using GCC 4.1.2 and the x86 ISA, if it matters) to tell the branch predictor that it should expect that branch to be taken?

Comments (7)

牵强ㄟ 2024-08-20 08:11:39

Yes, but it will have no effect. The exceptions are older (obsolete) architectures pre-Netburst, and even there it doesn't do anything measurable.

There is a "branch hint" opcode Intel introduced with the Netburst architecture, and a default static branch prediction for cold jumps (backward predicted taken, forward predicted not-taken) on some older architectures. GCC implements this with __builtin_expect (x, prediction), where prediction is typically 0 or 1.
The opcode emitted by the compiler is ignored on all newer processor architectures (>= Core 2). The small corner case where this actually does something is a cold jump on the old Netburst architecture. Intel now recommends against using static branch hints, probably because they consider the increase in code size more harmful than the possible marginal speed-up.

Besides the (useless) branch hint for the predictor, __builtin_expect has its uses: the compiler may reorder the code to improve cache usage or save memory.

There are multiple reasons it doesn't work as expected.

  • The processor can predict small loops (n<64) perfectly.
  • The processor can predict small repeating patterns (n~7) perfectly.
  • The processor itself can estimate the probability of a branch during runtime better than the compiler/programmer during compile time.
  • The predictability (= the probability that a branch will be predicted correctly) of a branch is far more important than the probability that the branch is taken. Unfortunately, this is highly architecture-dependent, and predicting the predictability of a branch is notoriously hard.

Read more about the inner workings of branch prediction in Agner Fog's manuals.
See also the gcc mailing list.

各自安好 2024-08-20 08:11:39

Yes. http://kerneltrap.org/node/4705

__builtin_expect is a method that gcc (versions >= 2.96) offers for programmers to indicate branch-prediction information to the compiler. The return value of __builtin_expect is the first argument (which can only be an integer) passed to it.

if (__builtin_expect (x, 0))
                foo ();

     [This] would indicate that we do not expect to call `foo', since we
     expect `x' to be zero. 
伤痕我心 2024-08-20 08:11:39

Pentium 4 (aka Netburst microarchitecture) had branch-predictor hints as prefixes to the jcc instructions, but only P4 ever did anything with them.
See http://ref.x86asm.net/geek32.html. And
Section 3.5 of Agner Fog's excellent asm opt guide, from http://www.agner.org/optimize/. He has a guide to optimizing in C++, too.

Earlier and later x86 CPUs silently ignore those prefix bytes. Are there any performance test results for usage of likely/unlikely hints? mentions that PowerPC has some jump instructions which have a branch-prediction hint as part of the encoding. It's a pretty rare architectural feature. Statically predicting branches at compile time is very hard to do accurately, so it's usually better to leave it up to hardware to figure it out.

Not much is officially published about exactly how the branch predictors and branch-target-buffers in the most recent Intel and AMD CPUs behave. The optimization manuals (easy to find on AMD's and Intel's web sites) give some advice, but don't document specific behaviour. Some people have run tests to try to divine the implementation, e.g. how many BTB entries Core2 has... Anyway, the idea of hinting the predictor explicitly has been abandoned (for now).

What is documented, for example, is that Core2 has a branch history buffer that can avoid mispredicting the loop exit if the loop always runs a constant short number of iterations, < 8 or 16 IIRC. But don't be too quick to unroll, because a loop that fits in 64 bytes (or 19 uops on Penryn) won't have instruction-fetch bottlenecks, because it replays from a buffer... go read Agner Fog's PDFs, they're excellent.

See also Why did Intel change the static branch prediction mechanism over these years?: as far as we can tell from performance experiments that attempt to reverse-engineer what CPUs do, Intel since Sandybridge doesn't use static prediction at all. (Many older CPUs use static prediction as a fallback when dynamic prediction misses. The usual static prediction is that forward branches are not-taken and backward branches are taken, because backward branches are often loop branches.)


The likely()/unlikely() macros using GNU C's __builtin_expect (like Drakosha's answer mentions) do not directly insert BP hints into the asm. (They might possibly do so with gcc -march=pentium4, but not when compiling for anything else.)

The actual effect is to lay out the code so the fast path has fewer taken branches, and perhaps fewer instructions in total. This helps branch prediction in cases where static prediction comes into play (e.g. when the dynamic predictors are cold, on CPUs that do fall back to static prediction instead of just letting branches alias each other in the predictor caches).

See What is the advantage of GCC's __builtin_expect in if else statements? for a specific example of code-gen.

Taken branches cost slightly more than not-taken branches, even when predicted perfectly. When the CPU fetches code in chunks of 16 bytes to decode in parallel, a taken branch means that later instructions in that fetch block aren't part of the instruction stream to be executed. It creates bubbles in the front-end which can become a bottleneck in high-throughput code (which doesn't stall in the back-end on cache-misses, and has high instruction-level parallelism).

Jumping around between different blocks also potentially touches more cache-lines of code, increasing L1i cache footprint and maybe causing more instruction-cache misses if it was cold. (And potentially uop-cache footprint). So that's another advantage to having the fast path be short and linear.


GCC's profile-guided optimization normally makes likely/unlikely macros unnecessary. The compiler collects run-time data on which way each branch went for code-layout decisions, and to identify hot vs. cold blocks / functions. (e.g. it will unroll loops in hot functions but not cold functions.) See -fprofile-generate and -fprofile-use in the GCC manual. How to use profile guided optimizations in g++?

Otherwise, if you don't use likely/unlikely macros and don't use PGO, GCC has to guess using various heuristics. -fguess-branch-probability is enabled by default at -O1 and higher.

https://www.phoronix.com/scan.php?page=article&item=gcc-82-pgo&num=1 has benchmark results for PGO vs. regular with gcc8.2 on a Xeon Scalable Server CPU. (Skylake-AVX512). Every benchmark got at least a small speedup, and some benefited by ~10%. (Most of that is probably from loop unrolling in hot loops, but some of it is presumably from better branch layout and other effects.)

寻梦旅人 2024-08-20 08:11:39

I suggest rather than worry about branch prediction, profile the code and optimize the code to reduce the number of branches. One example is loop unrolling and another using boolean programming techniques rather than using if statements.

Most processors love to prefetch instructions. Generally, a mispredicted branch forces the processor to flush its prefetch queue, and this is where the biggest penalty is. To reduce this penalty, rewrite (and design) the code so that it has fewer branches. Also, some processors can conditionally execute instructions without having to branch.

I've optimized a program from 1 hour of execution time to 2 minutes by using loop unrolling and large I/O buffers. Branch prediction would not have offered much time savings in this instance.

帅气尐潴 2024-08-20 08:11:39

SUN C Studio has some pragmas defined for this case.

#pragma rarely_called ()

This works if one part of a conditional expression is a function call or starts with a function call.

But there is no way to tag a generic if/while statement

无风消散 2024-08-20 08:11:39

No, because there's no assembly command to let the branch predictor know. Don't worry about it; the branch predictor is pretty smart.

Also, obligatory comment about premature optimization and how it's evil.

EDIT: Drakosha mentioned some macros for GCC. However, I believe this is a code optimization and actually has nothing to do with branch prediction.

脱离于你 2024-08-20 08:11:39

This sounds to me like overkill - this type of optimization will save tiny amounts of time. For example, using a more modern version of gcc will have a much greater influence on optimizations. Also, try enabling and disabling all the different optimization flags; they don't all improve performance.

Basically, it seems super unlikely this will make any significant difference compared to many other fruitful paths.

EDIT: thanks for the comments. I've made this community wiki, but left it in so others can see the comments.
