How has the evolution of CPU architecture affected virtual function call performance?

Posted 2024-12-02 05:29:05


Years ago I was learning about x86 assembler, CPU pipelining, cache misses, branch prediction, and all that jazz.

It was a tale of two halves. I read about all the wonderful advantages of the lengthy pipelines in the processor viz instruction reordering, cache preloading, dependency interleaving, etc.

The downside was that any deviation from the norm was enormously costly. For example, IIRC a certain AMD processor in the early-gigahertz era had a 40-cycle penalty every time you called a function through a pointer (!), and this was apparently normal.

This is not a negligible "don't worry about it" number! Bear in mind that "good design" normally means "factor your functions as much as possible" and "encode semantics in the data types" which often implies virtual interfaces.

The trade-off is that code which doesn't perform such operations might get more than two instructions per cycle. These are numbers one wants to worry about when writing high-performance C++ code which is heavy on the object design and light on the number crunching.

I understand that the long-CPU-pipeline trend has been reversing as we enter the low-power era. Here's my question:

Does the latest generation of x86-compatible processors still suffer massive penalties for virtual function calls, bad branch predictions, etc?


深海少女心 2024-12-09 05:29:05


AMD processor in the early-gigahertz era had a 40 cycle penalty every time you called a function

Huh.. so large..

There is an "indirect branch prediction" mechanism that helps predict virtual function jumps, provided the same indirect branch was taken some time ago. There is still a penalty for the first and for mispredicted virtual function jumps.

Support varies from a simple "predicted correctly if and only if the previous execution of the indirect branch went to exactly the same target" scheme to very complex two-level predictors with tens or hundreds of entries that can detect periodic alternation between 2-3 target addresses of a single indirect jmp instruction.

There was a lot of evolution here...

http://arstechnica.com/hardware/news/2006/04/core.ars/7

first introduced with the Pentium M: ... indirect branch predictor.

The indirect branch predictor

Because indirect branches load their branch targets from a register, instead of having them immediately available as is the case with direct branches, they're notoriously difficult to predict. Core's indirect branch predictor is a table that stores history information about the preferred target addresses of each indirect branch that the front end encounters. Thus when the front-end encounters an indirect branch and predicts it as taken, it can ask the indirect branch predictor to direct it to the address in the BTB that the branch will probably want.

http://www.realworldtech.com/page.cfm?ArticleID=rwt051607033728&p=3

Indirect branch prediction was first introduced with Intel’s Prescott microarchitecture and later the Pentium M.

Between 16-50% of all branch mispredicts were indirect (29% on average). The real value of indirect branch prediction is for many of the newer scripting or high-level languages, such as Ruby, Perl or Python, which use interpreters. Other common indirect-branch culprits include virtual functions (used in C++) and calls through function pointers.

http://www.realworldtech.com/page.cfm?ArticleID=RWT102808015436&p=5

AMD has adopted some of these refinements; for instance adding indirect branch predictor arrays in Barcelona and later processors. However, the K8 has older and less accurate branch predictors than the Core 2.

http://www.agner.org/optimize/microarchitecture.pdf

3.12 Indirect jumps on older processors
Indirect jumps, indirect calls, and returns may go to a different address each time. The prediction method for an indirect jump or indirect call is, in processors older than PM and K10, simply to predict that it will go to the same target as last time it was executed.

and the same pdf, page 14

Indirect jump prediction
An indirect jump or call is a control transfer instruction that has more than two possible targets. A C++ program can generate an indirect jump or call with... a virtual function. An indirect jump or call is generated in assembly by specifying a register or a memory variable or an indexed array as the destination of a jump or call instruction. Many processors make only one BTB entry for an indirect jump or call. This means that it will always be predicted to go to the same target as it did last time.
As object oriented programming with polymorphous classes has become more common, there is a growing need for predicting indirect calls with multiple targets. This can be done by assigning a new BTB entry for every new jump target that is encountered. The history buffer and pattern history table must have space for more than one bit of information for each jump incident in order to distinguish more than two possible targets.
The PM is the first x86 processor to implement this method. The prediction rule on p. 12 still applies, with the modification that the theoretical maximum period that can be predicted perfectly is m^n, where m is the number of different targets per indirect jump, because there are m^n different possible n-length subsequences. However, this theoretical maximum cannot be reached if it exceeds the size of the BTB or the pattern history table.

Agner's manual has a longer description of the branch predictors in many modern CPUs, and of the evolution of the predictors in each manufacturer's CPUs (x86/x86_64).

There are also a lot of theoretical "indirect branch prediction" methods (look in Google Scholar); even the wiki says a few words about it: http://en.wikipedia.org/wiki/Branch_predictor#Prediction_of_indirect_jumps

For Atom, from Agner's microarchitecture manual:

Prediction of indirect branches
The Atom has no pattern predictor for indirect branches according to my tests. Indirect
branches are predicted to go to the same target as last time.

So, for low power, indirect branch prediction is not so advanced. The same goes for the Via Nano:

Indirect jumps are predicted to go to the same target as last time.

I think the shorter pipelines of low-power x86 chips carry a lower penalty, around 7-20 cycles.
