How has the evolution of CPU architecture affected virtual function call performance?
Years ago I was learning about x86 assembler, CPU pipelining, cache misses, branch prediction, and all that jazz.
It was a tale of two halves. I read about all the wonderful advantages of the lengthy pipelines in the processor viz instruction reordering, cache preloading, dependency interleaving, etc.
The downside was that any deviation from the norm was enormously costly. For example, IIRC a certain AMD processor in the early-gigahertz era had a 40-cycle penalty every time you called a function through a pointer (!), and this was apparently normal.
This is not a negligible "don't worry about it" number! Bear in mind that "good design" normally means "factor your functions as much as possible" and "encode semantics in the data types" which often implies virtual interfaces.
The trade-off is that code which doesn't perform such operations might get more than two instructions per cycle. These are numbers one wants to worry about when writing high-performance C++ code which is heavy on the object design and light on the number crunching.
I understand that the long-CPU-pipeline trend has been reversing as we enter the low-power era. Here's my question:
Does the latest generation of x86-compatible processors still suffer massive penalties for virtual function calls, bad branch predictions, etc?
Huh.. so large..
There is an "indirect branch prediction" mechanism that helps predict a virtual function jump if the same indirect jump was taken some time ago. There is still a penalty for the first and for mispredicted virtual function jumps.
Support varies from a simple "predicted correctly if and only if the previous indirect branch went to exactly the same target" scheme to very complex two-level predictors with tens or hundreds of entries, capable of detecting a periodic alternation among 2-3 target addresses for a single indirect jmp instruction.
There was a lot of evolution here...
http://arstechnica.com/hardware/news/2006/04/core.ars/7
http://www.realworldtech.com/page.cfm?ArticleID=rwt051607033728&p=3
http://www.realworldtech.com/page.cfm?ArticleID=RWT102808015436&p=5
http://www.agner.org/optimize/microarchitecture.pdf
and the same pdf, page 14
Agner's manual has a longer description of the branch predictors in many modern CPUs and of how each manufacturer's predictors have evolved (x86/x86_64).
There are also many theoretical "indirect branch prediction" methods (look in Google Scholar); even Wikipedia says a few words about it: http://en.wikipedia.org/wiki/Branch_predictor#Prediction_of_indirect_jumps
For Atom, from Agner's microarchitecture PDF:
So, for low-power parts, indirect branch prediction is not that advanced. The same goes for the Via Nano:
I think the shorter pipelines of low-power x86 parts carry a lower penalty, 7-20 ticks.