Modern CPU Inner Loop Indirection Optimizations

Published 2024-09-14 13:54:47


From http://www.boost.org/community/implementation_variations.html

"... coding differences such as changing a class from virtual to non-virtual members or removing a level of indirection are unlikely to make any measurable difference unless deep in an inner loop. And even in an inner loop, modern CPUs often execute such competing code sequences in the same number of clock cycles!"

I am trying to understand the "even in the inner loop" part. Specifically what mechanisms do CPUs implement to execute the two codes (virtual vs non-virtual or an additional level of indirection) within the same number of clock cycles? I know about instruction pipelining and caching, but how is it possible to perform a virtual call within the same number of clock cycles as a non-virtual call? How is the indirection "lost"?
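
For concreteness, the two competing code shapes look something like this (a minimal C++ sketch; the types and names are illustrative, not from the question):

#include <vector>

struct Base {
    virtual ~Base() = default;
    virtual int value() const { return 1; }  // virtual: dispatched through the vtable
};

struct Derived : Base {
    int value() const override { return 2; }
};

struct Plain {
    int value() const { return 2; }          // non-virtual: a direct, often inlined, call
};

// Inner loop with indirection: each iteration loads the vtable pointer,
// loads the target address, and performs an indirect branch.
int sum_virtual(const std::vector<Base*>& v) {
    int s = 0;
    for (const Base* p : v) s += p->value();
    return s;
}

// Inner loop without indirection: the call target is known at compile time.
int sum_plain(const std::vector<Plain>& v) {
    int s = 0;
    for (const Plain& x : v) s += x.value();
    return s;
}

The question is why the first loop can end up costing the same number of cycles per iteration as the second.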


Comments (5)

独夜无伴 2024-09-21 13:54:47


Caching (e.g. branch target caching), parallel load units (part of pipelining, but also things like "hit under miss" which don't stall the pipeline), and out-of-order execution are likely to help transform a load-load-branch into something that is closer to a fixed branch. Instruction folding/elimination (what's the proper term for this?) in the decode or branch prediction stage of the pipeline may also contribute.

All of this relies on a lot of different things, though: how many different branch targets there are (e.g. how many different virtual overloads you're likely to trigger), how many things you loop over (is the branch target cache "warm"? how about the icache/dcache?), how the virtual tables or indirection tables are laid out in memory (are they cache-friendly, or is each new vtable load possibly evicting an old vtable?), whether the cache is being invalidated repeatedly due to multicore ping-ponging, etc.
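
One way to see the "how many different branch targets" factor (my own sketch, not part of the original answer): if a heterogeneous container is grouped by dynamic type before the hot loop, the indirect call in the loop changes target only at the boundaries between groups, which keeps the branch target cache warm:

#include <algorithm>
#include <typeinfo>
#include <vector>

struct Shape {
    virtual ~Shape() = default;
    virtual double area() const = 0;
};

// Group objects of the same concrete type together so the indirect branch
// in the loop below jumps to the same target for long runs of iterations.
void group_by_dynamic_type(std::vector<Shape*>& shapes) {
    std::sort(shapes.begin(), shapes.end(), [](const Shape* a, const Shape* b) {
        return typeid(*a).before(typeid(*b));
    });
}

double total_area(const std::vector<Shape*>& shapes) {
    double sum = 0.0;
    for (const Shape* s : shapes) sum += s->area();  // one indirect branch per element
    return sum;
}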

(Disclaimer: I'm definitely not an expert here, and a lot of my knowledge comes from studying in-order embedded processors, so some of this is extrapolation. If you have corrections, feel free to comment!)

The correct way to determine if it's going to be a problem for a specific program is of course to profile. If you can, do so with the help of hardware counters -- they can tell you a lot about what's going on in the various stages of the pipeline.


Edit:

As Hans Passant points out in a comment above, the key to getting these two things to take the same amount of time is the ability to effectively "retire" more than one instruction per cycle. Instruction elimination can help with this, but superscalar design is probably more important (hit under miss is a very small and specific example; fully redundant load units might be a better one).

Let's take an ideal situation, and assume a direct branch is just one instruction:

branch dest

...and an indirect branch is three (maybe you can get it in two, but it's greater than one):

load vtable from this
load dest from vtable
branch dest
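
In C++ terms, that three-instruction sequence is roughly what a virtual call lowers to (a sketch assuming a conventional vtable-based implementation; exact code generation varies by compiler and ABI, and the names here are illustrative):

struct Widget {
    virtual void draw() {}
};

void render(Widget* w) {
    w->draw();  // 1. load the vtable pointer from *w       (load vtable from this)
                // 2. load draw's address from the vtable   (load dest from vtable)
                // 3. indirect branch/call to that address  (branch dest)
}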

Let's assume an absolutely perfect situation: *this and the entire vtable are in L1 cache, and the L1 cache is fast enough to support an amortized cost of one cycle per instruction for the two loads. (You can even assume the processor reordered the loads and intermixed them with earlier instructions so they have time to complete before the branch; it doesn't matter for this example.) Also assume the branch target cache is hot, there's no pipeline flush cost for the branch, and the branch instruction comes down to a single cycle (amortized).

The theoretical minimum time for the first example is therefore 1 cycle (amortized).

The theoretical minimum for the second example, absent instruction elimination or redundant functional units or something that will allow retiring more than one instruction per cycle, is 3 cycles (there are 3 instructions)!

The indirect version will always be slower, because it has more instructions, until you reach for something like a superscalar design that allows retiring more than one instruction per cycle.

Once you have this, the minimum for both examples becomes something between 0 and 1 cycles, again, provided everything else is ideal. Arguably you have to have more ideal circumstances for the second example to actually reach that theoretical minimum than for the first example, but it's now possible.

In some of the cases you'd care about, you're probably not going to reach that minimum for either example. Either the branch target cache will be cold, or the vtable won't be in the data cache, or the machine won't be capable of reordering the instructions to take full advantage of the redundant functional units.

...this is where profiling comes in, which is generally a good idea anyway.

You can just espouse a slight paranoia about virtuals in the first place. See Noel Llopis's article on data oriented design, the excellent Pitfalls of Object-Oriented Programming slides, and Mike Acton's grumpy-yet-educational presentations. Now you've suddenly moved into patterns that the CPU is already likely to be happy with, if you're processing a lot of data.
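
A sketch of where that data-oriented mindset leads (my illustration, not taken from the cited material): instead of dispatching through a pointer per element, store each concrete type in its own contiguous array and loop over each array with direct calls:

#include <vector>

struct Circle {
    double r;
    double area() const { return 3.141592653589793 * r * r; }
};

struct Square {
    double side;
    double area() const { return side * side; }
};

struct Scene {
    std::vector<Circle> circles;  // homogeneous and contiguous: cache-friendly,
    std::vector<Square> squares;  // and no per-element indirect branch
};

double total_area(const Scene& scene) {
    double sum = 0.0;
    for (const Circle& c : scene.circles) sum += c.area();  // direct, inlinable calls
    for (const Square& q : scene.squares) sum += q.area();
    return sum;
}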

High level language features like virtual are usually a tradeoff between expressiveness and control. I honestly think, though, by just increasing your awareness of what virtual is actually doing (don't be afraid to read the disassembly view from time to time, and definitely peek at your CPU's architecture manuals), you'll tend to use it when it makes sense and not when it doesn't, and a profiler can cover the rest if needed.

One-size-fits-all statements about "don't use virtual" or "virtual use is unlikely to make a measurable difference" make me grouchy. The reality is usually more complicated, and either you're going to be in a situation where you care enough to profile or avoid it, or you're in that other 95% where it's probably not worth caring except for the possible educational content.

苏佲洛 2024-09-21 13:54:47


Pipelining is the main way.

It might take 20 clock cycles to load an instruction, decode it, perform its actions, and load indirect memory references. But thanks to the pipeline, the processor can be executing parts of 19 other instructions at the same time in different stages of the pipeline, giving an overall throughput of one instruction per clock cycle regardless of how long any single instruction actually takes to travel through the pipeline (n instructions complete in roughly n + 19 cycles, not 20n).

剪不断理还乱 2024-09-21 13:54:47


What happens, I think is that the processor has a special cache which holds the locations and targets of branches and indirect jumps. If an indirect jump is encountered at $12345678, and the last time it was encountered it went to address $12348765, the processor can start speculative execution of the instructions at address $12348765 even before it resolves the address of the branch. In many cases, within the inner loop of a function, a particular indirect jump will always jump to the same address throughout the duration of the loop. The indirect-jump cache can thus avoid branching penalties.
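
The common case described above looks something like this (a minimal sketch with illustrative names): as long as f refers to the same concrete type for the whole loop, the indirect jump at the call site has a single target that the predictor quickly learns:

struct Filter {
    virtual ~Filter() = default;
    virtual float step(float x) = 0;
};

// One indirect jump per iteration, but always to the same address as long
// as the dynamic type of 'f' never changes during the loop.
float run(Filter& f, const float* in, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; ++i) acc += f.step(in[i]);
    return acc;
}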

神魇的王 2024-09-21 13:54:47


Modern CPUs use an adaptive branch prediction technique which can predict many indirect jumps such as you get with a vtable implementation of virtual functions. See http://en.wikipedia.org/wiki/Branch_prediction#Prediction_of_indirect_jumps

吻安 2024-09-21 13:54:47


If the CPU already has the memory address in cache, then executing a load instruction is trivial, if that.
