How do I measure the impact of branch misprediction?
I'm currently profiling an implementation of binary search. Using some special instructions to measure this I noticed that the code has about a 20% misprediction rate. I'm curious if there is any way to check how many cycles I'm potentially losing due to this. It's a MIPS based architecture.
4 Answers
You're losing 0.2 * N cycles per iteration, where N is the number of cycles it takes to flush the pipeline after a mispredicted branch. Suppose N = 10; that means you are losing 2 clocks per iteration on average. Unless you have a very small inner loop, this is probably not going to be a significant performance hit.
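To make the arithmetic concrete, here is a minimal sketch of that estimate. The 10-cycle flush penalty and the branch count are assumptions; substitute values measured on your own CPU and workload:

```c
#include <stdio.h>

/* Back-of-the-envelope estimate of cycles lost to mispredictions:
 *   lost = misprediction_rate * flush_penalty * branches_executed
 * The flush penalty and branch count below are placeholders. */
int main(void)
{
    double rate         = 0.20;   /* measured misprediction rate           */
    double flush_cycles = 10.0;   /* assumed pipeline refill cost          */
    double branches     = 20e6;   /* assumed branches executed in the run  */

    double lost = rate * flush_cycles * branches;
    printf("estimated cycles lost to mispredictions: %.0f\n", lost);
    return 0;
}
```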
Look it up in the docs for your CPU. If you can't find this information specifically, the length of the CPU's pipeline is a fairly good estimate.
Given that it's MIPS and it's a 300MHz system, I'm going to guess that it has a fairly short pipeline. Probably 4-5 stages, so a cost of 3-4 cycles per mispredict is a reasonable guess.
On an in-order CPU you may be able to calculate the approximate mispredict cost as the product of the number of mispredictions and the per-mispredict cost (which is generally a function of some part of the pipeline).
On a modern out-of-order CPU, however, such a general calculation is usually not possible. There may be a large number of instructions in flight [1], only some of which are flushed by a misprediction. The surrounding code may be latency bound by one or more chains of dependent instructions, or it may be throughput bound on resources like execution units, renaming throughput, etc., or it may be somewhere in-between.
On such a core, the penalty per misprediction is very difficult to determine, even with the help of performance counters. You can find entire papers dedicated to the topic: one such paper found penalties ranging from 9 to 35 cycles when averaged across entire benchmarks. If you look at some small piece of code the range will be even larger: a penalty of zero is easy to demonstrate, and you could create a scenario where the penalty is in the hundreds of cycles.
Where does that leave you, just trying to determine the misprediction cost in your binary search? Well, a simple approach is to control the number of mispredictions and measure the difference! If you set up your benchmark input to have a range of behavior, starting with always following the same branch pattern and going all the way to a random pattern, you can plot the misprediction count versus the runtime degradation (a sketch of this experiment follows the footnote below). If you do, share your results!
[1] Hundreds of instructions in flight in the case of modern big cores such as those offered by the x86, ARM and POWER architectures.
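Here is a minimal sketch of that experiment, assuming a plain binary search over a sorted array of ints. The array size, the number of lookups, and the choice of "repeat the same key" as the perfectly predictable pattern are all placeholder decisions you would adapt to your own setup:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 20)   /* sorted array size (placeholder)         */
#define Q (1 << 22)   /* number of lookups per run (placeholder) */

/* Plain binary search: returns the index of key in a[0..n-1], or -1. */
static int bsearch_int(const int *a, int n, int key)
{
    int lo = 0, hi = n - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (a[mid] < key)      lo = mid + 1;
        else if (a[mid] > key) hi = mid - 1;
        else                   return mid;
    }
    return -1;
}

/* Small portable PRNG so the random keys don't depend on RAND_MAX. */
static unsigned int xorshift32(unsigned int *s)
{
    *s ^= *s << 13; *s ^= *s >> 17; *s ^= *s << 5;
    return *s;
}

/* Time Q lookups with the given key sequence; returns seconds. */
static double run(const int *a, const int *keys)
{
    struct timespec t0, t1;
    volatile int sink = 0;   /* keep the searches from being optimized away */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < Q; i++)
        sink += bsearch_int(a, N, keys[i]);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sink;
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

int main(void)
{
    int *a    = malloc(N * sizeof *a);
    int *same = malloc(Q * sizeof *same);
    int *rnd  = malloc(Q * sizeof *rnd);
    unsigned int seed = 12345;

    for (int i = 0; i < N; i++) a[i] = 2 * i;        /* sorted, even values      */
    for (int i = 0; i < Q; i++) same[i] = a[N / 2];  /* same key: same branch path every lookup */
    for (int i = 0; i < Q; i++)                      /* random keys: near-random branch pattern */
        rnd[i] = 2 * (int)(xorshift32(&seed) % N);

    double t_same = run(a, same);
    double t_rnd  = run(a, rnd);
    printf("predictable pattern: %.3f s\n", t_same);
    printf("random pattern:      %.3f s\n", t_rnd);
    printf("difference:          %.3f s\n", t_rnd - t_same);

    free(a); free(same); free(rnd);
    return 0;
}
```

One caveat on the design: the random run also touches different cache lines than the repeated-key run, so the measured difference includes some memory effects in addition to the mispredictions; keeping the array small enough to fit in cache helps isolate the branch cost. Dividing the extra time by the number of extra mispredictions reported by your counters, and converting to cycles at your clock frequency, then gives a rough per-misprediction penalty.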
Look at your specs for that info and, if that fails, run it a billion times and time it externally to your program (a stopwatch or something). Then run it again without the mispredictions and compare.