How to benchmark on multi-core processors
I am looking for ways to perform micro-benchmarks on multi-core processors.
Context:
At about the same time that desktop processors introduced out-of-order execution, which made performance hard to predict, they also (perhaps not coincidentally) introduced special instructions for obtaining very precise timings. Examples of these instructions are rdtsc on x86 and mftb on PowerPC. These instructions gave timings more precise than a system call could ever provide, and let programmers micro-benchmark their hearts out, for better or for worse.
On a yet more modern processor with several cores, some of which sleep some of the time, the counters are not synchronized between cores. We are told that rdtsc is no longer safe to use for benchmarking, but I must have been dozing off when the alternative solutions were explained to us.
Question:
Some systems may save and restore the performance counter and provide an API call to read the proper sum. If you know what this call is for any operating system, please let us know in an answer.
Some systems may allow turning off cores, leaving only one running. I know Mac OS X Leopard does when the right Preference Pane is installed from the Developer Tools. Do you think that this makes rdtsc safe to use again?
More context:
Please assume I know what I am doing when trying to do a micro-benchmark. If you are of the opinion that if an optimization's gains cannot be measured by timing the whole application, it's not worth optimizing, I agree with you, but
I cannot time the whole application until the alternative data structure is finished, which will take a long time. In fact, if the micro-benchmark were not promising, I could decide to give up on the implementation now;
I need figures to provide in a publication whose deadline I have no control over.
Comments (2)
On OS X (ARM, Intel and PowerPC), you want to use mach_absolute_time(). Note that there's no need to limit yourself to one core for this. The OS handles the fix-up required behind the scenes for mach_absolute_time() to give meaningful results in a multi-core (and multi-socket) environment.
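As a rough illustration of the suggestion above, here is a minimal sketch of a nanosecond timer built on mach_absolute_time(). The helper name now_ns is hypothetical, and the conversion through mach_timebase_info() is my assumption about how the tick count would normally be turned into nanoseconds. The non-Apple branch, which falls back to clock_gettime(CLOCK_MONOTONIC), is not part of the answer and is only there so the sketch compiles elsewhere.

```c
#ifndef __APPLE__
#define _POSIX_C_SOURCE 199309L   /* exposes clock_gettime() on non-Apple POSIX systems */
#endif

#include <assert.h>
#include <stdint.h>

#ifdef __APPLE__
#include <mach/mach_time.h>
#else
#include <time.h>
#endif

/* Hypothetical helper: current monotonic time in nanoseconds. */
static uint64_t now_ns(void)
{
#ifdef __APPLE__
    /* The timebase is constant for the life of the process, so query it once. */
    static mach_timebase_info_data_t tb;   /* zero-initialised */
    if (tb.denom == 0) {
        kern_return_t kr = mach_timebase_info(&tb);
        assert(kr == KERN_SUCCESS && tb.denom != 0);
    }
    /* ticks * numer / denom converts timer ticks to nanoseconds. */
    return mach_absolute_time() * tb.numer / tb.denom;
#else
    /* Assumption: a monotonic POSIX clock stands in for mach_absolute_time(). */
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
#endif
}
```

A measurement would then read now_ns() before and after the code under test and subtract the two values. Repeating the measurement and keeping the minimum is a common way to reduce noise, though that choice is mine and not something the answer prescribes.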
The cores return correctly synchronized values for "rdtsc". If you have a multi-socket machine, you have to pin the process to one socket. This is not the problem.
The main problem is that the scheduler makes the data unreliable.
There is a performance API in Linux kernels > 2.6.31, but I haven't looked at it.
Windows (Vista and later) does a great job here: use QueryThreadCycleTime and QueryProcessCycleTime.
I'm not sure about OS X, but AFAIK "mach_absolute_time" does not account for the time during which the thread was scheduled out.
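To make the two points above concrete (pinning, and excluding time spent scheduled out), here is a minimal Linux-specific sketch. It is not the Windows API the answer names. It assumes that sched_setaffinity() is an acceptable way to pin the calling thread (to a single core, which is stricter than a single socket) and that clock_gettime(CLOCK_THREAD_CPUTIME_ID) is a rough POSIX analogue of QueryThreadCycleTime, reporting CPU time in nanoseconds rather than cycles. The helper names pin_to_current_cpu and thread_cpu_ns are hypothetical.

```c
#define _GNU_SOURCE               /* for sched_getcpu(), sched_setaffinity(), CPU_* macros */
#include <assert.h>
#include <sched.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: pin the calling thread to the CPU it is running on.
 * Returns 0 on success, -1 on failure. */
static int pin_to_current_cpu(void)
{
    int cpu = sched_getcpu();
    if (cpu < 0)
        return -1;

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* pid 0 means "the calling thread". */
    return sched_setaffinity(0, sizeof set, &set);
}

/* Hypothetical helper: CPU time consumed so far by the calling thread, in
 * nanoseconds. Time during which the thread was scheduled out is not counted. */
static uint64_t thread_cpu_ns(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
    assert(rc == 0);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}
```

One would call pin_to_current_cpu() once at start-up, then take the difference of two thread_cpu_ns() readings around the code under test. The resolution of this clock is likely coarser than a raw cycle counter, so it suits loops that run for microseconds or longer rather than single instructions.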