Profiler-reported time vs. real time - why the discrepancy?
I have two chunks of code that do the same operation. One chunk was written by myself, the other by a third party. They are both compiled into a single executable. The third party code appears to be able to do its job much faster than mine: it can perform 1,500 operations per second compared to my 500. I then ran the executable within VTune, employing the call-graph profiling option, hoping this would reveal where I was wasting time. Unfortunately the VTune diagnostics, which show the number of microseconds it thinks each function takes, claim that both my function and the third party function are taking about 0.002 seconds per call. That appears spot on for my code but is completely at odds with my (manual) measurement of the speed of the third party code.
How can this happen?
EDIT: both chunks of code are large and call their own complex trees of sub functions.
EDIT: I should point out that the third party code is pure C++ whereas my code is essentially C code that has just been compiled in a C++ compiler.
EDIT: VTune is a very complex package with loads of configuration options I don't understand. Might there be some settings to play with that may reduce this inaccuracy?
3 Answers
Your definition of 'true timings' might need revision. You cannot claim that the profiler is wrong when comparing apples and pears.
Profilers can be used for relative timing; use a profiler to find the 'hot-spot' in your code, then use the information to optimize that area.
On a practical note: look for a sampling profiler, which usually has much less overhead/impact than a tracing/instrumenting profiler.
(PS: also read up on Schrödinger/Heisenberg)
I've seen cases where profilers artificially inflate the reported time for certain functions/system calls. It could be that the 3rd party library is using some such call and getting pegged for it.
Have you tried using the high performance clock (gethrtime in Solaris or QueryPerformanceCounter in Windows) and measuring the total times of the functions as a sanity check?
Your operations sound really slow to be CPU bound - are they I/O bound? Is your I/O code less optimized than the library's? That wouldn't necessarily show up in a CPU profile report at all.
If you are using wall time (i.e., elapsed seconds instead of CPU counters), you also need to account for time spent in blocking system calls. For example, assuming you aren't doing much file I/O, you are probably spending a lot of time printing information to the console. Console I/O will not show up as CPU time since most of that time is simply waiting to update the console.
You can use GetThreadTimes(...) to determine how much time you are spending in your code vs. system code. I have used this, together with system call sampling, to reduce context switches (and ultimately increase overall performance).