Profiling C++ in the presence of aggressive inlining?
I am trying to figure out where my C++ program is spending its time, using gprof. Here's my dilemma: if I compile with the same optimization settings I use for my release build, pretty much everything gets inlined, and gprof tells me, unhelpfully, that 90% of my time is spent in a core routine, where everything was inlined. On the other hand, if I compile with inlining disabled, the program runs an order of magnitude slower.
I want to find out how much time procedures called from my core routine are taking, when my program is compiled with inlining enabled.
I am running 64-bit Ubuntu 9.04 on a quad-core Intel machine. I looked into google-perftools, but that doesn't seem to work well on x86_64. Running on a 32-bit machine is not an option.
Does anyone have suggestions as to how I can more effectively profile my application, when inlining is enabled?
Edit: Here is some clarification of my problem. I apologize if it was not clear initially.
I want to find out where the time is being spent in my application. Profiling my optimized build resulted in gprof telling me that ~90% of the time is spent in main, where everything was inlined. I already knew that before profiling!
What I want to find out is how much time the inlined functions are taking, preferably without disabling optimization or inlining in my build options. When profiling with inlining disabled, the application is something like an order of magnitude slower. That difference in execution time is a convenience issue, but, more importantly, I am not confident that the performance profile of a build with inlining disabled will correspond strongly to the performance profile of a build with inlining enabled.
In short: is there a way to get useful profiling information on a C++ program without disabling optimization or inlining?
I assume what you want to do is find out which lines of code are costing you enough to be worth optimizing. That is very different from timing functions. You can do better than gprof.
Here's a fairly complete explanation of how to do it.
You can do it by hand, or use one of the profilers that can provide the same information, such as oprofile and RotateRight/Zoom.
BTW, inlining is of significant value only if the routines being inlined are small and don't call functions themselves, and if the lines where they are being called are active enough of the time to be significant.
As for the order of magnitude performance ratio between debug and release build, it may be due to a number of things, maybe or maybe not the inlining. You can use the stackshot method mentioned above to find out for certain just what's going on in either case. I've found that debug builds can be slow for other reasons, like recursive data structure validation, for example.
You can use a more powerful profiler, such as Intel's VTune, which can give you performance detail down to the level of individual assembly instructions.
http://software.intel.com/en-us/intel-vtune/
It's for Windows and Linux, but does cost money...
Develop a few macros using the high-performance timing mechanism of your CPU (e.g., RDTSC on x86), i.e., routines that don't rely on system calls, and bind the single thread running your core loop to a specific CPU (set its affinity). You would need to implement the following macros.
I had something like this that I placed in every function I was interested in. I made sure the macros placed the timing calls into the context of the call tree -- this is possibly the most accurate way to profile.
Note:
This method is driven by your code -- it does not rely on an external tool to snoop your code in any way. Snooping, sampling, and interrupt-driven profiling are inaccurate when it comes to small sections of code. Besides, you want to control where and when the timing data is collected -- at specific constructs in your code, such as loops, the beginning of a recursive call chain, or mass memory allocations.
-- edit --
You might be interested in the link from this answer to one of my questions.
Would valgrind be any more helpful?
Combined with the KCachegrind GUI, it offers a free and easy way to browse annotated code, and it copes well with inlined code.
Here is a pretty straightforward guide: http://web.stanford.edu/class/cs107/guide_callgrind.html
You can use gcov to give you line-by-line execution counts. This should at least tell you which inlined functions are the bottleneck.
It doesn't matter that the code is running slower (your convenience aside, of course) - the profiler will still tell you the correct proportion of time spent in each function.