Profiling C++ in the presence of aggressive inlining?

Posted 2024-08-18 10:14:34

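I am trying to use gprof to find out where my C++ program spends its time. Here is my dilemma: if I compile with the same optimization settings I use for my release build, nearly everything gets inlined, and gprof tells me, unhelpfully, that I spend 90% of my time in a core routine, where everything was inlined. On the other hand, if I compile with inlining disabled, the program runs an order of magnitude slower.

I want to find out how much time the procedures called from my core routine take when my program is compiled with inlining enabled.

I am running 64-bit Ubuntu 9.04 on a quad-core Intel machine. I looked into google-perftools, but it does not seem to work well on x86_64. Running on a 32-bit machine is not an option.

Does anyone have suggestions for profiling my application more effectively when inlining is enabled?

Edit: Here is some clarification of my question. I apologize if it was not clear at first.

I want to know where the time is being spent in my application. Profiling my optimized build resulted in gprof telling me that about 90% of the time is spent in main, where everything was inlined. I already knew that before profiling!

What I want to find out is how much time the inlined functions take, preferably without disabling optimization or inlining in my build options. When profiling with inlining disabled, the application is roughly an order of magnitude slower. That difference in execution time is partly a matter of convenience, but I am also not confident that the performance profile of the program built with inlining disabled will correspond closely to the profile of the program built with inlining enabled.

In short: is there a way to get useful profiling information about a C++ program without disabling optimization or inlining?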


深爱不及久伴 2024-08-25 10:14:34

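I assume what you want to do is find out which lines of code cost enough to be worth optimizing. That is very different from timing functions. You can do better than gprof.

Here is a fairly complete explanation of how to do that.

You can do it by hand, or use one of the profilers that can provide the same information, such as oprofile and RotateRight/Zoom.

By the way, inlining is of significant value only if the routines being inlined are small and do not call functions themselves, and if the lines that call them are active for enough of the time to matter.

As for the order-of-magnitude performance ratio between the debug and release builds, it may be due to a number of things, and inlining may or may not be one of them. You can use the stackshot method mentioned above to find out for certain what is going on in either case. I have found that debug builds can be slow for other reasons, such as recursive data structure validation.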


网白 2024-08-25 10:14:34

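You can use a more powerful profiler, such as Intel's VTune, which can give you performance detail down to the level of individual lines of assembly.

http://software.intel.com/en-us/intel-vtune/

It is available for Windows and Linux, but it does cost money...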


夜访吸血鬼 2024-08-25 10:14:34

Develop a few macros using the high-performance timing mechanism of your CPU (e.g., on x86), that is, the routines that do not rely on system calls, and bind the single thread running your core loop to a specific CPU (set its affinity). You would need to implement the following macros.

PROF_INIT //allocate any variables -- probably a const char
PROF_START("name") // start a timer
PROF_STOP() // end a timer and calculate the difference,
            // which you write out using an async fd

I had something like this that I placed in every function I was interested in, and I made sure the macros placed the timing calls into the context of the call tree. This is possibly the most accurate way to profile.
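To make the idea concrete, here is a minimal sketch of what such macros might look like. It is not the answerer's implementation: everything in the `prof` namespace is hypothetical, it uses the portable `std::chrono::steady_clock` in place of a raw CPU counter, and it accumulates totals in memory for a later report instead of writing each measurement to an async file descriptor. Nested timers are keyed by their path (for example `outer/inner`), which is one simple way to keep timings in call-tree context.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper namespace; none of these names come from the answer.
namespace prof {

using Clock = std::chrono::steady_clock;  // assumption: stands in for a CPU counter

struct Frame {
    std::string path;          // call-tree path, e.g. "outer/inner"
    Clock::time_point start;   // when this timer was started
};

struct Stats {
    std::uint64_t calls = 0;     // number of completed START/STOP pairs
    std::uint64_t total_ns = 0;  // accumulated wall time in nanoseconds
};

struct State {
    std::vector<Frame> stack;             // timers that are currently open
    std::map<std::string, Stats> totals;  // accumulated results per path
};

// One State per thread, so timers on different threads never interfere.
inline State& state() {
    thread_local State s;
    return s;
}

inline void start(const char* name) {
    State& s = state();
    std::string path = s.stack.empty() ? std::string(name)
                                       : s.stack.back().path + "/" + name;
    s.stack.push_back(Frame{std::move(path), Clock::time_point{}});
    s.stack.back().start = Clock::now();  // read the clock last to exclude setup
}

inline void stop() {
    const Clock::time_point end = Clock::now();  // read the clock first
    State& s = state();
    assert(!s.stack.empty() && "PROF_STOP without a matching PROF_START");
    if (s.stack.empty()) return;  // in release builds, ignore the stray call
    Frame f = std::move(s.stack.back());
    s.stack.pop_back();
    Stats& st = s.totals[f.path];
    st.calls += 1;
    st.total_ns += static_cast<std::uint64_t>(
        std::chrono::duration_cast<std::chrono::nanoseconds>(end - f.start).count());
}

// Print one line per call-tree path for the calling thread.
inline void report(std::FILE* out) {
    for (const auto& kv : state().totals) {
        std::fprintf(out, "%-30s calls=%llu total_ns=%llu\n", kv.first.c_str(),
                     static_cast<unsigned long long>(kv.second.calls),
                     static_cast<unsigned long long>(kv.second.total_ns));
    }
}

}  // namespace prof

#define PROF_INIT        (void)prof::state()
#define PROF_START(name) prof::start(name)
#define PROF_STOP()      prof::stop()
```

A function of interest would call `PROF_START("name")` on entry and `PROF_STOP()` before returning, and the program would call `prof::report(stderr)` once at the end. The bookkeeping here (a string concatenation and a map lookup per timer) is itself not free, so for very small inlined functions the measured time will include a noticeable share of profiling overhead.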

Note:

This method is driven by your code and does not rely on an external tool to snoop on your code in any way. Snooping, sampling, and interrupt-driven profiling are inaccurate when it comes to small sections of code. Besides, you want to control where and when the timing data is collected, for example at specific constructs in your code such as loops, the beginning of a recursive call chain, or mass memory allocations.
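The other step this answer mentions, binding the thread that runs the core loop to one CPU, could be sketched on Linux roughly as follows. This is an assumption-laden illustration rather than the answerer's code: the function names are hypothetical, and it relies on the glibc `sched_getaffinity` / `sched_setaffinity` calls, so it will not build on other platforms. It pins to the first CPU the thread is already allowed to use, which avoids failing inside a restricted cpuset.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // needed for cpu_set_t and the CPU_* macros in glibc
#endif
#include <sched.h>   // sched_getaffinity, sched_setaffinity (Linux-specific)

#include <cassert>

// Hypothetical helper: pin the calling thread to the first CPU it is
// currently allowed to run on. Returns that CPU index, or -1 on failure.
inline int pin_to_first_allowed_cpu() {
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    // A pid of 0 means "the calling thread".
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0) return -1;

    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &allowed)) {
            cpu_set_t one;
            CPU_ZERO(&one);
            CPU_SET(cpu, &one);
            if (sched_setaffinity(0, sizeof(one), &one) != 0) return -1;
            return cpu;
        }
    }
    return -1;  // no allowed CPU found (should not normally happen)
}

// Hypothetical helper: true if the calling thread is restricted to exactly `cpu`.
inline bool is_pinned_to(int cpu) {
    assert(cpu >= 0 && cpu < CPU_SETSIZE);  // precondition for CPU_ISSET
    cpu_set_t now;
    CPU_ZERO(&now);
    if (sched_getaffinity(0, sizeof(now), &now) != 0) return false;
    return CPU_COUNT(&now) == 1 && CPU_ISSET(cpu, &now);
}
```

Calling `pin_to_first_allowed_cpu()` at the start of the thread that runs the core loop, before any timers are started, keeps that thread from migrating between cores during measurement, which matters mostly when the timer reads a per-core counter.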

-- edit --

You might be interested in the link from this answer to one of my questions.
