Tool to evaluate callgrind call profiles?
Somehow related to this question, which tool would you recommend to evaluate the profiling data created with callgrind?
It does not have to have a graphical interface, but it should prepare the results in a concise, clear, and easy-to-interpret way. I know about e.g. KCacheGrind, but that program is missing some features, such as exporting the data of the displayed tables or simply copying lines from the display.
Years ago I wrote a profiler to run under DOS.
If you are using KCacheGrind, here's what I would have it do. It might not be too difficult to write, or you can just do it by hand.
KCacheGrind has a toolbar button "Force Dump", with which you can trigger a dump manually at a random time. The capture of stack traces at random or pseudo-random times, in the interval when you are waiting for the program, is the heart of the technique.
Not many samples are needed - 20 is usually more than enough. If a bottleneck costs a large amount, like more than 50%, 5 samples may be quite enough.
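If you would rather take such samples outside of KCacheGrind, here is a minimal sketch of one way to do it. It is an illustration under stated assumptions, not part of the original workflow: it assumes a Linux host, gdb on the PATH, permission to attach to the target process, and the process id passed on the command line.

```python
#!/usr/bin/env python3
# Rough sketch of "random pause" stack sampling with gdb.
# Assumptions (not from the answer above): Linux, gdb installed,
# and ptrace permission to attach to the target pid.
import random
import subprocess
import sys
import time

def take_stack_sample(pid: int) -> str:
    """Attach with gdb in batch mode, print a backtrace, detach."""
    result = subprocess.run(
        ["gdb", "-p", str(pid), "-batch", "-ex", "bt"],
        capture_output=True, text=True,
    )
    return result.stdout

def main() -> None:
    pid = int(sys.argv[1])
    for i in range(20):                       # ~20 samples is usually plenty
        time.sleep(random.uniform(0.5, 2.0))  # pause at a pseudo-random time
        print(f"--- sample {i + 1} ---")
        print(take_stack_sample(pid))

if __name__ == "__main__":
    main()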
The processing of the samples is very simple. Each stack trace consists of a series of lines of code (actually addresses), where all but the last are function/method calls.
Collect a list of all the lines of code that appear on the samples, and eliminate duplicates.
For each line of code, count what fraction of samples it appears on. For example, if you take 20 samples, and the line of code appears on 3 of them, even if it appears more than once in some sample (due to recursion) the count is 3/20 or 15%. That is a direct measure of the cost of each statement.
Display the most costly 100 or so lines of code. Your bottlenecks are in that list.
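Here is a minimal sketch of that bookkeeping. The sample data and the "file:line" frame format are made up for illustration; the point is the set() per sample, which makes recursion count only once, and the fraction computed over all samples.

```python
# Minimal sketch of the processing described above. Samples are assumed
# to already be parsed into lists of frames, each a "file:line" string;
# the sample data here is hypothetical.
from collections import Counter

samples = [
    ["main.c:10", "io.c:42", "io.c:42", "read.c:7"],  # recursion on io.c:42
    ["main.c:10", "sort.c:88"],
    ["main.c:10", "io.c:42", "read.c:7"],
]

# For each line of code, count the fraction of samples it appears on.
# Using set() per sample means recursion counts only once, as described.
appearances = Counter()
for stack in samples:
    for line in set(stack):
        appearances[line] += 1

# Display the most costly lines first; the bottlenecks are in this list.
for line, count in appearances.most_common(100):
    print(f"{count / len(samples):6.0%}  {line}")
```

Run against the hypothetical data, this prints main.c:10 at 100%, io.c:42 and read.c:7 at 67%, and sort.c:88 at 33%.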
What I typically do with this information is choose a line with high cost, and then manually take stack samples until it appears (or look at the ones I've already got), and ask myself "Why is it doing that line of code, not just in a local sense, but in a global sense?" Another way to put it is "What, in a global sense, was the program trying to accomplish at the time slice when that sample was taken?" The reason I ask this is that it tells me whether it was really necessary to be spending what that line is costing.
I don't want to be critical of all the great work people do developing profilers, but sadly there is a lot of firmly entrenched myth on the subject, including:
that precise measuring, with lots of samples, is important. Rather, the emphasis should be on finding the bottlenecks. Precise measurement is not a prerequisite for that. For typical bottlenecks, costing between 10% and 90%, the measurement can be quite coarse.
that functions matter more than lines of code. If you find a costly function, you still have to search within it for the lines that are the bottleneck. That information is right there, in the stack traces - no need to hunt for it.
that you need to distinguish CPU from wall-clock time. If you're waiting for it, it's wall-clock time (wrist-watch time?). If you have a bottleneck consisting of extraneous I/O, for example, do you want to ignore that because it's not CPU time?
that the distinction between exclusive time and inclusive time is useful. That only makes sense if you're timing functions and you want some clue as to whether the time is spent in callees. If you look at lines of code, the only thing that matters is inclusive time. Another way to put it is, every instruction is a call instruction, even if it only calls microcode.
that recursion matters. It is irrelevant, because it doesn't affect the fraction of samples a line appears on, which is the cost that line is responsible for.
that the invocation count of a line or function matters. Whether it's fast and is called too many times, or slow and called once, the cost is the percent of time it uses, and that's what the stack samples estimate.
that performance of sampling matters. I don't mind taking a stack sample and looking at it for several minutes before continuing, assuming that doesn't make the bottlenecks move.
Here's a more complete explanation.
There are some CLI tools for working with callgrind data, such as callgrind_annotate (shipped with Valgrind); the cachegrind tooling can also show some information from callgrind.out.
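Since the missing feature in KCacheGrind was data export, one workaround is to read callgrind.out yourself; the format is plain text. The following is a rough sketch that extracts per-function self cost for the first event counter. It handles only the common parts of the format (the "fn=" name compression, and skipping the cost lines that belong to "calls=" records), so treat it as a starting point rather than a complete parser.

```python
# Rough sketch: per-function self cost from a callgrind.out file.
# Covers only the common parts of the format; positions other than
# plain line numbers (e.g. "positions: instr line") are not handled.
import sys
from collections import Counter

def self_costs(path: str) -> Counter:
    costs = Counter()
    names = {}          # compressed fn ids like "(1)" -> names
    current_fn = "?"
    skip_next = False   # the cost line right after "calls=" is the call's cost
    with open(path) as f:
        for raw in f:
            line = raw.strip()
            if line.startswith("fn="):
                spec = line[3:]
                if spec.startswith("(") and " " in spec:   # "fn=(1) name"
                    fid, name = spec.split(" ", 1)
                    names[fid] = name
                    current_fn = name
                elif spec.startswith("("):                 # "fn=(1)" reference
                    current_fn = names.get(spec, spec)
                else:                                      # uncompressed name
                    current_fn = spec
            elif line.startswith("calls="):
                skip_next = True
            elif line and (line[0].isdigit() or line[0] in "+-*"):
                if skip_next:            # inclusive cost of a call, not self cost
                    skip_next = False
                    continue
                parts = line.split()
                if len(parts) >= 2 and parts[1].isdigit():
                    costs[current_fn] += int(parts[1])     # first event only
    return costs

if __name__ == "__main__":
    for fn, cost in self_costs(sys.argv[1]).most_common(20):
        print(f"{cost:>12}  {fn}")
```

Invoked as `python3 script.py callgrind.out.<pid>`, this prints the twenty most expensive functions by self cost, which you can then redirect or paste anywhere you like.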