linux perf: how to interpret and find hotspots
I tried out Linux's perf utility today and am having trouble interpreting its results. I'm used to valgrind's callgrind, which is of course a totally different approach from the sampling-based method of perf.
What I did:
perf record -g -p $(pidof someapp)
perf report -g -n
Now I see something like this:
+ 16.92% kdevelop libsqlite3.so.0.8.6 [.] 0x3fe57
+ 10.61% kdevelop libQtGui.so.4.7.3 [.] 0x81e344
+ 7.09% kdevelop libc-2.14.so [.] 0x85804
+ 4.96% kdevelop libQtGui.so.4.7.3 [.] 0x265b69
+ 3.50% kdevelop libQtCore.so.4.7.3 [.] 0x18608d
+ 2.68% kdevelop libc-2.14.so [.] memcpy
+ 1.15% kdevelop [kernel.kallsyms] [k] copy_user_generic_string
+ 0.90% kdevelop libQtGui.so.4.7.3 [.] QTransform::translate(double, double)
+ 0.88% kdevelop libc-2.14.so [.] __libc_malloc
+ 0.85% kdevelop libc-2.14.so [.] memcpy
...
Ok, these functions might be slow, but how do I find out where they are getting called from? As all these hotspots lie in external libraries I see no way to optimize my code.
Basically I am looking for some kind of callgraph annotated with accumulated cost, where my functions have a higher inclusive sampling cost than the library functions I call.
Is this possible with perf? If so - how?
Note: I found out that "E" unwraps the callgraph and gives somewhat more information. But the callgraph is often not deep enough and/or terminates randomly without giving information about how much time was spent where. Example:
- 10.26% kate libkatepartinterfaces.so.4.6.0 [.] Kate::TextLoader::readLine(int&...
     Kate::TextLoader::readLine(int&, int&)
     Kate::TextBuffer::load(QString const&, bool&, bool&)
     KateBuffer::openFile(QString const&)
     KateDocument::openFile()
     0x7fe37a81121c
Could it be an issue that I'm running on 64 bit? See also: http://lists.fedoraproject.org/pipermail/devel/2010-November/144952.html (I'm not using fedora but seems to apply to all 64bit systems).
With Linux 3.7, perf is finally able to use DWARF information to generate the callgraph:
perf record --call-graph dwarf -- yourapp
perf report -g graph --no-children
Neat, but the curses GUI is horrible compared to VTune, KCacheGrind or similar... I recommend trying out FlameGraphs instead, which is a pretty neat visualization: http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
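To give a feel for what a flame graph is built from: the FlameGraph scripts consume "folded" stacks, one line per unique call stack, with frames separated by semicolons and followed by a sample count. The Python sketch below only illustrates that folding step on made-up samples; the function name fold_stacks and the sample data are hypothetical, and in practice the stackcollapse-perf.pl script from the FlameGraph repository does this for real perf script output.

```python
from collections import Counter


def fold_stacks(samples):
    """Collapse sampled call stacks into folded lines: 'root;...;leaf count'."""
    counts = Counter(";".join(stack) for stack in samples)
    # Sort by stack string so the output is stable between runs.
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]


# Hypothetical samples, each listed root-first (caller before callee).
samples = [
    ["main", "load_file", "read_line"],
    ["main", "load_file", "read_line"],
    ["main", "load_file", "memcpy"],
    ["main", "render"],
]

for line in fold_stacks(samples):
    print(line)
```

Each folded line becomes one tower in the flame graph, and the width of a frame is proportional to the number of samples in which it appears.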
Note: In the report step, -g graph makes the output show easy-to-understand "relative to total" percentages rather than "relative to parent" numbers. --no-children will show only self cost rather than inclusive cost, a feature that I also find invaluable.
If you have a recent perf and an Intel CPU, also try out the LBR unwinder, which has much better performance and produces far smaller result files:
perf record --call-graph lbr -- yourapp
The downside here is that the call stack depth is more limited compared to the default DWARF unwinder configuration.
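If the difference between self cost and inclusive cost is unclear, the following Python sketch may help. It is not how perf computes its numbers internally, just an illustration on hypothetical stack samples: self cost is charged only to the function that was executing when the sample was taken, while inclusive cost is charged to every function on the stack. The function name self_and_inclusive and the sample data are made up for this example.

```python
from collections import Counter


def self_and_inclusive(samples):
    """Return (self, inclusive) cost per function as percentages of all samples.

    Each sample is a call stack listed root-first; the last frame is the
    function that was executing when the sample was taken.
    """
    total = len(samples)
    self_hits, incl_hits = Counter(), Counter()
    for stack in samples:
        self_hits[stack[-1]] += 1      # only the leaf frame gets self cost
        for func in set(stack):        # set(): count a recursive frame once
            incl_hits[func] += 1

    def as_percent(hits):
        return {func: 100.0 * n / total for func, n in hits.items()}

    return as_percent(self_hits), as_percent(incl_hits)


# Hypothetical samples: the application code calls into library functions.
samples = [
    ["main", "open_file", "read_line", "memcpy"],
    ["main", "open_file", "read_line", "memcpy"],
    ["main", "open_file", "read_line"],
    ["main", "paint", "translate"],
]

self_pct, incl_pct = self_and_inclusive(samples)
for func in sorted(incl_pct, key=incl_pct.get, reverse=True):
    print(f"{func:10s} self={self_pct.get(func, 0.0):5.1f}%  "
          f"inclusive={incl_pct[func]:5.1f}%")
```

In this toy data, main has no self cost at all but the highest inclusive cost, which is the kind of view the question asks for.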
You should give hotspot a try:
https://www.kdab.com/hotspot-gui-linux-perf-profiler/
It's available on github: https://github.com/KDAB/hotspot
For example, it can generate flamegraphs for you.
Are you sure that your application someapp is built with the gcc option -fno-omit-frame-pointer (and possibly its dependent libraries)? Something like this:
g++ -fno-omit-frame-pointer -O2 main.cpp
You can get a very detailed, source-level report with perf annotate; see "Source level analysis with perf annotate". It will look something like this (shamelessly stolen from the website):
Don't forget to pass the -fno-omit-frame-pointer and -ggdb flags when you compile your code.
Unless your program has very few functions and hardly ever calls a system function or I/O, profilers that sample the program counter won't tell you much, as you're discovering.
In fact, the well-known profiler gprof was created specifically to try to address the uselessness of self-time-only profiling (not that it succeeded).
What actually works is something that samples the call stack (thereby finding out where the calls are coming from), on wall-clock time (thereby including I/O time), and reports by line or by instruction (thereby pinpointing the function calls that you should investigate, not just the functions they live in).
Furthermore, the statistic you should look for is percent of time on stack, not number of calls, not average inclusive function time. Especially not "self time".
If a call instruction (or a non-call instruction) is on the stack 38% of the time, then if you could get rid of it, how much would you save? 38%!
Pretty simple, no?
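To make the arithmetic concrete, here is a small Python sketch on hypothetical wall-clock stack samples, where each frame is written as a made-up "function:line" call site. It computes the fraction of samples in which a call site is on the stack, and the overall speedup you would get if that time disappeared completely, assuming the fraction is below 100%. The names fraction_on_stack and speedup_if_removed are illustrative only.

```python
def fraction_on_stack(samples, frame):
    """Fraction of samples in which `frame` appears anywhere on the stack."""
    return sum(frame in stack for stack in samples) / len(samples)


def speedup_if_removed(fraction):
    """Overall speedup factor if the time attributed to the frame vanished.

    Assumes fraction < 1.0, i.e. something else is still running.
    """
    return 1.0 / (1.0 - fraction)


# Hypothetical samples: 38 of 100 stacks pass through the call at load:42.
samples = (
    [["main:10", "load:42", "parse:7"]] * 38
    + [["main:12", "render:5"]] * 62
)

f = fraction_on_stack(samples, "load:42")
print(f"on stack {f:.0%} of the time, speedup factor {speedup_if_removed(f):.2f}")
```

Saving 38% of the run time means the program takes 62% as long as before, which is why the speedup factor is 1 / 0.62 rather than 1.38.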
An example of such a profiler is Zoom.
There are more issues to be understood on this subject.
Added: @caf got me hunting for the perf info, and since you included the command-line argument -g, it does collect stack samples. Then you can get a call-tree report. Then if you make sure you're sampling on wall-clock time (so you get wait time as well as CPU time), you've got almost what you need.