Alternatives to gprof

Published 2024-08-12 13:16:09

What other programs do the same thing as gprof?

7 Answers

梨涡少年 2024-08-19 13:16:09


gprof (read the paper) exists for historical reasons.
If you think it will help you find performance problems, it was never advertised as such.
Here's what the paper says:

The profile can be used to compare and assess the costs of
various implementations.

It does not say it can be used to identify the various implementations to be assessed, although it does imply that it could, under special circumstances:

especially if small portions of the program are found to dominate its
execution time.

What about problems that are not so localized?
Do those not matter?
Don't place expectations on gprof that were never claimed for it.
It is only a measurement tool, and only of CPU-bound operations.

Try this instead.
Here's an example of a 44x speedup.
Here's a 730x speedup.
Here's an 8-minute video demonstration.
Here's an explanation of the statistics.
Here's an answer to critiques.

There's a simple observation about programs. In a given execution, every instruction is responsible for some fraction of the overall time (especially call instructions), in the sense that if it were not there, the time would not be spent. During that time, the instruction is on the stack **. When that is understood, you can see that -

gprof embodies certain myths about performance, such as:

  1. that program counter sampling is useful.
    It is only useful if you have an unnecessary hotspot bottleneck such as a bubble sort of a big array of scalar values. As soon as you, for example, change it into a sort using string-compare, it is still a bottleneck, but program counter sampling will not see it because now the hotspot is in string-compare. On the other hand if it were to sample the extended program counter (the call stack), the point at which the string-compare is called, the sort loop, is clearly displayed. In fact, gprof was an attempt to remedy the limitations of pc-only sampling.

  2. that timing functions is more important than capturing time-consuming lines of code.
    The reason for that myth is that gprof was not able to capture stack samples, so instead it times functions, counts their invocations, and tries to capture the call graph. However, once a costly function is identified, you still need to look inside it for the lines that are responsible for the time. If there were stack samples you would not need to look, those lines would be on the samples. (A typical function might have 100 - 1000 instructions. A function call is 1 instruction, so something that locates costly calls is 2-3 orders of magnitude more precise.)

  3. that the call graph is important.
    What you need to know about a program is not where it spends its time, but why. When it is spending time in a function, every line of code on the stack gives one link in the chain of reasoning of why it is there. If you can only see part of the stack, you can only see part of the reason why, so you can't tell for sure if that time is actually necessary.
    What does the call graph tell you? Each arc tells you that some function A was in the process of calling some function B for some fraction of the time. Even if A has only one such line of code calling B, that line only gives a small part of the reason why. If you are lucky enough, maybe that line has a poor reason. Usually, you need to see multiple simultaneous lines to find a poor reason if it is there. If A calls B in more than one place, then it tells you even less.

  4. that recursion is a tricky confusing issue.
    That is only because gprof and other profilers perceive a need to generate a call-graph and then attribute times to the nodes. If one has samples of the stack, the time-cost of each line of code that appears on samples is a very simple number - the fraction of samples it is on. If there is recursion, then a given line can appear more than once on a sample.
    No matter. Suppose samples are taken every N ms, and the line appears on F% of them (singly or not). If that line can be made to take no time (such as by deleting it or branching around it), then those samples would disappear, and the time would be reduced by F%.

  5. that accuracy of time measurement (and therefore a large number of samples) is important.
    Think about it for a second. If a line of code is on 3 samples out of five, then if you could shoot it out like a light bulb, that is roughly 60% less time that would be used. Now, you know that if you had taken a different 5 samples, you might have only seen it 2 times, or as many as 4. So that 60% measurement is more like a general range from 40% to 80%. If it were only 40%, would you say the problem is not worth fixing? So what's the point of time accuracy, when what you really want is to find the problems?
    500 or 5000 samples would have measured the problem with greater precision, but would not have found it any more accurately.

  6. that counting of statement or function invocations is useful.
    Suppose you know a function has been called 1000 times. Can you tell from that what fraction of time it costs? You also need to know how long it takes to run, on average, multiply it by the count, and divide by the total time. The average invocation time could vary from nanoseconds to seconds, so the count alone doesn't tell much. If there are stack samples, the cost of a routine or of any statement is just the fraction of samples it is on. That fraction of time is what could in principle be saved overall if the routine or statement could be made to take no time, so that is what has the most direct relationship to performance.

  7. that samples need not be taken when blocked
    The reasons for this myth are twofold: 1) that PC sampling is meaningless when the program is waiting, and 2) the preoccupation with accuracy of timing. However, for (1) the program may very well be waiting for something that it asked for, such as file I/O, which you need to know, and which stack samples reveal. (Obviously you want to exclude samples while waiting for user input.) For (2) if the program is waiting simply because of competition with other processes, that presumably happens in a fairly random way while it's running.
    So while the program may be taking longer, that will not have a large effect on the statistic that matters, the percentage of time that statements are on the stack.

  8. that "self time" matters
    Self time only makes sense if you are measuring at the function level, not line level, and you think you need help in discerning if the function time goes into purely local computation versus in called routines. If summarizing at the line level, a line represents self time if it is at the end of the stack, otherwise it represents inclusive time. Either way, what it costs is the percentage of stack samples it is on, so that locates it for you in either case.

  9. that samples have to be taken at high frequency
    This comes from the idea that a performance problem may be fast-acting, and that samples have to be frequent in order to hit it. But if the problem costs, say, 20% of a total running time of 10 sec (or whatever), then each sample in that total time will have a 20% chance of hitting it, whether the problem occurs in a single piece like this
    .....XXXXXXXX...........................
    .^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^ (20 samples, 4 hits)
    or in many small pieces like this
    X...X...X.X..X.........X.....X....X.....
    .^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^ (20 samples, 3 hits)
    Either way, the number of hits will average about 1 in 5, no matter how many samples are taken, or how few. (Average = 20 * 0.2 = 4. Standard deviation = +/- sqrt(20 * 0.2 * 0.8) = 1.8.)

  10. that you are trying to find the bottleneck
    as if there were only one. Consider the following execution timeline: vxvWvzvWvxvWvYvWvxvWv.vWvxvWvYvW
    It consists of real useful work, represented by .. There are performance problems vWxYz taking 1/2, 1/4, 1/8, 1/16, 1/32 of the time, respectively. Sampling finds v easily. It is removed, leaving
    xWzWxWYWxW.WxWYW
    Now the program takes half as long to run, and now W takes half the time, and is found easily. It is removed, leaving
    xzxYx.xY
    This process continues, each time removing the biggest, by percentage, performance problem, until nothing to remove can be found. Now the only thing executed is ., which executes in 1/32 of the time used by the original program. This is the magnification effect, by which removing any problem makes the remainder larger, by percent, because the denominator is reduced.
    Another crucial point is that every single problem must be found - missing none of the 5. Any problem not found and fixed severely reduces the final speedup ratio. Just finding some, but not all, is not "good enough".

ADDED: I would just like to point out one reason why gprof is popular - it is being taught,
presumably because it's free, easy to teach, and it's been around a long time.
A quick Google search locates some academic institutions that teach it (or appear to):

berkeley bu clemson
colorado duke earlham fsu indiana mit msu
ncsa.illinois ncsu nyu ou princeton psu
stanford ucsd umd umich utah utexas utk wustl

** With the exception of other ways of requesting work to be done, that don't leave a trace telling why, such as by message posting.

天生の放荡 2024-08-19 13:16:09


Valgrind has an instruction-count profiler with a very nice visualizer called KCacheGrind. As Mike Dunlavey recommends, Valgrind counts the fraction of instructions for which a procedure is live on the stack, although I'm sorry to say it appears to become confused in the presence of mutual recursion. But the visualizer is very nice and light years ahead of gprof.

自控 2024-08-19 13:16:09


Since I didn't see here anything about perf, which is a relatively new tool for profiling the kernel and user applications on Linux, I decided to add this information.

First of all - this is a tutorial about Linux profiling with perf

You can use perf if your Linux kernel is 2.6.32 or newer, or oprofile if it is older. Neither program requires you to instrument your program (as gprof does). However, in order to get the call graph correctly in perf you need to build your program with -fno-omit-frame-pointer. For example: g++ -fno-omit-frame-pointer -O2 main.cpp.

You can see "live" analysis of your application with perf top:

sudo perf top -p `pidof a.out` -K

Or you can record performance data of a running application and analyze them after that:

1) To record performance data:

perf record -p `pidof a.out`

or to record for 10 secs:

perf record -p `pidof a.out` sleep 10

or to record with a call graph:

perf record -g -p `pidof a.out` 

2) To analyze the recorded data

perf report --stdio
perf report --stdio --sort=dso -g none
perf report --stdio -g none
perf report --stdio -g

Or you can record performance data of an application and analyze it afterwards just by launching the application this way and waiting for it to exit:

perf record ./a.out

This is an example of profiling a test program

The test program is in file main.cpp (I will put main.cpp at the bottom of the message):

I compile it in this way:

g++ -m64 -fno-omit-frame-pointer -g main.cpp -L.  -ltcmalloc_minimal -o my_test

I use libtcmalloc_minimal.so since it is compiled with -fno-omit-frame-pointer, while libc malloc seems to be compiled without this option.
Then I run my test program

./my_test 100000000 

Then I record performance data of a running process:

perf record -g  -p `pidof my_test` -o ./my_test.perf.data sleep 30

Then I analyze load per module:

perf report --stdio -g none --sort comm,dso -i ./my_test.perf.data

# Overhead  Command                 Shared Object
# ........  .......  ............................
#
    70.06%  my_test  my_test
    28.33%  my_test  libtcmalloc_minimal.so.0.1.0
     1.61%  my_test  [kernel.kallsyms]

Then load per function is analyzed:

perf report --stdio -g none -i ./my_test.perf.data | c++filt

# Overhead  Command                 Shared Object                       Symbol
# ........  .......  ............................  ...........................
#
    29.30%  my_test  my_test                       [.] f2(long)
    29.14%  my_test  my_test                       [.] f1(long)
    15.17%  my_test  libtcmalloc_minimal.so.0.1.0  [.] operator new(unsigned long)
    13.16%  my_test  libtcmalloc_minimal.so.0.1.0  [.] operator delete(void*)
     9.44%  my_test  my_test                       [.] process_request(long)
     1.01%  my_test  my_test                       [.] operator delete(void*)@plt
     0.97%  my_test  my_test                       [.] operator new(unsigned long)@plt
     0.20%  my_test  my_test                       [.] main
     0.19%  my_test  [kernel.kallsyms]             [k] apic_timer_interrupt
     0.16%  my_test  [kernel.kallsyms]             [k] _spin_lock
     0.13%  my_test  [kernel.kallsyms]             [k] native_write_msr_safe

     and so on ...

Then call chains are analyzed:

perf report --stdio -g graph -i ./my_test.perf.data | c++filt

# Overhead  Command                 Shared Object                       Symbol
# ........  .......  ............................  ...........................
#
    29.30%  my_test  my_test                       [.] f2(long)
            |
            --- f2(long)
               |
                --29.01%-- process_request(long)
                          main
                          __libc_start_main

    29.14%  my_test  my_test                       [.] f1(long)
            |
            --- f1(long)
               |
               |--15.05%-- process_request(long)
               |          main
               |          __libc_start_main
               |
                --13.79%-- f2(long)
                          process_request(long)
                          main
                          __libc_start_main

    15.17%  my_test  libtcmalloc_minimal.so.0.1.0  [.] operator new(unsigned long)
            |
            --- operator new(unsigned long)
               |
               |--11.44%-- f1(long)
               |          |
               |          |--5.75%-- process_request(long)
               |          |          main
               |          |          __libc_start_main
               |          |
               |           --5.69%-- f2(long)
               |                     process_request(long)
               |                     main
               |                     __libc_start_main
               |
                --3.01%-- process_request(long)
                          main
                          __libc_start_main

    13.16%  my_test  libtcmalloc_minimal.so.0.1.0  [.] operator delete(void*)
            |
            --- operator delete(void*)
               |
               |--9.13%-- f1(long)
               |          |
               |          |--4.63%-- f2(long)
               |          |          process_request(long)
               |          |          main
               |          |          __libc_start_main
               |          |
               |           --4.51%-- process_request(long)
               |                     main
               |                     __libc_start_main
               |
               |--3.05%-- process_request(long)
               |          main
               |          __libc_start_main
               |
                --0.80%-- f2(long)
                          process_request(long)
                          main
                          __libc_start_main

     9.44%  my_test  my_test                       [.] process_request(long)
            |
            --- process_request(long)
               |
                --9.39%-- main
                          __libc_start_main

     1.01%  my_test  my_test                       [.] operator delete(void*)@plt
            |
            --- operator delete(void*)@plt

     0.97%  my_test  my_test                       [.] operator new(unsigned long)@plt
            |
            --- operator new(unsigned long)@plt

     0.20%  my_test  my_test                       [.] main
     0.19%  my_test  [kernel.kallsyms]             [k] apic_timer_interrupt
     0.16%  my_test  [kernel.kallsyms]             [k] _spin_lock
     and so on ...

So at this point you know where your program spends time.

And this is main.cpp for the test:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

time_t f1(time_t time_value)
{
  for (int j =0; j < 10; ++j) {
    ++time_value;
    if (j%5 == 0) {
      double *p = new double;
      delete p;
    }
  }
  return time_value;
}

time_t f2(time_t time_value)
{
  for (int j =0; j < 40; ++j) {
    ++time_value;
  }
  time_value=f1(time_value);
  return time_value;
}

time_t process_request(time_t time_value)
{

  for (int j =0; j < 10; ++j) {
    int *p = new int;
    delete p;
    for (int m =0; m < 10; ++m) {
      ++time_value;
    }
  }
  for (int i =0; i < 10; ++i) {
    time_value=f1(time_value);
    time_value=f2(time_value);
  }
  return time_value;
}

int main(int argc, char* argv2[])
{
  int number_loops = argc > 1 ? atoi(argv2[1]) : 1;
  time_t time_value = time(0);
  printf("number loops %d\n", number_loops);
  printf("time_value: %ld\n", (long)time_value );

  for (int i =0; i < number_loops; ++i) {
    time_value = process_request(time_value);
  }
  printf("time_value: %ld\n", time_value );
  return 0;
}

牵你手 2024-08-19 13:16:09


Try OProfile. It is a much better tool for profiling your code. I would also suggest Intel VTune.

The two tools above can narrow down the time spent in a particular line of code, annotate your code, show the assembly, and show how much time particular instructions take. Besides time metrics, you can also query specific counters, i.e. cache hits, etc.

Unlike gprof, you can profile any process/binary running on your system using either of the two.

伴随着你 2024-08-19 13:16:09


Google performance tools include a simple-to-use profiler. Both CPU and heap profilers are available.

呆萌少年 2024-08-19 13:16:09


Take a look at Sysprof.

Your distribution may have it already.

瑶笙 2024-08-19 13:16:09


http://lttng.org/ if you want a high performance tracer
