gprof vs cachegrind profiles

Posted 2024-11-15 03:27:38

While trying to optimize some code, I'm a bit puzzled by differences in the profiles produced by kcachegrind and gprof. Specifically, if I use gprof (compiling with the -pg switch, etc.), I get this:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 89.62      3.71     3.71   204626     0.02     0.02  objR<true>::R_impl(std::vector<coords_t, std::allocator<coords_t> > const&, std::vector<unsigned long, std::allocator<unsigned long> > const&) const
  5.56      3.94     0.23 18018180     0.00     0.00  W2(coords_t const&, coords_t const&)
  3.87      4.10     0.16   200202     0.00     0.00  build_matrix(std::vector<coords_t, std::allocator<coords_t> > const&)
  0.24      4.11     0.01   400406     0.00     0.00  std::vector<double, std::allocator<double> >::vector(std::vector<double, std::allocator<double> > const&)
  0.24      4.12     0.01   100000     0.00     0.00  Wrat(std::vector<coords_t, std::allocator<coords_t> > const&, std::vector<coords_t, std::allocator<coords_t> > const&)
  0.24      4.13     0.01        9     1.11     1.11  std::vector<short, std::allocator<short> >* std::__uninitialized_copy_a<__gnu_cxx::__normal_iterator<std::vector<short, std::alloca

This seems to suggest that I need not bother looking anywhere other than ::R_impl(...).

At the same time, if I compile without the -pg switch and instead run valgrind --tool=callgrind ./a.out, I get something rather different; here is a screenshot of the kcachegrind output:

[screenshot of kcachegrind output]

If I interpret this correctly, it seems to suggest that ::R_impl(...) only takes about 50% of the time, while the other half is spent in linear algebra (Wrat(...), eigenvalues and the underlying LAPACK calls), which was way down in the gprof profile.

I understand that gprof and cachegrind use different techniques, and I would not worry if their results were somewhat different. But here they look very different, and I'm at a loss as to how to interpret them. Any ideas or suggestions?
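
For reference, a minimal sketch of the two workflows being compared (file and output names are placeholders for whatever your build actually uses):

    # gprof: recompile with instrumentation, run, then post-process gmon.out
    g++ -O2 -pg -o a.out main.cpp
    ./a.out                              # writes gmon.out in the current directory
    gprof ./a.out gmon.out > profile.txt

    # callgrind: no -pg needed; -g keeps function names readable in kcachegrind
    g++ -O2 -g -o a.out main.cpp
    valgrind --tool=callgrind ./a.out    # writes callgrind.out.<pid>
    kcachegrind callgrind.out.<pid>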

Comments (2)

负佳期 2024-11-22 03:27:38

You are looking at the wrong column. You have to look at the second column in the kcachegrind output, the one named "self". This is the time spent in that particular subroutine alone, not counting its children. The first column holds the cumulative (inclusive) time (for main it equals 100% of the machine time), and it is not that informative (in my opinion).

Note that from the kcachegrind output you can see that the total time of the process is 53.64 seconds, while the time spent in the subroutine "R_impl" is 46.72 seconds, which is 87% of the total time. So gprof and kcachegrind agree almost perfectly.
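
The same exclusive/inclusive split can also be seen on the command line with callgrind_annotate; a small sketch, where the output file name is whatever your callgrind run actually produced:

    # Exclusive (self) cost per function - the column to compare with gprof's "self seconds"
    callgrind_annotate callgrind.out.<pid>

    # Inclusive cost (a function plus everything it calls), like kcachegrind's first column
    callgrind_annotate --inclusive=yes callgrind.out.<pid>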

鹿港小镇 2024-11-22 03:27:38

gprof is an instrumenting profiler, while callgrind runs your program under Valgrind's instruction-level simulation rather than sampling it. With an instrumenting profiler you get overhead on every function entry and exit, which can skew the profile, particularly if you have relatively small functions which are called many times. Sampling profilers tend to be more accurate in this respect - they slow the overall program execution slightly, but this tends to have the same relative effect on all functions.

Try the free 30-day evaluation of Zoom from RotateRight - I suspect it will give you a profile which agrees more with callgrind than with gprof.
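
One rough way to gauge how much the -pg instrumentation itself distorts the timings (assuming main.cpp stands in for your sources and both builds use the same optimization flags) is simply to time the two builds:

    g++ -O2 -o plain main.cpp
    g++ -O2 -pg -o instrumented main.cpp
    time ./plain
    time ./instrumented    # the extra wall time is largely the per-call profiling hooks,
                           # which hit small, frequently called functions the hardest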
