gprof or kcachegrind: which is the most reliable profiling tool?

Posted 2024-11-15 01:25:19

Profiling some C++ number-crunching code with both gprof and kcachegrind gives similar results for the functions that contribute most to the execution time (50-80% depending on input), but for functions in the 10-30% range the two tools give different results. Does that mean one of them is not reliable? What would you do here?

Comments (2)

鹿港小镇 2024-11-22 01:25:19

gprof is actually quite primitive. Here's what it does.
1) It samples the program counter at a constant rate and records how many samples land in each function (exclusive time).
2) It counts how many times any function A calls any function B.
From that it can find out how many times each function was called in total, and what its average exclusive time was.
To get average inclusive time of each function it propagates exclusive time upward in the call graph.
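
To make that concrete, here is a toy setup (the file name, function names, and iteration counts are invented for the illustration) with the usual gprof workflow in the comments; the flat profile it produces is exactly the "samples per function plus call counts" data described above.

```cpp
// toy.cpp -- hypothetical example of the gprof workflow:
//   g++ -pg -O2 -fno-inline toy.cpp -o toy   # -pg adds the instrumentation,
//                                            # -fno-inline keeps both functions visible
//   ./toy                                    # a normal run writes gmon.out
//   gprof ./toy gmon.out                     # flat profile + call graph
#include <cmath>
#include <cstdio>

double slow_part(int n) {            // most PC samples should land here
    double s = 0;
    for (int i = 1; i <= n; ++i) s += std::sqrt(static_cast<double>(i));
    return s;
}

double fast_part(int n) {            // called just as often, but much cheaper
    double s = 0;
    for (int i = 1; i <= n; ++i) s += i * 0.5;
    return s;
}

int main() {
    double total = 0;
    for (int iter = 0; iter < 2000; ++iter) {
        total += slow_part(100000);  // gprof records the call count and the
        total += fast_part(100000);  // samples whose PC fell inside each function
    }
    std::printf("%f\n", total);      // keep the work from being optimized away
    return 0;
}
```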

If you're expecting this to have some kind of accuracy, you should be aware of some issues.
First, it only counts CPU-time-in-process, meaning it is blind to I/O or other system calls.
Second, recursion confuses it.
Third, the premise that functions always adhere to an average run time, no matter when they are called or who calls them, is very suspect (a concrete case is sketched just after this list of issues).
Fourth, the notion that functions (and their call graph) are what you need to know about, rather than lines of code, is simply a popular assumption, nothing more.
Fifth, the notion that accuracy of measurement is even relevant to finding "bottlenecks" is also just a popular assumption, nothing more.
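
To illustrate the third point, here is a hedged sketch (names and sizes invented): a shared helper whose cost depends entirely on its argument. gprof's call graph only knows call counts and an average per-call time, so with equal call counts it charges roughly half of the helper's time to each caller, even though one of them accounts for nearly all of it. A stack sampler attributes it correctly, because the expensive caller really is on the stack for most of the samples.

```cpp
// sketch.cpp -- hypothetical illustration of the "average time" problem.
#include <cstdio>

// A shared helper whose cost depends entirely on its argument.
double accumulate(int n) {
    double s = 0;
    for (int i = 0; i < n; ++i) s += 1.0 / (i + 1);
    return s;
}

// cheap_caller and heavy_caller call accumulate() the same number of times,
// so a call-count-based split assigns about half of accumulate()'s time to
// each, even though heavy_caller is responsible for ~99.9% of it.
double cheap_caller() {
    double s = 0;
    for (int i = 0; i < 1000; ++i) s += accumulate(100);      // tiny work per call
    return s;
}

double heavy_caller() {
    double s = 0;
    for (int i = 0; i < 1000; ++i) s += accumulate(100000);   // ~1000x more work per call
    return s;
}

int main() {
    std::printf("%f\n", cheap_caller() + heavy_caller());
    return 0;
}
```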

Callgrind can work at the level of lines - that's good. Unfortunately it shares the other problems.

If your goal is to find "bottlenecks" (as opposed to getting general measurements), you should take a look at wall-clock time stack samplers that report percent-by-line, such as Zoom.
The reason is simple but possibly unfamiliar.

Suppose you have a program with a bunch of functions calling each other that takes a total of 10 seconds. Also, there is a sampler that samples, not just the program counter, but the entire call stack, and it does it all the time at a constant rate, like 100 times per second. (Ignore other processes for now.)
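
As an aside, here is a minimal sketch of what such a sampler could look like, assuming a POSIX system with glibc's <execinfo.h>; the handler is not strictly async-signal-safe, so treat it as an illustration of the idea rather than a tool (Zoom, or even repeatedly interrupting the program in a debugger, does the same thing properly).

```cpp
// sampler_sketch.cpp -- illustrative only; assumes POSIX + glibc <execinfo.h>.
#include <execinfo.h>   // backtrace, backtrace_symbols_fd
#include <sys/time.h>   // setitimer, ITIMER_REAL
#include <csignal>
#include <cstdio>

// Signal handler: grab the whole call stack at the moment the timer fired.
// (backtrace/fprintf are not strictly async-signal-safe -- fine for a sketch.)
static void take_sample(int) {
    void* frames[64];
    int depth = backtrace(frames, 64);
    backtrace_symbols_fd(frames, depth, 2);   // dump one stack sample to stderr
    std::fprintf(stderr, "----\n");           // separator between samples
}

int main() {
    std::signal(SIGALRM, take_sample);

    // Fire ~100 times per second of *wall-clock* time (ITIMER_REAL), so
    // samples also land while the program is blocked in I/O or sleeping.
    itimerval tv{};
    tv.it_interval.tv_usec = 10000;   // 10 ms period
    tv.it_value.tv_usec    = 10000;
    setitimer(ITIMER_REAL, &tv, nullptr);

    // Stand-in workload: in the 10-second scenario above you would end up
    // with roughly 1000 stack samples on stderr.
    volatile double acc = 0;
    for (long i = 0; i < 2000000000L; ++i) acc += i * 1e-9;
    return 0;
}
```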

So at the end you have 1000 samples of the call stack.
Pick any line of code L that appears on more than one of them.
Suppose you could somehow optimize that line, by avoiding it, removing it, or passing it off to a really really fast processor.

What would happen to those samples?

Since that line of code L now takes (essentially) no time at all, no sample can hit it, so those samples would just disappear, reducing the total number of samples, and therefore the total time!
In fact the overall time would be reduced by the fraction of time L had been on the stack, which is roughly the fraction of samples that contained it.

I don't want to get too statistical, but many people think you need a lot of samples, because they think accuracy of measurement is important.
It isn't, if the reason you're doing this is to find out what to fix to get speedup.
The emphasis is on finding what to fix, not on measuring it.
Line L is on the stack some fraction F of the time, right?
So each sample has a probability F of hitting it, right? Just like flipping a coin.
There is a theory of this, called the Rule of Succession.
It says that (under simplifying but general assumptions), if you flip a coin N times, and see "heads" S times, you can estimate the fairness of the coin F as (on average) (S+1)/(N+2).
So, if you take as few as three samples, and see L on two of them, do you know what F is? Of course not.
But you do know on average it is (2+1)/(3+2) or 60%.
So that's how much time you could save (on average) by "optimizing away" line L.
And, of course, the stack samples showed you exactly where line L (the "bottleneck"**) is.
Did it really matter that you didn't measure it to two or three decimal places?
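
For anyone who wants to check the arithmetic, here is a tiny sketch using the numbers from the example above (3 samples, line L seen on 2 of them, 10 seconds total). The last line is just the flip side of the saving: if 60% of the time disappears, the remaining 40% means the program runs about 2.5x faster.

```cpp
// succession.cpp -- the (S+1)/(N+2) estimate and what it implies, using the
// numbers from the example above (3 samples, line L seen on 2 of them).
#include <cstdio>

int main() {
    const int    N = 3;               // stack samples taken
    const int    S = 2;               // samples that contained line L
    const double total_seconds = 10;  // total run time from the example

    double F = (S + 1.0) / (N + 2.0);            // Rule of Succession: 0.60
    std::printf("estimated fraction of time on stack: %.0f%%\n", F * 100);
    std::printf("expected saving if L is removed:     %.1f s of %.0f s\n",
                F * total_seconds, total_seconds);
    std::printf("expected overall speedup:            %.2fx\n", 1.0 / (1.0 - F));
    return 0;
}
```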

BTW, this approach is immune to all the other problems mentioned above.

**I keep putting quotes around "bottleneck" because what makes most software slow has nothing in common with the neck of a bottle.
A better metaphor is a "drain" - something that just needlessly wastes time.

情丝乱 2024-11-22 01:25:19

gprof's timing data is statistical (read about it in the "details of profiling" section of the gprof docs).

On the other hand, KCacheGrind uses valgrind which actually interprets all the code.

So KCacheGrind can be "more accurate" (at the expense of more overhead) if the CPU modeled by valgrind is close to your real CPU.

Which one to choose also depends on what type of overhead you can handle. In my experience, gprof adds less runtime overhead (execution time, that is), but it is more intrusive (i.e. -pg adds code to each and every one of your functions). So depending on the situation, one or the other is more appropriate.
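
A hedged illustration of that difference (the hook name varies by platform, and the commands in the trailing comments are the typical invocations rather than an exact recipe): -pg makes the compiler insert a call to the profiling hook at the entry of every function, which is why it is intrusive but adds little runtime cost, while callgrind leaves your build alone and instead runs the binary under valgrind's dynamic instrumentation, which is non-intrusive but much slower.

```cpp
// pg_illustration.cpp -- what "-pg adds code to every function" means,
// conceptually. The hook actually inserted by g++ -pg is called mcount /
// __fentry__ depending on the platform; you never call it yourself.
#include <cstdio>

double dot(const double* a, const double* b, int n) {
    // with -pg, the compiler emits a call to the profiling hook roughly here,
    // at the entry of this (and every other) function
    double s = 0;
    for (int i = 0; i < n; ++i) s += a[i] * b[i];
    return s;
}

int main() {
    double a[4] = {1, 2, 3, 4}, b[4] = {4, 3, 2, 1};
    std::printf("%f\n", dot(a, b, 4));
    return 0;
}

// Typical invocations on a file like this (names are examples):
//   g++ -O2 -pg pg_illustration.cpp -o bench && ./bench && gprof ./bench gmon.out
//   g++ -O2 -g  pg_illustration.cpp -o bench && valgrind --tool=callgrind ./bench
//   kcachegrind callgrind.out.<pid>          # browse the line-level results
```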

For "better" gprof data, run your code longer (and on as wide a range of test data you can). The more you have, the better the measurements will be statistically.
