Recommended open-source profilers
I'm trying to find open-source profilers rather than using one of the commercial profilers I'd have to pay $$$ for. When I searched on SourceForge, I came across these four C++ profilers that I thought were quite promising:
- Shiny: C++ Profiler
- Low Fat Profiler
- Luke Stackwalker
- FreeProfiler
I'm not sure which one of the profilers would be the best one to use in terms of learning about the performance of my program. It would be great to hear some suggestions.
You could try Windows Performance Toolkit. Completely free to use. This blog entry has an example of how to do sample-based profiling.
There's more than one way to do it.
Don't forget the no-profiler method.
Most profilers assume you need 1) high statistical precision of timing (lots of samples), and 2) low precision of problem identification (functions & call-graphs).
Those priorities can be reversed. I.e. the problem can be located to the precise machine address, while cost precision is a function of the number of samples.
Most real problems cost at least 10%, where high precision is not essential.
Example: If something is making your program take 2 times as long as it should, that means there is some code in it that costs 50%. If you take 10 samples of the call stack while it is being slow, the precise line(s) of code will be present on roughly 5 of them. The larger the program is, the more likely the problem is a function call somewhere mid-stack.
It's counter-intuitive, I know.
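The arithmetic in that example can be sketched with a toy simulation (purely illustrative; the 50% time fraction and the 10 samples are the figures from the example above, not measurements of any real program):

```python
import random

random.seed(1)

HOT_FRACTION = 0.5   # the hypothetical hot line accounts for 50% of wall time
NUM_SAMPLES = 10

# Each random pause lands on the hot line with probability equal to
# that line's fraction of total time.
hits = sum(random.random() < HOT_FRACTION for _ in range(NUM_SAMPLES))
print(f"hot line appeared on {hits} of {NUM_SAMPLES} samples")

# Even a couple of hits out of 10 is enough to flag the line for inspection;
# precise measurement of the fraction is not the point - identification is.
```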
NOTE: xPerf is nearly there, but not quite (as far as I can tell). It takes samples of the call stack and saves them - that's good. Here's what I think it needs:
It should only take samples when you want them. As it is, you have to filter out the irrelevant ones.
In the stack view it should show specific lines or addresses at which calls take place, not just whole functions. (Maybe it can do this, I couldn't tell from the blog.)
If you click to get the butterfly view, centered on a single call instruction, or leaf instruction, it should show you not the CPU fraction, but the fraction of stack samples containing that instruction. That would be a direct measure of the cost of that instruction, as a fraction of time. (Maybe it can do this, I couldn't tell.)
So, for example, even if an instruction were a call to file-open or something else that idles the thread, it still costs wall clock time, and you need to know that.
NOTE: I just looked over Luke Stackwalker, and the same remarks apply. I think it is on the right track but needs UI work.
ADDED: Having looked over LukeStackwalker more carefully, I'm afraid it falls victim to the assumption that measuring functions is more important than locating statements. So on each sample of the call stack, it updates the function-level timing info, but all it does with the line-number info is keep track of the min and max line numbers in each function, and the more samples it takes, the farther apart those get. So it basically throws away the most important information - the line-number information. The reason that matters is that if you decide to optimize a function, you need to know which lines in it need work, and those lines were on the stack samples (before they were discarded).
One might object that if the line-number information were retained, it would run out of storage quickly. Two answers: 1) only so many lines show up on the samples, and they show up repeatedly; 2) not so many samples are needed - high statistical precision of measurement has always been assumed to be necessary, but never justified.
I suspect other stack samplers, like xPerf, have similar issues.
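The line-level bookkeeping argued for above can be sketched in a few lines. The stack samples here are made up for illustration; in a real tool they would come from something like StackWalk64:

```python
from collections import Counter

# Hypothetical stack samples: each is a list of (function, line) frames,
# outermost first. These are invented data, standing in for real samples.
samples = [
    [("main", 10), ("load", 52), ("parse", 103)],
    [("main", 10), ("load", 52), ("parse", 107)],
    [("main", 10), ("draw", 210)],
    [("main", 10), ("load", 52), ("parse", 103)],
]

# Cost of a line = fraction of samples on which it appears anywhere
# in the stack, which is what the butterfly view should report.
appearances = Counter()
for stack in samples:
    for frame in set(stack):   # count each line at most once per sample
        appearances[frame] += 1

for (func, line), n in appearances.most_common():
    print(f"{func}:{line}  {n / len(samples):.0%} of samples")
```

Keeping the per-line counts, rather than only per-function min/max line numbers, is exactly the information needed to decide which call sites inside a function to attack.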
It's not open source, but AMD CodeAnalyst is free. It also works on Intel CPUs despite the name. There are versions available for both Windows (with Visual Studio integration) and Linux.
Of those listed, I found Luke Stackwalker to work best - I liked its GUI, and it was easy to get running.
Another similar tool is Very Sleepy - similar functionality, the sampling seems more reliable, and the GUI is perhaps a little harder to use (not as graphical).
After spending some more time with them, I found one quite important drawback. While both try to sample at 1 ms resolution, in practice they do not achieve it because their sampling method (StackWalk64 on the attached process) is far too slow. For my application it takes something like 5-20 ms to get a call stack. Not only does this make the results imprecise, it also skews them, since short call stacks are walked faster and therefore tend to get more hits.
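A rough model of that bias (the per-frame walk cost below is an assumed, illustrative number, not a measurement of StackWalk64):

```python
# Model: the profiler waits INTERVAL_MS between samples, but each stack walk
# itself takes time proportional to stack depth, stretching the real period.
INTERVAL_MS = 1.0
WALK_MS_PER_FRAME = 0.3   # assumed per-frame walk cost, for illustration only

def samples_taken(phase_ms, depth):
    """Samples collected during a phase, given its typical stack depth."""
    period = INTERVAL_MS + WALK_MS_PER_FRAME * depth
    return phase_ms / period

# Two phases, each consuming exactly half of a 1000 ms run.
shallow = samples_taken(500, depth=5)    # e.g. a tight leaf loop
deep = samples_taken(500, depth=40)      # e.g. deep recursive parsing

total = shallow + deep
print(f"shallow phase: {shallow / total:.0%} of samples")
print(f"deep phase:    {deep / total:.0%} of samples")
```

Even though both phases take the same wall time, the shallow-stack phase collects far more samples, which is the skew described above.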
We use LtProf and have been happy with it. Not open source, but only $$, not $$$ :-)