我应该如何解释 OProfile 输出？

发布于 2024-09-29 13:37:11 字数 2406 浏览 4 评论 0原文

我最近尝试使用 OProfile 分析我的应用程序。收集到的数据对我来说已经非常有价值，但我在精确解释它时遇到了困难。在设置并运行 oprofile 后运行我的应用程序后，我生成了报告并得到：

root@se7xeon:src# opreport image:test -l -t 1
溢出统计数据不可用
CPU：P4 / Xeon，带 2 个超线程，速度 3191.66 MHz（估计）
计算单位掩码为 0x01（强制）的 GLOBAL_POWER_EVENTS 事件（处理器未停止的时间）计数 750000
样本 % 符号名称
215522 84.9954 cci::Image::interpolate(unsigned char*, cci::Matrix const&) const
17998 7.0979 cci::Calc::diff(无符号字符常量*，无符号字符常量*)
13171 5.1942 cci::Image::getIRect(unsigned char*, int, int) const
第5519章 2.1765 cci::Image::getFRect(unsigned char*, double, double) const

好的，所以我的插值函数占用了应用程序（太长）执行时间的 84%。那么研究一下似乎是个好主意：

root@se7xeon:src# opannotate image:test --source
[...]

/* cci::Image::interpolate(unsigned char*, cci::Matrix<cci::Point2DF> const&) const total: 215522   84.9954 */  
1392  0.5529 :void Image::interpolate(CCIPixel *output, const Matrix<Point2DF> &inputPoints) const throw()  
4  0.0016 :{  
[...]  
:                col0 = static_cast<int>(point[idx].x);  
3  0.0012 :      col1 = col0+1;  
629  0.2498 :    row0 = static_cast<int>(point[idx].y);  
385  0.1529 :    row1 = row0+1;  
56214 22.3266 :  if (col0 < 0 || col1 >= m_width || row0 < 0 || row1 >= m_height)  
:                {  
:                        col0 = row0 = col1 = row1 = 0;  
:                }

如果我理解正确的话，if 条件占程序执行时间的 22% 以上。左大括号和函数声明似乎需要时间，这是否应该与函数调用开销相对应（“将参数压入堆栈，跳转，弹出参数”序列）？

我更改了源代码中的一些内容（与后来的瓶颈相关，因为我不知道如何优化 if），重新编译，再次运行 oprofile （不要忘记 opcontrol --reset）。现在带注释的代码在同一个地方看起来像这样：

6  0.0024 :     curPx = point[idx].x;  
628  0.2477 :   curPy = point[idx].y;  
410  0.1617 :   col0 = static_cast<int>(curPx);  
57910 22.8380 : col1 = col0+1;  
:               row0 = static_cast<int>(curPy);  
:               row1 = row0+1;  
:               if (col0 < 0 || col1 >= m_width || row0 < 0 || row1 >= m_height)  
:               {  
:                   col0 = row0 = col1 = row1 = 0;  
:               }

这次 if 基本上不花时间（？），最昂贵的指令是“col1 = col0 + 1”，整个耗时块似乎已经转移向上。怎么会这样呢？这是否可以完全信任来查明源中的瓶颈？

我的另一个疑问是，当我设置opcontrol时，我将跟踪事件输入为GLOBAL_POWER_EVENTS，样本数为750k。在输出中，插值函数似乎占据了84%，但其内部记录的样本数仅略高于200k。这甚至不到所要求数量的 50%。我是否可以理解，剩余的约 500k 样本是由输出中未列出的应用程序（内核、Xorg 等）获取的？

原文

I tried profiling my application with OProfile recently. The data gathered is already very valuable to me, but I'm having difficulties with its precise interpretation. After running my app with oprofile set up and running, I generated the report and got:

root@se7xeon:src# opreport image:test -l -t 1
Overflow stats not available
CPU: P4 / Xeon with 2 hyper-threads, speed 3191.66 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 750000
samples % symbol name
215522 84.9954 cci::Image::interpolate(unsigned char*, cci::Matrix const&) const
17998 7.0979 cci::Calc::diff(unsigned char const*, unsigned char const*)
13171 5.1942 cci::Image::getIRect(unsigned char*, int, int) const
5519 2.1765 cci::Image::getFRect(unsigned char*, double, double) const

Okay, so my interpolation function is responsible for 84% of the application's (too long) execution time. Seems a good idea to look into it then:

root@se7xeon:src# opannotate image:test --source
[...]

/* cci::Image::interpolate(unsigned char*, cci::Matrix<cci::Point2DF> const&) const total: 215522   84.9954 */  
1392  0.5529 :void Image::interpolate(CCIPixel *output, const Matrix<Point2DF> &inputPoints) const throw()  
4  0.0016 :{  
[...]  
:                col0 = static_cast<int>(point[idx].x);  
3  0.0012 :      col1 = col0+1;  
629  0.2498 :    row0 = static_cast<int>(point[idx].y);  
385  0.1529 :    row1 = row0+1;  
56214 22.3266 :  if (col0 < 0 || col1 >= m_width || row0 < 0 || row1 >= m_height)  
:                {  
:                        col0 = row0 = col1 = row1 = 0;  
:                }

If I understand correctly, the if conditional is responsible for over 22% of the program's execution time. The opening brace and the function declaration seem to take time, is that supposed to correspond to the function call overhead ("push parameters on stack, jump, pop parameters" sequence)?

I changed some things in the source (related to a later bottleneck because I had no idea how to optimize an if), recompiled, ran through oprofile again (not forgetting opcontrol --reset). Now the annotated code looks like this in the same place:

6  0.0024 :     curPx = point[idx].x;  
628  0.2477 :   curPy = point[idx].y;  
410  0.1617 :   col0 = static_cast<int>(curPx);  
57910 22.8380 : col1 = col0+1;  
:               row0 = static_cast<int>(curPy);  
:               row1 = row0+1;  
:               if (col0 < 0 || col1 >= m_width || row0 < 0 || row1 >= m_height)  
:               {  
:                   col0 = row0 = col1 = row1 = 0;  
:               }

This time the if takes basically no time at all (?), the most expensive instruction is "col1 = col0 + 1", and the whole time-taking block seems to have shifted upwards. How can this be? Can this be trusted at all to pinpoint bottlenecks in the source?

An another point of doubt for me is that when I set up opcontrol, I entered the traced event as GLOBAL_POWER_EVENTS, with the number of samples being 750k. In the output, the interpolation function seems to take 84%, but the number of samples recorded inside it is only a little bit above 200k. That isn't even 50% of the requested number. Am I to understand that the remaining ~500k samples was taken by applications not listed in the output (kernel, Xorg, etc.)?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

物价感观 2024-10-06 13:37:11

在分析优化代码时，您确实不能依赖准确的源代码行。编译器移动的东西太多了。

为了获得准确的图片，您需要查看代码反汇编器的输出。

回复收藏 0 原文

违心° 2024-10-06 13:37:11

OProfile 可以（他们告诉我）获取挂钟时间（而不是 CPU）的堆栈样本，并且它可以为您提供行级百分比。您正在寻找的是大部分堆栈样本中包含的行。

在完成手动调整代码之前我不会打开编译器优化，因为它只是隐藏了一些东西。

当您说插值例程使用了 84% 的时间时，这就引发了一个问题。整个程序总共需要一些时间，对吧？这需要 100% 的时间。如果将程序的时间减少一半，或者加倍，它仍然需要 100% 的时间。 84%的插值是否太多取决于它是否做得多余。

所以我建议你不要问例行程序的百分比是否太多。相反，您会查找需要大量时间的代码行，并询问是否可以对其进行优化。看到区别了吗？优化代码后，它可以大大减少总体运行时间，但它可能仍然占较小总数的很大一部分。
当没有任何内容占据很大比例时，代码就不是最佳的。
当所有占很大比例的事情都无法改进时，代码就是最佳的。

我不关心只给出数字的东西。我想要的是洞察力。例如，如果该例程占 84% 的时间，那么如果您采取 10 个堆栈样本，它将位于其中的 8.4 个上。确切的数字并不重要。重要的是要理解为什么它在那里。真的有必要在里面呆那么久吗？这就是查看堆栈样本可以告诉您的信息。也许您实际上进行插值的频率是必要的两倍？通过分析原因，人们经常发现他们试图加速的例程几乎不需要被调用，也许根本不需要。我无法猜测你的情况。只有通过检查程序状态的洞察力才能告诉您这一点。