Need help with profiling

Posted 2024-09-05 14:42:50

I have a profiling issue - imagine I have the following code...

void main()
{
    well_written_function();
    badly_written_function();
}
void well_written_function()
{
    for (a small number)
    {
        highly_optimised_subroutine();
    }
}
void badly_written_function()
{
    for (a wastefully and unnecessarily large number)
    {
        highly_optimised_subroutine();
    }
}
void highly_optimised_subroutine()
{
    // lots of code
}

If I run this under VTune (or another profiler), it is very hard to spot that anything is wrong. All the hotspots will appear in the section marked "// lots of code", which is already optimised. badly_written_function() will not be highlighted in any way, even though it is the cause of all the trouble.

Is there some feature of VTune that will help me find the problem?

Is there some sort of mode whereby I can find the time taken by badly_written_function() and all of its sub-functions?

Comments (4)

﹏半生如梦愿梦如真 2024-09-12 14:42:52

That's usually known as a "callgraph profile", and I'm fairly sure Visual Studio will do that.
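
For context, the inclusive ("total") time a call-graph profile reports for badly_written_function() is what you would get by timing the call by hand, callees included; the profiler just does this automatically for every caller/callee pair. A minimal hand-rolled sketch of that measurement (using std::chrono and the question's hypothetical functions):

    #include <chrono>
    #include <cstdio>

    void badly_written_function();          // from the question's example code

    int main()
    {
        using clock = std::chrono::steady_clock;
        const auto start = clock::now();
        badly_written_function();           // everything it calls is included
        const std::chrono::duration<double> inclusive = clock::now() - start;
        std::printf("badly_written_function inclusive time: %.3f s\n", inclusive.count());
    }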

柠檬色的秋千 2024-09-12 14:42:52

Rolling your own very simple profiler is not that hard. Insert into main():

int main()
{
    profileCpuUsage(1);                 // start timer #1
    well_written_function();
    profileCpuUsage(2);                 // stop timer #1, and start timer #2
    badly_written_function();
    profileCpuUsage(0);                 // stop timer #2 (slice 0 is the don't-care slice)
    profileCpuUsage(-1);                // print stats for timers #1 and #2
    return 0;
}

where:

#include <assert.h>
#include <stdio.h>

double realElapsedTime(void);                 // defined below

#define NUMBER(a) ((int)(sizeof(a) / sizeof(a)[0]))

void profileCpuUsage(int slice)
{
    static struct {
        int iterations;
        double elapsedTime;
    } slices[30];                             // 0 is a don't care slice

    if (slice < 0) {                          // -1 = print
        if (slices[0].iterations)
            for (slice = 1; slice < NUMBER(slices); slice++)
                printf("Slice %2d  Iterations %7d  Seconds %7.3f\n", slice,
                    slices[slice].iterations, slices[slice].elapsedTime);
    }
    else {
        static int i;                         // = previous slice
        static double t;                      // = previous t1
        const double t1 = realElapsedTime();  // see below for definition
        assert (slice < NUMBER(slices));
        slices[i].iterations  += 1;
        slices[i].elapsedTime += t1 - t;      // i = 0 first time through
        i = slice;
        t = t1;
    }
}

Now admittedly, in your simple example, using this profileCpuUsage() doesn't add much benefit. And it has the disadvantage of requiring you to manually instrument your code by calling profileCpuUsage() at suitable locations.

But advantages include:

  • You can time any code fragment, not just procedures.
  • It is quick to add and delete as you binary-search to find and/or remove code hotspots.
  • It focuses only on the code you are interested in.
  • Portable!
  • KISS

One tricky non-portable thing is to define the function realElapsedTime() so that it provides enough granularity to get valid times. This generally works for me (using the Windows API under CYGWIN):

#include <assert.h>
#include <windows.h>
double realElapsedTime(void)   // <-- granularity about 50 microsec on test machines
{
    static LARGE_INTEGER freq, start;
    LARGE_INTEGER count;
    if (!QueryPerformanceCounter(&count))
        assert(0 && "QueryPerformanceCounter");
    if (!freq.QuadPart) {      // one time initialization
        if (!QueryPerformanceFrequency(&freq))
            assert(0 && "QueryPerformanceFrequency");
        start = count;
    }
    return (double)(count.QuadPart - start.QuadPart) / freq.QuadPart;
}

For straight Unix there is the common approach:

#include <sys/time.h>

double realElapsedTime(void)                      // returns 0 first time called
{
    static struct timeval t0;
    struct timeval tv;
    gettimeofday(&tv, 0);
    if (!t0.tv_sec)
        t0 = tv;
    return tv.tv_sec - t0.tv_sec + (tv.tv_usec - t0.tv_usec) / 1000000.;
}

realElapsedTime() gives wall-clock time, not process time, which is usually what I want.
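
On a modern POSIX system, if finer resolution than gettimeofday() is wanted, clock_gettime() with CLOCK_MONOTONIC is a near drop-in alternative (a sketch, not part of the original answer; very old glibc may need -lrt):

    #include <time.h>

    double realElapsedTime(void)                  // returns 0 first time called
    {
        static struct timespec t0;
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);      // monotonic: immune to wall-clock adjustments
        if (!t0.tv_sec && !t0.tv_nsec)
            t0 = ts;                              // first call becomes time zero
        return (ts.tv_sec - t0.tv_sec) + (ts.tv_nsec - t0.tv_nsec) / 1e9;
    }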

There are also other less-portable methods to achieve finer granularity using RDTSC; see for example http://en.wikipedia.org/wiki/Time_Stamp_Counter, and its links, but I haven't tried these.
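
For what it's worth, compilers that provide the __rdtsc() intrinsic (MSVC via <intrin.h>, GCC/Clang via <x86intrin.h>) make a cycle-granularity counter easy to read; the sketch below is untested here, and the usual TSC caveats (frequency scaling, core migration, tick-to-seconds calibration) still apply:

    #ifdef _MSC_VER
    #include <intrin.h>
    #else
    #include <x86intrin.h>
    #endif

    // Raw timestamp-counter ticks since the first call; one tick per CPU cycle,
    // but ticks-per-second must be calibrated separately (e.g. against realElapsedTime()).
    unsigned long long elapsedTicks(void)
    {
        static unsigned long long start;
        const unsigned long long now = __rdtsc();
        if (!start)
            start = now;
        return now - start;
    }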

Edit: ravenspoint's very nice answer seems to be not too dissimilar from mine. His answer uses nice descriptive strings, though, rather than just ugly numbers, which often frustrated me. But that can be fixed with only about a dozen extra lines (although this almost doubles the line count!).

Note that we want to avoid any usage of malloc(), and I'm even a bit dubious about strcmp(). So the number of slices is never increased, and hash collisions are simply flagged rather than resolved: the human profiler can fix this by manually bumping the number of slices up from 30, or by changing the offending description. Untested:

#include <string.h>                         // for strlen, strcpy, strcmp

static unsigned gethash(const char *str)    // "djb2", for example
{
    unsigned c, hash = 5381;
    while ((c = *str++))
        hash = ((hash << 5) + hash) + c;    // hash * 33 + c 
    return hash;
}

void profileCpuUsage(const char *description)
{
    static struct {
        int iterations;
        double elapsedTime;
        char description[20];               // added!
    } slices[30];

    if (!description) {
        // print stats, but using description, mostly unchanged...
    }
    else {
        const int slice = gethash(description) % NUMBER(slices);
        if (!slices[slice].description[0]) { // if new slice
            assert(strlen(description) < sizeof slices[slice].description);
            strcpy(slices[slice].description, description);
        }
        else if (!!strcmp(slices[slice].description, description)) {
            strcpy(slices[slice].description, "!!hash conflict!!");
        }
        // remainder unchanged...
    }
}
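
Usage would then mirror the numeric version, with self-describing labels instead of slice numbers; a sketch, assuming the "remainder unchanged" parts keep the numeric version's semantics (each call charges the time since the previous call to the previous label, and NULL prints):

    int main()
    {
        profileCpuUsage("well_written_function");   // open the first labelled region
        well_written_function();
        profileCpuUsage("badly_written_function");  // close it and open the next
        badly_written_function();
        profileCpuUsage("(shutdown)");              // close the last region into a throwaway slice
        profileCpuUsage(NULL);                      // NULL = print the report
        return 0;
    }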

And another point is that typically you'll want to disable this profiling for release versions; this also applies to ravenspoint's answer. This can be done by the trick of using an evil macro to define it away:

#define profileCpuUsage(foo)                // = nothing

If this is done, you will of course need to add parentheses to the definition to disable the disabling macro:

void (profileCpuUsage)(const char *description)...
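
Concretely, the switch might be wired up like this (a sketch; NDEBUG is used as the conventional release flag, but any project-specific macro works):

    // profile.h (sketch)
    #ifdef NDEBUG                                    // release build: the calls vanish entirely
    #define profileCpuUsage(description)             // function-like macro expands to nothing
    #else
    void profileCpuUsage(const char *description);   // normal declaration for debug builds
    #endif

    // profile.c (sketch): the parentheses stop the macro, if defined, from eating the definition
    void (profileCpuUsage)(const char *description)
    {
        (void)description;                           // real body as shown above
    }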

似狗非友 2024-09-12 14:42:52

May I suggest my own open source profiler raven::set::cRunWatch? It is designed for exactly this problem and works on Windows using Visual Studio 2008 Standard Edition, so you do not need to pay for the version that has the profiler included.

I have taken your code, rearranged it slightly so that it compiles without forward declarations, and added the necessary calls to cRunWatch:

// RunWatchDemo.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"

void highly_optimised_subroutine()
{
    raven::set::cRunWatch runwatch("highly_optimised_subroutine");
    Sleep( 2 );
}


void badly_written_function()
{
    raven::set::cRunWatch runwatch("badly_written_function");
    for (int k = 1; k < 1000; k++ )
    {
        highly_optimised_subroutine();
    }
}

void well_written_function()
{
    raven::set::cRunWatch runwatch("well_written_function");
    for (int k = 1; k < 10; k++ )
    {
        highly_optimised_subroutine();
    }
}


int _tmain(int argc, _TCHAR* argv[])
{
    raven::set::cRunWatch::Start();

    well_written_function();
    badly_written_function();

    raven::set::cRunWatch::Report();

    return 0;
}

When run, this produces the output:

raven::set::cRunWatch code timing profile
                    Scope   Calls       Mean (secs)     Total
highly_optimised_subroutine     1008    0.002921        2.944146
   badly_written_function        1      2.926662        2.926662
    well_written_function        1      0.026239        0.026239

This shows badly_written_function to be a very close second in total time, and therefore the culprit.

You can obtain cRunWatch from here. You will recognize the sample code in the User Guide. :-)
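
For readers who cannot pull in the library, the core idea (an RAII object whose constructor records the start time and whose destructor charges the scope's elapsed time to a named counter) can be sketched in standard C++; this illustrates the pattern only and is not cRunWatch's actual implementation:

    #include <chrono>
    #include <cstdio>
    #include <map>
    #include <string>

    class cScopeTimer {                      // hypothetical stand-in, not raven::set::cRunWatch
    public:
        explicit cScopeTimer(const std::string& name)
            : myName(name), myStart(std::chrono::steady_clock::now()) {}
        ~cScopeTimer() {                     // fires when the scope it guards ends
            const std::chrono::duration<double> d = std::chrono::steady_clock::now() - myStart;
            stats()[myName].calls += 1;
            stats()[myName].total += d.count();
        }
        static void Report() {
            std::printf("%30s %8s %12s\n", "Scope", "Calls", "Total (secs)");
            for (const auto& s : stats())
                std::printf("%30s %8d %12.6f\n", s.first.c_str(), s.second.calls, s.second.total);
        }
    private:
        struct Stat { int calls = 0; double total = 0; };
        static std::map<std::string, Stat>& stats() {
            static std::map<std::string, Stat> m;    // lazily-created shared statistics table
            return m;
        }
        std::string myName;
        std::chrono::steady_clock::time_point myStart;
    };

A scope is then timed simply by declaring one of these at the top of the function, exactly as the cRunWatch calls do in the example above.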

歌枕肩 2024-09-12 14:42:52

Generally, this is a case where you want to observe a function's total time, as opposed to its self time, to make sure that the time you are looking at includes the time spent in the functions it calls.

In VTune, I would recommend using the Top-down tab for that. Or, even better, if you are using the latest update, try out the new experimental Caller-Callee view. You can get details on it here: http://software.intel.com/en-us/forums/topic/376210. It gives a flat list of functions with their total times, so that you can see which subtrees of your program are the most time-consuming.
