Measuring the execution time of selected loops



I want to measure the running times of selected loops in a C program, so as to see what percentage of the total execution time of the program (on Linux) is spent in these loops. I should be able to specify the loops whose performance should be measured. Over the last few days I have tried several tools (VTune, HPCToolkit, OProfile), and none of them seems to do this. They all find the performance bottlenecks and show the time only for those, because these tools only record times above a threshold (~1 ms). So if a loop takes less time than that, its execution time won't be reported.

The basic block counting feature of gprof depends on a feature of older compilers that is no longer supported.

I could manually write a simple timer using gettimeofday or something similar, but in some cases it won't give accurate results. For example:

for (int i = 0; i < 1000; ++i)
{
    for (int j = 0; j < N; ++j)
    {
        // do some work here
    }
}

Here I want to measure the total time spent in the inner loop, so I would have to put a call to gettimeofday inside the outer loop. gettimeofday itself would then get called 1000 times, which introduces its own overhead, and the result would be inaccurate.
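For concreteness, here is a minimal sketch of the manual-timer approach described above, using clock_gettime(CLOCK_MONOTONIC) rather than gettimeofday for higher resolution. The loop bound N and the do_work() body are made-up placeholders, not code from the question; the repeated timer calls inside the outer loop are exactly the overhead being worried about:

#include <stdio.h>
#include <time.h>

#define N 100                        /* placeholder inner-loop bound */

static void do_work(void) { }        /* stand-in for the real work */

int main(void)
{
    struct timespec t0, t1;
    double inner_seconds = 0.0;

    for (int i = 0; i < 1000; ++i)
    {
        clock_gettime(CLOCK_MONOTONIC, &t0);   /* called once per outer pass */
        for (int j = 0; j < N; ++j)
            do_work();
        clock_gettime(CLOCK_MONOTONIC, &t1);   /* and once more here */
        inner_seconds += (double)(t1.tv_sec - t0.tv_sec)
                       + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    }
    printf("inner loop total: %f s\n", inner_seconds);
    return 0;
}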


Comments (3)

孤单情人 2024-09-07 10:51:39


Unless you have an in-circuit emulator or break-out box around your CPU, there's no such thing as timing a single loop or a single instruction. You need to bulk up your test runs to something that takes at least several seconds each in order to reduce error due to other things going on in the CPU, OS, etc.

If you want to find out exactly how much time a particular loop takes to execute, and it takes less than, say, 1 second, you're going to need to artificially increase the number of iterations in order to get a number that is above the "noise floor". You can then divide that number by the artificially inflated iteration count to get a figure that represents how long one pass through your target loop takes.

If you want to compare the performance of different loop styles or techniques, the same thing holds: you're going to need to increase the number of iterations or passes through your test code so that what you're interested in dominates the time slice you're measuring.

This is true whether you measure the elapsed time of your test using sub-millisecond high-performance counters provided by the CPU, the system date-time clock, or a wall clock.

Otherwise, you're just measuring white noise.
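A minimal sketch of that advice, assuming a made-up REPS count and a target_loop() stand-in: inflate the run until it takes seconds, time the whole thing with just two timer calls, then divide back down:

#include <stdio.h>
#include <time.h>

#define REPS 10000000L               /* inflate until the run takes seconds */

static void target_loop(void)        /* stand-in for the loop under test */
{
    /* ... */
}

int main(void)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long r = 0; r < REPS; ++r)
        target_loop();
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double total = (double)(t1.tv_sec - t0.tv_sec)
                 + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("total: %f s, per pass: %g s\n", total, total / REPS);
    return 0;
}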

[浮城] 2024-09-07 10:51:39


Typically, if you want to measure the time spent in the inner loop, you'll put the time-getting routines outside of the outer loop and then divide by the (outer) loop count. That is, if you expect the time of the inner loop to be relatively constant across iterations.

Any profiling instructions incur their own overhead, but presumably the overhead is the same regardless of where they are inserted, so "it all comes out in the wash." Presumably you're looking for spots where there are considerable differences between the runtimes of two compared processes, in which case a pair of function calls like this won't be an issue (you need one at the "end" too, to get the time delta), since one routine will be 2x or more as costly as the other.

Most platforms offer some sort of higher-resolution timer too, although the one we use here is hidden behind an API so that the "client" code is cross-platform; I'm sure with a little looking you can turn one up. Even then, you're unlikely to get better than 1 ms accuracy, so it's preferable to run the code several times in a row and time the whole run (then divide by the loop count, naturally).
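As a minimal sketch of this suggestion, using the loop shape from the question (N and the loop body are placeholders): only two gettimeofday calls are made, outside the outer loop, and the result is divided by the outer loop count:

#include <stdio.h>
#include <sys/time.h>

#define N 100                        /* placeholder inner-loop bound */

int main(void)
{
    struct timeval t0, t1;
    volatile double sink = 0.0;      /* keeps the work from being optimized away */

    gettimeofday(&t0, NULL);         /* timer called only twice in total */
    for (int i = 0; i < 1000; ++i)
        for (int j = 0; j < N; ++j)
            sink += (double)i * j;   /* stand-in for the real work */
    gettimeofday(&t1, NULL);

    double total = (double)(t1.tv_sec - t0.tv_sec)
                 + (double)(t1.tv_usec - t0.tv_usec) / 1e6;
    printf("inner loop: %g s per outer iteration\n", total / 1000);
    return 0;
}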

稳稳的幸福 2024-09-07 10:51:39


I'm glad you're looking for percentage, because that's easy to get. Just get it running. If it runs quickly, put an outer loop around it so it takes a good long time. That won't affect the percentages. While it's running, get stackshots. You can do this with Ctrl-Break in gdb, or you can use pstack or lsstack. Just look to see what percentage of stackshots display the code you care about.

Suppose the loops take some fraction of the time, like 0.2 (20%), and you take N = 20 samples. Then the number of samples that show them will average 20 * 0.2 = 4, and the standard deviation of that count will be sqrt(20 * 0.2 * 0.8) = sqrt(3.2) ≈ 1.8, so if you want more precision, take more samples. (I personally think precision is overrated.)
