如何做好复杂功能的基准测试?

发布于 2024-10-09 12:25:46 字数 163 浏览 7 评论 0原文

我即将开始对 C 中的一组复杂函数进行非常详细的基准测试。这是“科学级别”的细节。我想知道,进行认真的基准测试的最佳方法是什么?我正在考虑运行它们,例如,每个运行 10 次,平均计时结果并给出标准开发,例如,仅使用 。为了获得良好的基准,你们会做什么?

I am about to embark in very detailed benchmarking of a set of complex functions in C. This is "science level" detail. I'm wondering, what would be the best way to do serious benchmarking? I was thinking about running them, say, 10 times each, averaging the timing results and give the standard dev, for instance, just using <time.h>. What would you guys do to obtain good benchmarks?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

七色彩虹 2024-10-16 12:25:46

当所讨论的分布近似正态时,报告平均值和标准差可以很好地描述分布。然而,计算性能测量却很少如此。相反,性能测量往往更接近泊松分布。这是有道理的,因为计算机上的随机事件不会导致程序运行得更快。本质上,所有测量噪声都在于发生了多少导致其速度减慢的随机事件。 (相比之下,正态分布根本没有直观意义;它需要相信程序在负时间内完成的概率不为零)。

有鉴于此,我发现报告程序多次运行的最短时间而不是平均值最有用;分布中的噪声通常是测量系统的噪声,而不是有关算法的有意义的信息。对于具有提前退出条件和其他快捷方式的复杂算法,您需要更加小心,但每次运行处理代表性输入平衡的多次运行的最小值通常效果很好。

“每次 10 次”对我来说听起来很少迭代。我通常会执行数千次(或更多,取决于功能/系统)的运行,除非这是完全不可行的。至少,您需要确保运行计时足够长的时间,以摆脱对系统状态的任何依赖,其中一些状态可能会在相当大的时间粒度上发生变化。

您应该注意的另一件事是,基本上每个系统都有一个可用的特定于平台的计时器,该计时器比可用的 准确得多。找出您的目标平台上的内容并使用它。

Reporting an average and standard deviation gives a good description of a distribution when the distribution in question is approximately normal. However, this is rarely true of computational performance measurements. Instead, performance measurements tend to more closely resemble a poisson distribution. This makes sense, because not many random events on a computer will cause a program to go faster; essentially all of the measurement noise is in how many random events occur that cause it to slow down. (A normal distribution, by contrast, makes no intuitive sense at all; it would require the belief that a program has a non-zero probability of finishing in negative time).

In light of this, I find it most useful to report the minimum time over many runs of a program, rather than the average; the noise in the distribution is typically noise of the measuring system, rather than meaningful information about the algorithm. For complex algorithms that have early out conditions, and other shortcuts, you need to be a little more careful, but the minimum of many runs where each run handles a representative balance of inputs usually works well.

"10 times each" sounds like very few iterations to me. I generally do something on the order of thousands (or more, depending on the function/system) of runs unless that's completely infeasible. At a bare minimum, you need to make sure that you run the timing for sufficiently long as to shake out any dependence on system state, some of which may change at fairly large time granularity.

The other thing you should be aware of is that essentially every system has a platform-specific timer available that is much more accurate than what is available <time.h>. Find out what it is on your target platform[s] and use it instead.

窗影残 2024-10-16 12:25:46

我假设您正在查看程序中的纯算法计算基准测试,并且没有可能需要不可预测的时间的用户输入或输出。
现在,对于纯粹的数字处理程序,您的结果可能会根据程序实际运行的时间而有所不同,这将受到系统中其他正在进行的活动的影响。可能还有其他因素可供您选择忽略,具体取决于所需的准确度级别,即缓存未命中造成的影响、通过内存层次结构的不同访问时间”
其中一种方法是按照您的建议计算多次运行的平均值。
或者您可以尝试查看汇编代码并查看生成的指令。然后根据处理器获取这些指令的周期计数。根据您要进行基准测试的代码量,此方法可能不实用。如果您特别关注内存层次结构的影响,那么您可能需要非常仔细地控制执行环境,即加载程序的位置、加载数据的位置等。但正如我所提到的,根据所需的精度,您可能会吸收由于内存而引起的变化统计变异中的层次结构”。
您可能需要仔细设计函数的测试输入,以确保路径覆盖范围,并且可以选择将性能统计数据发布为测试输入的函数。这将显示函数在输入范围内的行为方式

I am assuming you are looking at benchmarking pure Algorithmic computation in your program and there is no user input or output which can take unpredictable time.
Now for purely number crunching programs, your results could vary based on the time your program actually runs which will be impacted by other ongoing activities in the system. There could be other factor which you may choose to ignore depending upon level of accuracy desired i.e. impact due to cache miss, different access time through the memory hierarchy"
One of the methods is as you suggested calculation average over a number of runs.
Or you could try to look at the assembly code and see the instructions generated. And then based on the processor get the cycle count for these instructions. This method may not be practical depending on the amount of code you are looking to benchmark. If you are particular about memory hierarchy impact then you may want to control execution environment very carefully i.e. where program is loaded, where its data is loaded etc. But as I mentioned depending on the accuracy desired, you may absorb the variation caused due to memory hierarchy in you statistical variation" .
You may need to carefully design the test input for you functions to ensure the path coverage and may choose to publish statistics of performance as a function of test input. This will show how function behaves across range of inputs

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文