Benchmarking considerations and deterministic data collection

Posted on 2024-11-24 04:31:50

I am writing a C++ benchmarking program that times a number of function calls. The functions are called repeatedly, and each timing is recorded for statistical analysis later. The functions must run simultaneously on multiple threads, so to keep the benchmark accurate and fair it runs on a real-time OS with the scheduling behavior under control. The following are my concerns:

Are there deterministic ways of collecting the timing data? I have looked at printf and stringstream, but neither seems to behave deterministically because of their memory allocation and buffering. For the same reason they do not run in O(1) either, am I right? Currently I am using a large char array and a custom strcat function so that each time value can be collected in O(1). The array is printed at the end of the test, once all data has been collected.

I am using clock_gettime for timings and clock_getres gives me a resolution of 1ns. Can this value be trusted?

Am I doing things right so far, and are there any other issues that I should be aware of when writing the benchmark?

Comments (2)

乜一 2024-12-01 04:31:50

Calling high-frequency timers and writing samples into an output stream is a perfectly sensible way to get performance data. But there are a few tricky gotchas to be careful of.

  • Indeed you shouldn't use printf and stringstream -- not only because their execution time is variable and poorly defined, but also because they're just darn slow, especially if you're formatting your perf data into strings every microsecond! It's much better to write binary data into a preallocated buffer, like an array of structures, and then format them later after your test is done. That will be faster and give you a more consistent write overhead.
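  • clock_gettime with a high-resolution clock (e.g. CLOCK_PROCESS_CPUTIME_ID) should be reliable if the person who wrote your kernel wasn't a dunce. Bear in mind that CLOCK_PROCESS_CPUTIME_ID counts CPU time for the whole process, so in a multithreaded benchmark CLOCK_THREAD_CPUTIME_ID (per-thread CPU time) or CLOCK_MONOTONIC (elapsed time) is usually the better fit. You can look into the Performance Application Programming Interface (PAPI) library if you want to query the CPU timers directly, but that shouldn't be necessary.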
  • Multithreading can be inherently chaotic (in the determinism sense) because the threads are fighting each other for CPU cache and memory bandwidth. You can get stochastically varying results depending on whether simultaneously scheduled threads happen to be touching the same memory, or are evicting each other's work from data caches all the time -- and this will vary from run to run depending on exactly how the data is laid out in memory and which threads are running. But that's fine: lots of processes in engineering are stochastic. Just run your benchmark many times and get a statistically significant average and deviation for your perf numbers.

Or, if you truly need to have 100% determinism, you'll need to ensure that your threads schedule in the same order, run for the same quanta, and put their data in the same memory addresses for each run.
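As a rough sketch of the "binary data in a preallocated buffer" advice from the first bullet, something along these lines could work. The names `Sample`, `SampleBuffer`, `now_ns`, `summarize` and `dump` are made up for illustration, and `CLOCK_MONOTONIC` is an assumed choice of clock, not something the question specifies. The idea is that recording a sample is a single array store with no allocation and no formatting, and the mean and deviation mentioned in the last bullet are computed only after the run.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <time.h>
#include <vector>

// One raw sample: start and end timestamps in nanoseconds (hypothetical layout).
struct Sample {
    std::int64_t start_ns;
    std::int64_t end_ns;
};

// Read a POSIX clock and convert the timespec to nanoseconds.
// CLOCK_MONOTONIC is an assumption; pick the clock that suits the benchmark.
inline std::int64_t now_ns(clockid_t clock = CLOCK_MONOTONIC) {
    timespec ts{};
    clock_gettime(clock, &ts);
    return static_cast<std::int64_t>(ts.tv_sec) * 1000000000LL + ts.tv_nsec;
}

// Fixed-capacity recorder. All storage is allocated (and zero-filled)
// in the constructor, before any measurement starts.
class SampleBuffer {
public:
    explicit SampleBuffer(std::size_t capacity) : samples_(capacity), count_(0) {
        assert(capacity > 0);
    }

    // One bounds check and one store; never allocates or formats.
    // Returns false once the buffer is full instead of growing.
    bool record(std::int64_t start_ns, std::int64_t end_ns) {
        if (count_ == samples_.size()) return false;
        samples_[count_++] = Sample{start_ns, end_ns};
        return true;
    }

    std::size_t size() const { return count_; }
    const Sample& operator[](std::size_t i) const { return samples_[i]; }

private:
    std::vector<Sample> samples_;
    std::size_t count_;
};

struct Stats {
    double mean_ns;
    double stddev_ns;  // sample standard deviation (divides by n - 1)
};

// Post-processing: run this only after the measurement phase is over.
Stats summarize(const SampleBuffer& buf) {
    const std::size_t n = buf.size();
    if (n == 0) return Stats{0.0, 0.0};
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        sum += static_cast<double>(buf[i].end_ns - buf[i].start_ns);
    const double mean = sum / static_cast<double>(n);
    double sq = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        const double d = static_cast<double>(buf[i].end_ns - buf[i].start_ns) - mean;
        sq += d * d;
    }
    const double stddev = n > 1 ? std::sqrt(sq / static_cast<double>(n - 1)) : 0.0;
    return Stats{mean, stddev};
}

// Formatting to text also happens only after the test is done.
void dump(const SampleBuffer& buf, std::FILE* out) {
    for (std::size_t i = 0; i < buf.size(); ++i)
        std::fprintf(out, "%lld\n",
                     static_cast<long long>(buf[i].end_ns - buf[i].start_ns));
}
```

In a multithreaded run, giving each thread its own `SampleBuffer` avoids any locking on the recording path; a typical loop would call `now_ns()` before and after the function under test, pass both values to `record`, and call `summarize` and `dump` once all threads have joined.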

清风疏影 2024-12-01 04:31:50

Do not use big-O notation for real-life performance considerations.

That said, to the rest of the question:

Gathering the performance data will itself take some time (O(1) can still be a significant amount of time; it only means the cost does not depend on your data). You need to make it as efficient as possible.

That means:

  1. Don't use printf and the like; instead write to a dedicated memory area from which you'll extract the data later.

  2. For the same reason, don't use strcat; use structs of binary data instead, and parse them at the end when you're done.

  3. Instead of measuring each call, consider measuring averages (i.e. time every 1000 calls rather than each one, and divide to get the approximate cost of a single call). That cuts your measurement overhead by roughly the same factor. It is not always possible, but consider it.

  4. clock_gettime can usually be trusted, but it depends on your OS and hardware. Check them: sometimes the hardware clock resolution is not as fine as you'd wish.
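Points 3 and 4 can be sketched roughly as follows. The helper names (`to_ns`, `reported_resolution_ns`, `observed_step_ns`, `time_batched_ns`) are invented for this example, and `CLOCK_MONOTONIC` is an assumed default. The second helper compares what clock_getres reports with the smallest step actually seen between two consecutive readings, which is one simple way to check whether a reported 1 ns resolution is plausible on a given machine.

```cpp
#include <cassert>
#include <cstdint>
#include <time.h>

// Convert a timespec to nanoseconds.
inline std::int64_t to_ns(const timespec& ts) {
    return static_cast<std::int64_t>(ts.tv_sec) * 1000000000LL + ts.tv_nsec;
}

// Resolution as reported by the OS, or -1 if clock_getres fails.
std::int64_t reported_resolution_ns(clockid_t clock) {
    timespec res{};
    if (clock_getres(clock, &res) != 0) return -1;
    return to_ns(res);
}

// Smallest non-zero difference seen between consecutive readings.
// This includes the cost of the clock_gettime call itself, so it is an
// upper bound on the usable granularity. Returns -1 if attempts <= 0.
std::int64_t observed_step_ns(clockid_t clock, int attempts) {
    std::int64_t best = -1;
    for (int i = 0; i < attempts; ++i) {
        timespec a{}, b{};
        clock_gettime(clock, &a);
        do {
            clock_gettime(clock, &b);
        } while (to_ns(b) == to_ns(a));  // spin until the clock advances
        const std::int64_t step = to_ns(b) - to_ns(a);
        if (best < 0 || step < best) best = step;
    }
    return best;
}

// Time `batch` back-to-back calls of fn and return the average cost per
// call in nanoseconds, so the two clock reads are paid once per batch.
// Caveat: if fn has no observable side effects the compiler may remove it.
template <typename F>
double time_batched_ns(F&& fn, int batch, clockid_t clock = CLOCK_MONOTONIC) {
    assert(batch > 0);
    timespec start{}, end{};
    clock_gettime(clock, &start);
    for (int i = 0; i < batch; ++i) fn();
    clock_gettime(clock, &end);
    return static_cast<double>(to_ns(end) - to_ns(start)) / batch;
}
```

If `observed_step_ns` comes back much larger than `reported_resolution_ns`, individual calls shorter than that step cannot be timed one by one, and batching as in `time_batched_ns` is the more meaningful measurement. The trade-off is that batching hides per-call variation, so it does not help when the distribution of individual call times is what you are after.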
