Benchmarking CPU-bound algorithms / implementations

Posted on 2024-12-23 04:31:50

Let's say I'm writing my own StringBuilder in a compiled language (e.g. C++).

What is the best way to measure the performance of various implementations? Simply timing a few hundred thousand runs yields highly inconsistent results: the timings can differ by as much as 15% from one batch to the next, making it impossible to accurately assess improvements whose gains are smaller than that.

I've done the following:

  1. Disable SpeedStep
  2. Use RDTSC for timing
  3. Run the process with realtime priority
  4. Set the affinity to a single CPU core

This stabilized the results somewhat. Any other ideas?

Comments (3)

锦爱 2024-12-30 04:31:50

I have achieved 100% consistent results in this manner:

  1. Set up Bochs with MS-DOS.
  2. Set up your toolchain to target MS-DOS
    — or —

    1. Set up your toolchain to target 32-bit Windows
    2. Install the HX-DOS extender in Bochs.
    3. If necessary, hack your toolkit's standard library / runtime and stub out/remove features requiring Windows APIs not implemented in HX-DOS. The extender will print a list of unimplemented APIs when you attempt to run the program.
  3. Reduce the number of cycles in your benchmark by a few orders of magnitude.
  4. Wrap the benchmark code with assembler cli / sti instructions (note that the binary won't run on modern OSes after this change).
  5. If you haven't already, make your benchmark use rdtsc deltas for timing. The samples should be taken within the cli / sti instructions.
  6. Run it in Bochs!

Bochs screenshot

The result seems to be completely deterministic, but is not an accurate assessment of overall performance (see the discussion under Osman Turan's answer for details).
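
As a rough illustration of step 5, taking rdtsc deltas might look like the sketch below. It assumes GCC or Clang on x86, where `__rdtsc()` comes from `<x86intrin.h>`; on other targets it falls back to `std::chrono::steady_clock`, which is a substitution made for this sketch and not part of the method above. `read_ticks`, `ticks_for` and `build_string` are hypothetical names, and the cli / sti wrapping appears only as comments because those instructions are privileged and fault in user mode on a modern OS.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <string>
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>  // __rdtsc() on GCC/Clang targeting x86
#endif

// Read a tick counter: the TSC on x86, a steady clock elsewhere (assumption).
inline std::uint64_t read_ticks() {
#if defined(__x86_64__) || defined(__i386__)
    return __rdtsc();
#else
    return static_cast<std::uint64_t>(
        std::chrono::steady_clock::now().time_since_epoch().count());
#endif
}

// Hypothetical helper: ticks elapsed while running `work` once.
template <typename F>
std::uint64_t ticks_for(F&& work) {
    // In the DOS setup, `cli` would be issued here (privileged; omitted).
    const std::uint64_t start = read_ticks();
    work();
    const std::uint64_t stop = read_ticks();
    // ... and the matching `sti` here.
    return stop - start;
}

// Hypothetical workload standing in for the StringBuilder under test.
inline std::size_t build_string(std::size_t n) {
    std::string s;
    for (std::size_t i = 0; i < n; ++i) {
        s += 'x';
    }
    assert(s.size() == n);
    return s.size();
}
```

A call such as `ticks_for([&] { sink = build_string(10000); })` with a `volatile` sink keeps the compiler from discarding the work. Note that rdtsc is not a serializing instruction, so very short regions may need a fence around the two reads.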


As a bonus tip, here's an easy way to share files with Bochs (so you don't have to unmount/rebuild/remount the floppy image every time):

On Windows, Bochs will lock the floppy image file, but the file is still opened in shared-write mode. This means that you can't overwrite the file, but you can write to it. (I think *nix OSes might cause overwriting to create a new file, as far as file descriptors are concerned.) The trick is to use dd. I had the following batch script set up:

... benchmark build commands here ...
copy /Y C:\Path\To\Benchmark\Project\test2dos.exe floppy\test2.exe
bfi -t=288 -f=floppysrc.img floppy
dd if=floppysrc.img of=floppy.img

bfi is Bart's Build Floppy Image.

Then, just mount floppy.img in Bochs.


Bonus tip #2: To avoid having to manually start the benchmark every time in Bochs, put an empty go.txt file in the floppy directory, and run this batch in Bochs:

@echo off
A:
:loop
choice /T:y,1 > nul
if not exist go.txt goto loop
del go.txt
echo ---------------------------------------------------
test2
goto loop

It will start the test program every time it detects a fresh floppy image. This way, you can automate a benchmark run in a single script.


Update: this method is not very reliable. Sometimes the timings would change by as much as 200% just from reordering some tests (these timing changes were not observed when running on real hardware, using the methods described in the original question).

带刺的爱情 2024-12-30 04:31:50

It's really hard to measure a piece of code precisely. For such requirements, I recommend having a look at Agner Fog's test suite. With it, you can measure clock cycles and collect some other important factors (such as cache misses and branch mispredictions).

I also recommend having a look at the PDF document on Agner's site. It is invaluable for making this kind of micro-optimization possible.

As a side note, actual performance is not a function of "clock cycles". Cache misses can change everything from run to run within a real application, so I would optimize cache misses first. Simply running a piece of code several times over the same region of memory decreases cache misses dramatically, which makes precise measurement hard. Tuning the whole application is usually a better idea, IMO. Intel VTune and similar tools are really good for that.
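
The warm-up effect described above can be observed with a small harness like the following sketch. It times the same work repeatedly and reports the first (likely cold) batch separately from the minimum and the median. It uses `std::chrono::steady_clock` rather than a cycle counter, and `BatchStats` and `time_batches` are hypothetical names, not part of Agner Fog's test suite or VTune.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Summary of repeated timings of the same piece of work.
struct BatchStats {
    double first_ns;      // first batch, before caches are warm
    double min_ns;        // fastest batch observed
    double median_ns;     // upper-middle element of the sorted samples
    std::size_t batches;  // number of batches timed
};

// Run `work` `batches` times and time each run with a monotonic clock.
template <typename F>
BatchStats time_batches(F&& work, std::size_t batches) {
    assert(batches > 0);
    using clock = std::chrono::steady_clock;
    std::vector<double> samples;
    samples.reserve(batches);
    for (std::size_t i = 0; i < batches; ++i) {
        const auto start = clock::now();
        work();
        const auto stop = clock::now();
        samples.push_back(
            std::chrono::duration<double, std::nano>(stop - start).count());
    }
    BatchStats stats;
    stats.first_ns = samples.front();  // recorded before sorting
    stats.batches = samples.size();
    std::sort(samples.begin(), samples.end());
    stats.min_ns = samples.front();
    stats.median_ns = samples[samples.size() / 2];
    return stats;
}
```

Comparing `first_ns` against `min_ns` gives a rough idea of how much of a single measurement is cold-cache cost. A large gap between `min_ns` and `median_ns` suggests interference from the rest of the system.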

挽你眉间 2024-12-30 04:31:50

I have been concerned about this issue a lot in the past, and I have come to the realization that there is only one ideal solution, though it requires so much work (mostly preparation) that I never actually did it this way.

The solution is to run your code in a 386 emulator that tells you exactly how many operations were executed. You should be able to find an open-source 386 emulator out there. It will be accurate to the instruction, and it will require only a single run of your test. If you do it, please post how you did it!
