Benchmarking CPU-bound algorithms / implementations

Posted 2024-12-23 04:31:50

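Let's say I'm writing my own StringBuilder in a compiled language (e.g. C++).

What is the best way to measure the performance of various implementations? Simply timing a few hundred thousand runs yields highly inconsistent results: the timings from one batch to the next can differ by as much as 15%, which makes it impossible to accurately assess potential improvements that yield gains smaller than that.

I have done the following:

Disabled SpeedStep
Used RDTSC for timing
Ran the process with realtime priority
Set the affinity to a single CPU core

This stabilized the results somewhat. Any other ideas?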
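To put a number on the batch-to-batch variation described above, a harness can time several batches and report the fastest, median, and slowest one. The following is only a minimal sketch: the names `BatchStats`, `time_batches`, and `append_workload` are hypothetical, `append_workload` merely stands in for the StringBuilder under test, and `std::chrono::steady_clock` is swapped in for RDTSC to keep the example portable.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <string>
#include <vector>

// Fastest, median, and slowest batch time, in nanoseconds per batch.
struct BatchStats {
    double min_ns;
    double median_ns;
    double max_ns;
};

// Runs `work` `iterations` times per batch for `batches` batches and
// summarizes the per-batch wall-clock times. Returns zeros if batches == 0.
template <typename Work>
BatchStats time_batches(Work work, std::size_t batches, std::size_t iterations) {
    using clock = std::chrono::steady_clock;
    if (batches == 0) return {0.0, 0.0, 0.0};

    volatile std::size_t sink = 0;  // keeps the optimizer from discarding work()
    std::vector<double> samples;
    samples.reserve(batches);
    for (std::size_t b = 0; b < batches; ++b) {
        const auto start = clock::now();
        for (std::size_t i = 0; i < iterations; ++i) sink = work();
        const auto stop = clock::now();
        samples.push_back(
            std::chrono::duration<double, std::nano>(stop - start).count());
    }
    std::sort(samples.begin(), samples.end());
    return {samples.front(), samples[samples.size() / 2], samples.back()};
}

// Hypothetical stand-in for the StringBuilder implementation being measured.
inline std::size_t append_workload() {
    std::string s;
    for (int i = 0; i < 64; ++i) s += "abc";
    return s.size();
}
```

Calling `time_batches(append_workload, 20, 100000)` and comparing `max_ns` against `min_ns` gives a rough measure of the spread; a candidate improvement smaller than that spread cannot be distinguished from noise with this setup.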

锦爱 2024-12-30 04:31:50

I have achieved 100% consistent results in this manner:

1. Set up Bochs with MS-DOS.
2. Set up your toolchain to target MS-DOS, or alternatively:
   1. Set up your toolchain to target 32-bit Windows.
   2. Install the HX-DOS extender in Bochs.
   3. If necessary, hack your toolkit's standard library / runtime and stub out or remove features requiring Windows APIs that are not implemented in HX-DOS. The extender will print a list of unimplemented APIs when you attempt to run the program.
3. Reduce the number of cycles in your benchmark by a few orders of magnitude.
4. Wrap the benchmark code with assembler cli / sti instructions (note that the binary won't run on modern OSes after this change).
5. If you haven't already, make your benchmark use rdtsc deltas for timing. The samples should be taken inside the cli…sti pair.
6. Run it in Bochs!
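The "rdtsc deltas" step can be sketched roughly as follows. This is an illustration under assumptions, not the code used in this answer: `read_ticks` and `min_tick_delta` are hypothetical names, keeping the minimum over several samples is a common convention rather than something stated above, and the cli / sti wrapping is deliberately left out because those instructions fault in user mode on modern operating systems. On non-x86 targets the sketch falls back to `std::chrono::steady_clock`.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
  #if defined(_MSC_VER)
    #include <intrin.h>     // __rdtsc on MSVC
  #else
    #include <x86intrin.h>  // __rdtsc on GCC / Clang
  #endif
  #define BENCH_HAVE_RDTSC 1
#else
  #include <chrono>
  #define BENCH_HAVE_RDTSC 0
#endif

// Reads the time stamp counter on x86; elsewhere returns steady_clock ticks.
inline std::uint64_t read_ticks() {
#if BENCH_HAVE_RDTSC
    return __rdtsc();
#else
    return static_cast<std::uint64_t>(
        std::chrono::steady_clock::now().time_since_epoch().count());
#endif
}

// Takes `repeats` tick deltas around `work` and returns the smallest one.
// Returns 0 when repeats <= 0. In the DOS setup described above, each
// sample would additionally sit between cli and sti (not done here).
template <typename Work>
std::uint64_t min_tick_delta(Work work, int repeats) {
    if (repeats <= 0) return 0;
    std::uint64_t best = std::numeric_limits<std::uint64_t>::max();
    for (int r = 0; r < repeats; ++r) {
        const std::uint64_t t0 = read_ticks();
        work();
        const std::uint64_t t1 = read_ticks();
        best = std::min(best, t1 - t0);
    }
    return best;
}
```

A call such as `min_tick_delta([] { /* code under test */ }, 100)` returns the smallest delta seen, in TSC ticks on x86 or in clock ticks on the fallback path.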

Bochs screenshot

The result seems to be completely deterministic, but is not an accurate assessment of overall performance (see the discussion under Osman Turan's answer for details).

As a bonus tip, here's an easy way to share files with Bochs (so you don't have to unmount/rebuild/remount the floppy image every time):

On Windows, Bochs will lock the floppy image file, but the file is still opened in shared-write mode. This means that you can't overwrite the file, but you can write to it. (I think *nix OSes might cause overwriting to create a new file, as far as file descriptors are concerned.) The trick is to use dd. I had the following batch script set up:

... benchmark build commands here ...
copy /Y C:\Path\To\Benchmark\Project\test2dos.exe floppy\test2.exe
bfi -t=288 -f=floppysrc.img floppy
dd if=floppysrc.img of=floppy.img

bfi is Bart's Build Floppy Image.

Then, just mount floppy.img in Bochs.

Bonus tip #2: To avoid having to manually start the benchmark every time in Bochs, put an empty go.txt file in the floppy directory, and run this batch in Bochs:

@echo off
A:
:loop
choice /T:y,1 > nul
if not exist go.txt goto loop
del go.txt
echo ---------------------------------------------------
test2
goto loop

It will start the test program every time it detects a fresh floppy image. This way, you can automate a benchmark run in a single script.

Update: this method is not very reliable. Sometimes the timings would change by as much as 200% just from reordering some tests (these timing changes were not observed when running on real hardware using the methods described in the original question).


带刺的爱情 2024-12-30 04:31:50

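It is really hard to measure a piece of code exactly. For such requirements, I recommend having a look at Agner Fog's test suite. With it you can measure clock cycles and also collect some important factors (such as cache misses, branch mispredictions, etc.).

I also recommend the PDF documents on Agner's site. They are invaluable for making this kind of micro-optimization possible.

As a side note, actual performance is not a function of "clock cycles". In a real application, cache misses can change everything from one run to the next, so I would optimize for cache misses first. Simply running a piece of code several times over the same region of memory decreases cache misses dramatically, which is what makes precise measurement hard. In my opinion, whole-application tuning is usually the better idea. Intel VTune and similar tools are well suited to that.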
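The point about repeated runs over the same memory can be observed with a small experiment that times a first pass over a buffer and then the best of several repeated passes. This is a rough sketch only: `PassTimings` and `time_first_and_repeated_passes` are hypothetical names, reading through a second buffer is merely an attempt to push the data out of cache before the first pass, and an aggressive optimizer or a small buffer may hide the difference entirely.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

struct PassTimings {
    double first_pass_ns;    // pass taken right after the eviction attempt
    double best_repeat_ns;   // fastest of the following repeated passes
    std::uint64_t checksum;  // sum of the data, returned so the work is used
    std::uint64_t junk_sum;  // sum of the eviction buffer, for the same reason
};

// Sums `element_count` integers once after reading `evict_bytes` of
// unrelated memory, then sums them again at least once (up to `repeats`
// times) and keeps the fastest repeat.
inline PassTimings time_first_and_repeated_passes(std::size_t element_count,
                                                  std::size_t evict_bytes,
                                                  int repeats) {
    using clock = std::chrono::steady_clock;
    std::vector<std::uint32_t> data(element_count, 1u);

    // Attempt to evict `data` from cache by reading another buffer.
    std::vector<unsigned char> junk(evict_bytes, 1);
    const std::uint64_t junk_sum =
        std::accumulate(junk.begin(), junk.end(), std::uint64_t{0});

    std::uint64_t checksum = 0;
    auto timed_pass = [&]() {
        const auto start = clock::now();
        checksum = std::accumulate(data.begin(), data.end(), std::uint64_t{0});
        const auto stop = clock::now();
        return std::chrono::duration<double, std::nano>(stop - start).count();
    };

    const double first = timed_pass();
    double best = timed_pass();
    for (int r = 1; r < repeats; ++r) best = std::min(best, timed_pass());
    return {first, best, checksum, junk_sum};
}
```

With a data buffer that fits in cache and an eviction buffer larger than the last-level cache, `first_pass_ns` will often exceed `best_repeat_ns`, which is the effect that makes a tight benchmark loop unrepresentative of a real application.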
