How do I flush the CPU cache on x86 Windows?

Posted 2024-08-12 09:12:54

I am interested in forcing a CPU cache flush in Windows (for benchmarking reasons, I want to emulate starting with no data in CPU cache), preferably a basic C implementation or Win32 call.

Is there a known way to do this with a system call or even something as sneaky as doing say a large memcpy?

Intel i686 platform (P4 and up is okay as well).

Comments (4)

自此以后,行同陌路 2024-08-19 09:12:55

Fortunately, there is more than one way to explicitly flush the caches.

The instruction "wbinvd" writes back modified cache content and marks the caches empty. It executes a bus cycle to make external caches flush their data. Unfortunately, it is a privileged instruction. But if it is possible to run the test program under something like DOS, this is the way to go. This has the advantage of keeping the cache footprint of the "OS" very small.

Additionally, there is the "invd" instruction, which invalidates caches without flushing them back to main memory. This violates the coherency of main memory and cache, so you have to take care of that by yourself. Not really recommended.

For benchmarking purposes, the simplest solution is probably copying a large memory block to a region marked with WC (write combining) instead of WB. The memory mapped region of the graphics card is a good candidate, or you can mark a region as WC by yourself via the MTRR registers.

You can find some resources about benchmarking short routines in "Test programs for measuring clock cycles and performance monitoring".

旧梦荧光笔 2024-08-19 09:12:55

There are x86 instructions that force the CPU to flush individual cache lines (such as CLFLUSH), but they are fairly obscure. CLFLUSH in particular flushes only the cache line containing the address you give it, though it does so from every level of cache (L1, L2, L3).
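As a rough illustration of the CLFLUSH route, here is a minimal C sketch that flushes every cache line covering a buffer. It uses the `_mm_clflush` and `_mm_mfence` intrinsics from `<emmintrin.h>` (available in MSVC, GCC and Clang when targeting SSE2). The helper name `flush_range` and the 64-byte `CACHE_LINE_SIZE` are assumptions made for this example; the real line size is reported by CPUID.

```c
#include <emmintrin.h>  /* _mm_clflush, _mm_mfence (SSE2) */
#include <stddef.h>
#include <stdint.h>

/* Assumed cache line size. 64 bytes is common on x86, but this is an
   assumption of the sketch; CPUID reports the actual value. */
#define CACHE_LINE_SIZE 64

/* Hypothetical helper: flush every cache line that overlaps
   [buf, buf + len). Returns the number of CLFLUSH calls issued. */
static size_t flush_range(const void *buf, size_t len)
{
    if (len == 0)
        return 0;

    /* Round the start address down to a cache line boundary. */
    uintptr_t start = (uintptr_t)buf & ~(uintptr_t)(CACHE_LINE_SIZE - 1);
    uintptr_t end = (uintptr_t)buf + len;
    size_t lines = 0;

    _mm_mfence();  /* order earlier loads/stores before the flushes */
    for (uintptr_t p = start; p < end; p += CACHE_LINE_SIZE) {
        _mm_clflush((const void *)p);
        lines++;
    }
    _mm_mfence();  /* make sure the flushes finish before timing starts */
    return lines;
}
```

Calling `flush_range(data, size)` right before the timed section evicts only that buffer, so it suits benchmarks where you know exactly which data must start cold. It does nothing for code, stack or other data that may also be cached.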

"something as sneaky as doing say a large memcpy?"

Yes, this is the simplest approach. As long as the block is comfortably larger than the last-level cache, it should push your data out of every cache level. Just exclude the cache flushing time from your benchmarks and you should get a good idea of how your program performs under cache pressure.
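A minimal sketch of that idea in portable C follows. The function name `evict_caches`, the 64-byte `LINE_BYTES` stride and the suggested 64 MiB `EVICT_BYTES` size are assumptions for illustration; choose a size several times larger than your CPU's last-level cache.

```c
#include <stddef.h>
#include <stdlib.h>

/* Assumed values for this sketch: a 64-byte cache line and a buffer
   that is several times larger than the last-level cache. */
#define LINE_BYTES  64
#define EVICT_BYTES ((size_t)64 * 1024 * 1024)

/* Hypothetical helper: walk a freshly allocated buffer of `bytes`
   bytes, touching one byte per cache line, first writing and then
   reading. Returns the number of lines read, or 0 if the allocation
   failed. */
static size_t evict_caches(size_t bytes)
{
    /* volatile keeps the compiler from optimizing the accesses away */
    volatile unsigned char *buf = malloc(bytes);
    size_t lines = 0;
    unsigned sum = 0;

    if (buf == NULL)
        return 0;

    /* Write pass: pulls each line of the buffer into the cache,
       displacing whatever was there before. */
    for (size_t i = 0; i < bytes; i += LINE_BYTES)
        buf[i] = (unsigned char)i;

    /* Read pass: walks the buffer a second time. */
    for (size_t i = 0; i < bytes; i += LINE_BYTES) {
        sum += buf[i];
        lines++;
    }

    (void)sum;
    free((void *)buf);
    return lines;
}
```

In a benchmark loop you would call `evict_caches(EVICT_BYTES)` before starting the timer on each iteration. Cache replacement policies are not strictly LRU, so this makes it very likely, not certain, that the data under test starts cold.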

玩套路吗 2024-08-19 09:12:55

There is unfortunately no way to explicitly flush the cache. A few of your options are:

1.) Thrash the cache by doing some very large memory operations between iterations of the code you're benchmarking.

2.) Enable Cache Disable in the x86 Control Registers and benchmark that. This will probably disable the instruction cache also, which may not be what you want.

3.) Implement the portion of the code you're benchmarking (if possible) using non-temporal instructions. These are only hints to the processor about cache usage, though; it is still free to do what it wants.

1 is probably the easiest and sufficient for your purposes.
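For option 3, a minimal sketch of a non-temporal store loop in C is shown below, using the SSE2 intrinsic `_mm_stream_si32` and `_mm_sfence` from `<emmintrin.h>`. The helper name `fill_nontemporal` is made up for this example, and as noted above the processor treats the non-temporal hint as advisory.

```c
#include <emmintrin.h>  /* _mm_stream_si32 (SSE2); pulls in _mm_sfence */
#include <stddef.h>

/* Hypothetical helper: fill `count` ints at `dst` with `value` using
   non-temporal (streaming) stores, which ask the CPU to write the data
   without keeping it in the cache. */
static void fill_nontemporal(int *dst, size_t count, int value)
{
    for (size_t i = 0; i < count; i++)
        _mm_stream_si32(&dst[i], value);

    /* Streaming stores are weakly ordered; the fence makes them
       globally visible before any later stores. */
    _mm_sfence();
}
```

This only helps for the stores your own code issues; any loads the benchmarked code performs still go through the cache as usual.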

Edit: Oops, I stand corrected: there is an instruction to invalidate the x86 cache; see drhirsch's answer.

鹤仙姿 2024-08-19 09:12:55

The x86 instruction WBINVD writes back and invalidates all caches. It is described as:

Writes back all modified cache lines in the processor’s internal cache to main memory and invalidates (flushes) the internal caches. The instruction then issues a special-function bus cycle that directs external caches to also write back modified data and another bus cycle to indicate that the external caches should be invalidated.

Importantly, the instruction can only be executed in ring0, i.e. the operating system. So your userland programs can't simply use it. On Linux, you can write a kernel module that can execute that instruction on demand. Actually, someone already wrote such a kernel module: https://github.com/batmac/wbinvd

Luckily, the kernel module's code is really tiny, so you can actually check it before loading code from strangers on the internet into your kernel. You can use that module (and trigger executing the WBINVD instruction) by reading /proc/wbinvd, for example via cat /proc/wbinvd.

However, I found that this instruction (or at least this kernel module) is really slow. On my i7-6700HQ I measured it to take 750µs! This number seems really high to me, so I might have made a mistake measuring it; please keep that in mind. The documentation for the instruction just says:

The amount of time or cycles for WBINVD to complete will vary due to size and other factors of different cache hierarchies.
