Are XGETBV and CPUID checks sufficient to guarantee AVX2 support?

Posted 2025-02-05 03:04:44 | Word count: 1510 | Views: 4 | Comments: 0

In this question, it is confirmed that __builtin_cpu_supports("avx2") doesn't check for OS support. (Or at least, it didn't before GCC fixed the bug). From Intel docs, I know that in addition to checking the CPUID bits we need to check something related to the x86-64 instruction xgetbv. The Intel docs linked above provide this code for the check:

int check_xcr0_ymm()
{
    uint32_t xcr0;
#if defined(_MSC_VER)
    xcr0 = (uint32_t)_xgetbv(0);  /* min VS2010 SP1 compiler is required */
#else
    __asm__ ("xgetbv" : "=a" (xcr0) : "c" (0) : "%edx" );
#endif
    return ((xcr0 & 6) == 6); /* checking if xmm and ymm state are enabled in XCR0 */
}

Question: Is this check plus the CPUID check sufficient to guarantee AVX2 instructions won't crash my program?

Bonus Question: What is this check actually doing? Why does it exist? (There is some discussion of this here and here, but I think the topic deserves a dedicated answer).

Notes:

this question is on a similar topic, but the answers don't cover xgetbv.
this question is similar, but asks about Windows specifically. I'm interested in a cross-platform solution.

颜漓半夏 2025-02-12 03:04:45

Yes, CPUID + checking those XCR0 bits is sufficient, assuming an OS that isn't broken (and follows the expected conventions).

This also assumes that a virtual machine or emulator doesn't lie in its CPUID results, telling you AVX2 is available when the instructions would actually fault. But if either of those things happens, it's the OS's or VM's fault, not your program's.

(For compatibility with quite old CPUs, you need to use CPUID to check whether XGETBV is even supported before using it; otherwise XGETBV itself will fault. A good AVX detection function will do this.
See also Which versions of Windows support/require which CPU multimedia extensions? (How to check if SSE or AVX are fully usable?) - my answer there focuses mostly on the latter and isn't Windows specific.)
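As a rough illustration of the order of checks described above, here is a minimal sketch for GCC/Clang on x86. The function names avx2_usable_from_bits and avx2_usable are hypothetical, and the bit positions (OSXSAVE = CPUID.1:ECX bit 27, AVX = bit 28, AVX2 = CPUID.(7,0):EBX bit 5) come from Intel's manuals rather than from this answer. It is not meant as a definitive implementation; MSVC would need __cpuidex and _xgetbv instead.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang helper header providing __get_cpuid_count */
#endif

/* Pure decision logic, kept separate so it can be exercised with made-up
 * register values. leaf1_ecx is ECX from CPUID leaf 1, xcr0 is the value
 * read by XGETBV with ECX=0, leaf7_ebx is EBX from CPUID leaf 7 subleaf 0. */
int avx2_usable_from_bits(uint32_t leaf1_ecx, uint64_t xcr0, uint32_t leaf7_ebx)
{
    const uint32_t osxsave = 1u << 27;  /* OS has enabled XSAVE/XGETBV */
    const uint32_t avx     = 1u << 28;  /* CPU supports AVX */
    const uint32_t avx2    = 1u << 5;   /* CPU supports AVX2 */

    if ((leaf1_ecx & osxsave) == 0) return 0;  /* XGETBV would fault */
    if ((leaf1_ecx & avx) == 0)     return 0;
    if ((xcr0 & 6) != 6)            return 0;  /* XMM and YMM state not both enabled */
    return (leaf7_ebx & avx2) != 0;
}

/* Reads the real registers. Returns 0 on non-x86 builds. */
int avx2_usable(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned eax, ebx, ecx, edx;

    if (!__get_cpuid_count(1, 0, &eax, &ebx, &ecx, &edx)) return 0;
    uint32_t leaf1_ecx = ecx;
    if ((leaf1_ecx & (1u << 27)) == 0) return 0;  /* must not execute XGETBV */

    uint32_t lo, hi;
    __asm__ volatile ("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    uint64_t xcr0 = ((uint64_t)hi << 32) | lo;

    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) return 0;
    return avx2_usable_from_bits(leaf1_ecx, xcr0, ebx);
#else
    return 0;
#endif
}
```

A program would call avx2_usable() once at startup and pick an AVX2 or fallback code path based on the result. Splitting out the bit logic is only a design choice to make the decision testable without the hardware.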

If you just checked CPUID, you'd find that the CPU supported AVX2 even if that CPU was running an old OS that didn't know about AVX, and only saved/restored XMM registers on context-switch, not YMM.

Intel designed things so the failure mode would be an illegal-instruction fault (#UD) in that case, rather than silently corrupting user-space state if multiple threads / processes used YMM or ZMM registers. (Because that would be horrible.)

(Every task is supposed to have its own private register state, including integer and FP/SIMD registers. Context switching without saving/restoring the high halves of the YMM registers would effectively be asynchronously corrupting registers, if you look at program-order execution for a single thread.)

The mechanism is that the OS has to set some bits in XCR0 (extended control register 0), which user-space can check via xgetbv. If those bits are set, it's effectively a promise that the OS is AVX-aware and will save/restore YMM regs, and that it will set the other control-register bits needed so SSE and AVX instructions actually work without faulting.

I'm not sure if these bits actually affect the CPU behaviour at all, or if they only exist as a communication mechanism for the kernel to advertise AVX support to user-space.

(YMM registers were new with AVX1, and XMM were new with SSE1. The OS doesn't need to know about SSE4.x or AVX2, just how to save the new architectural state. So AVX-512 is the next SIMD extension that needed new OS support.)
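To make the "new architectural state" point concrete, here is a small sketch of how the XCR0 state-component bits map to AVX and AVX-512. The bit assignments (SSE = bit 1, AVX = bit 2, opmask = bit 5, ZMM_Hi256 = bit 6, Hi16_ZMM = bit 7) are taken from Intel's manuals, not from this answer, and the names below are hypothetical.

```c
#include <stdint.h>

/* XCR0 state-component bits, per Intel's documentation. */
#define XCR0_SSE        (1u << 1)  /* XMM registers */
#define XCR0_AVX        (1u << 2)  /* upper halves of YMM0-15 */
#define XCR0_OPMASK     (1u << 5)  /* AVX-512 mask registers k0-k7 */
#define XCR0_ZMM_HI256  (1u << 6)  /* upper halves of ZMM0-15 */
#define XCR0_HI16_ZMM   (1u << 7)  /* ZMM16-31 */

/* True if the OS has enabled the state needed for AVX1/AVX2. */
int xcr0_has_avx_state(uint64_t xcr0)
{
    const uint64_t mask = XCR0_SSE | XCR0_AVX;          /* 0x06 */
    return (xcr0 & mask) == mask;
}

/* True if the OS has enabled the state needed for AVX-512. */
int xcr0_has_avx512_state(uint64_t xcr0)
{
    const uint64_t mask = XCR0_SSE | XCR0_AVX | XCR0_OPMASK |
                          XCR0_ZMM_HI256 | XCR0_HI16_ZMM; /* 0xE6 */
    return (xcr0 & mask) == mask;
}
```

Note that AVX2 has no bit of its own here: it reuses the YMM state that AVX1 introduced, which is why the same mask covers both.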

Update: I think those XCR0 bits actually do control whether AVX1/2 and AVX-512 instructions will #UD. MacOS's Darwin kernel apparently only does on-demand AVX-512 support, so the first usage will fault (but then the kernel handles it and re-runs, transparently to user-space I hope). See the source:

// darwin-xnu .../i386/fpu.c#L176
/*
 * On-demand AVX512 support
 * ------------------------
 * On machines with AVX512 support, by default, threads are created with
 * AVX512 masked off in XCR0 and an AVX-sized savearea is used. However, AVX512
 * capabilities are advertised in the commpage and via sysctl. If a thread
 * opts to use AVX512 instructions, the first will result in a #UD exception.
 * Faulting AVX512 intructions are recognizable by their unique prefix.
 * This exception results in the thread being promoted to use an AVX512-sized
 * savearea and for the AVX512 bit masks being set in its XCR0. The faulting
 * instruction is re-driven and the thread can proceed to perform AVX512
 * operations.
 *
 * In addition to AVX512 instructions causing promotion, the thread_set_state()
 * primitive with an AVX512 state flavor result in promotion.
 *
 * AVX512 promotion of the first thread in a task causes the default xstate
 * of the task to be promoted so that any subsequently created or subsequently
 * DNA-faulted thread will have AVX512 xstate and it will not need to fault-in
 * a promoted xstate.
 *
 * Two savearea zones are used: the default pool of AVX-sized (832 byte) areas
 * and a second pool of larger AVX512-sized (2688 byte) areas.
 *
 * Note the initial state value is an AVX512 object but that the AVX initial
 * value is a subset of it.
 */

So on MacOS, it seems XGETBV + checking XCR0 might not be a guaranteed way to detect usability of AVX-512 instructions! The comment says "capabilities are advertised in the commpage and via sysctl", so you need some OS-specific way.
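One possible shape for that OS-specific query is sketched below, using sysctlbyname. The key name "hw.optional.avx512f" is my assumption about how the capability is exposed; the quoted kernel comment only says "via sysctl" without naming a key. The function name macos_reports_avx512f is hypothetical.

```c
#include <stddef.h>

#if defined(__APPLE__)
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

/* Returns 1 if the OS reports AVX-512F, 0 if it reports it as absent,
 * and -1 if the question can't be answered this way (not macOS, or the
 * assumed sysctl key doesn't exist on this machine). */
int macos_reports_avx512f(void)
{
#if defined(__APPLE__)
    int value = 0;
    size_t size = sizeof value;

    /* "hw.optional.avx512f" is an assumed key name, see the note above. */
    if (sysctlbyname("hw.optional.avx512f", &value, &size, NULL, 0) != 0)
        return -1;
    return value != 0;
#else
    return -1;
#endif
}
```

On other platforms the caller would fall back to the CPUID + XCR0 check; a return of -1 just means "ask some other way".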

But that's AVX-512; probably AVX1/2 is always enabled so checking XCR0 for that will work everywhere, including MacOS.

Lazy context switches used to be a thing

Some OSes used to use "lazy" context switches, not actually saving/restoring the x87, XMM, and maybe even YMM registers until the new process actually used them. This was done by using a separate control-register bit that made those types of instructions fault if executed; in that fault handler, the OS would save state from the previous task on this core, and load state from the new task. Then change the control bit and return to user-space to rerun the instruction.
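The lazy scheme can be modelled in a few lines of ordinary user-space C. This is a toy simulation, not kernel code: the trap_on_fpu flag stands in for the control-register bit (on real x86 that is CR0.TS, which is my addition and not named in the text above), and all struct and function names are hypothetical.

```c
#include <string.h>

struct task {
    double fpu_regs[4];          /* saved FPU/SIMD state of this task */
};

struct cpu {
    double fpu_regs[4];          /* live register file */
    int trap_on_fpu;             /* stand-in for the control-register bit */
    struct task *fpu_owner;      /* task whose state is live in the registers */
    struct task *current;        /* task currently running */
    int faults;                  /* number of simulated faults taken */
};

/* Lazy switch: leave the FPU registers alone, just arm the trap unless
 * the incoming task's state is already live. */
void context_switch(struct cpu *c, struct task *next)
{
    c->current = next;
    c->trap_on_fpu = (c->fpu_owner != next);
}

/* Fault handler: save the old owner's state, load the current task's,
 * then clear the trap so the instruction can be rerun. */
static void fpu_fault(struct cpu *c)
{
    c->faults++;
    if (c->fpu_owner)
        memcpy(c->fpu_owner->fpu_regs, c->fpu_regs, sizeof c->fpu_regs);
    memcpy(c->fpu_regs, c->current->fpu_regs, sizeof c->fpu_regs);
    c->fpu_owner = c->current;
    c->trap_on_fpu = 0;
}

/* Simulated FPU instructions: fault first if the trap is armed. */
void fpu_write(struct cpu *c, double v)
{
    if (c->trap_on_fpu) fpu_fault(c);
    c->fpu_regs[0] = v;
}

double fpu_read(struct cpu *c)
{
    if (c->trap_on_fpu) fpu_fault(c);
    return c->fpu_regs[0];
}
```

In this model, a task that is switched in and out without touching the FPU never costs a save/restore, which was the whole point of the lazy strategy.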

But these days most processes use XMM (and YMM) registers all over the place, in memcpy and other libc functions, and for copying/initializing structs. So a lazy strategy isn't worth it, and is just a lot of extra complexity, especially on an SMP system. That's why modern kernels don't do that anymore.

The control-register bits that a kernel would use to make x87, XMM, or YMM instructions fault are separate from the XCR0 bits we're checking, so even on a system using lazy context switching, your detection won't be fooled by the OS temporarily having the CPU set up so vaddps xmm0, xmm1, xmm2 would fault.

When SSE1 was new, there was no user-space-visible bit for detecting SSE-aware OSes without using an OS-specific API, but Intel learned from that mistake for AVX. (With SSE, the failure mode is still faulting, not corruption, though. The CPU boots up with SSE instructions set to fault: How do I enable SSE for my freestanding bootable code?)
