Is it still worth using the Quake fast inverse square root algorithm on x86-64?

Posted 2025-01-16 16:23:06

Specifically, this is the code I'm talking about:

float InvSqrt(float x) {
  float xhalf = 0.5f*x;
  int i = *(int*)&x;         // warning: strict-aliasing UB, use memcpy instead
  i = 0x5f375a86 - (i >> 1);
  x = *(float*)&i;           // same
  x = x*(1.5f - xhalf*x*x);
  return x;
}

I forget where I got this from, but it's apparently better (more efficient or more precise) than the original Quake III version, which uses a slightly different magic constant. It's been more than two decades since this algorithm was created, and I just want to know whether it's still worth using in terms of performance, or whether there's an instruction that already implements it in modern x86-64 CPUs.

Comments (1)

无法回应 · 2025-01-23 16:23:06

Origins:

See John Carmack's Unusual Fast Inverse Square Root (Quake III)


Modern usefulness on x86: none, obsoleted by SSE1 rsqrtss

Some other ISAs have hardware-supported approximate reciprocal and reciprocal square-root instructions, so much of this also applies outside of x86. Your high-level source code might need to use ISA-specific intrinsics to get compilers to use these instructions, maybe depending on fast-math settings. But for the rest of the answer I'll just discuss x86.

Use _mm_rsqrt_ps (or the scalar _mm_rsqrt_ss) to get a very approximate reciprocal sqrt for 4 floats in parallel, much faster than even a good compiler could do with this bithack (e.g. using SSE2 integer shift/add instructions to keep the FP bit pattern in an XMM register, which is probably not how it would actually compile anyway). The type-pun to integer via pointer casts is strict-aliasing UB in C and C++; use memcpy or C++20 std::bit_cast.
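
As an aside, here is a minimal UB-free rewrite of the question's function with std::bit_cast: same algorithm, just a well-defined type-pun (a sketch, assuming C++20):

#include <bit>      // std::bit_cast (C++20)
#include <cstdint>

float InvSqrt(float x) {
  float xhalf = 0.5f * x;
  std::uint32_t i = std::bit_cast<std::uint32_t>(x);  // well-defined type-pun
  i = 0x5f375a86 - (i >> 1);                          // magic-constant initial guess
  float y = std::bit_cast<float>(i);
  return y * (1.5f - xhalf * y * y);                  // one Newton-Raphson step
}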

https://www.felixcloutier.com/x86/rsqrtss documents the scalar version of the asm instruction, including the |relative error| ≤ 1.5 * 2^-12 guarantee. (i.e. about half the mantissa bits are correct.) One Newton-Raphson iteration can refine it to within 1 ulp of correct, although still not the 0.5 ulp you'd get from an actual sqrt. See Fast vectorized rsqrt and reciprocal with SSE/AVX depending on precision.
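
A sketch of that with SSE1 intrinsics (the helper name rsqrt_nr is mine; the refinement is the standard Newton-Raphson step y = 0.5*y*(3 - x*y*y)):

#include <xmmintrin.h>  // SSE1

// ~12-bit _mm_rsqrt_ps approximation refined with one Newton-Raphson step.
__m128 rsqrt_nr(__m128 x) {
  __m128 y = _mm_rsqrt_ps(x);               // |rel err| <= 1.5 * 2^-12
  __m128 half  = _mm_set1_ps(0.5f);
  __m128 three = _mm_set1_ps(3.0f);
  __m128 xyy = _mm_mul_ps(x, _mm_mul_ps(y, y));
  return _mm_mul_ps(_mm_mul_ps(half, y),    // y = 0.5*y * (3 - x*y*y)
                    _mm_sub_ps(three, xyy));
}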

rsqrtps performs only slightly slower than a mulps / mulss instruction on most CPUs, e.g. 5 cycle latency, 1/clock throughput. (With a Newton iteration to refine it, more uops.) Latency varies by microarchitecture, as low as 3 cycles in Zen 3, but Intel has run it at about 5c latency since Conroe at least (https://uops.info/).

The integer shift / subtract from the magic number in the Quake InvSqrt similarly provides an even rougher initial guess, and the rest (after type-punning the bit pattern back to a float) is a Newton-Raphson iteration.
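
For reference, that iteration is Newton-Raphson on f(y) = 1/y^2 - x, whose positive root is y = 1/sqrt(x):

y_new = y - f(y)/f'(y) = y * (1.5 - 0.5*x*y*y)

which is exactly the x = x*(1.5f - xhalf*x*x); line, with xhalf holding 0.5f*x.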


Compilers will even use rsqrtss for you when compiling sqrt with -ffast-math, depending on context and tuning options. (e.g. modern clang compiling 1.0f/sqrtf(x) with -O3 -ffast-math -march=skylake https://godbolt.org/z/fT86bKesb uses vrsqrtss and 3x vmulss plus an FMA.) When you're using sqrt as something other than a divisor, it's often not worth using rsqrt, but for reciprocal use-cases rsqrt + refinement plus a multiply avoids a division as well as a sqrt.
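
A source pattern as simple as this is enough to trigger it (this is what the Godbolt link above compiles):

#include <math.h>

// With clang -O3 -ffast-math -march=skylake this becomes
// vrsqrtss plus a mul/FMA refinement, with no divss or sqrtss.
float recip_sqrt(float x) {
  return 1.0f / sqrtf(x);
}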


Full-precision square root and division themselves are not as slow as they used to be, at least if you use them infrequently compared to mul/add/sub. (e.g. if you can hide the latency, one sqrt every 12 or so other operations might cost about the same, and it's still a single uop instead of the multiple uops of rsqrt + a Newton iteration.) See Floating point division vs floating point multiplication.
But sqrt and div do compete with each other for throughput, so needing to divide by a square root is a nasty case.
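
The classic instance is vector normalization, dividing by a length that is itself a sqrt. A sketch of the two approaches (rsqrt_nr_scalar is a hypothetical scalar analogue of the rsqrt_nr helper sketched above):

// Divide-by-sqrt: the sqrt and the three divides all compete
// for the same div/sqrt execution unit.
float len = sqrtf(x*x + y*y + z*z);
x /= len;  y /= len;  z /= len;

// vs. one approximate rsqrt + refinement, then three cheap multiplies:
float inv = rsqrt_nr_scalar(x*x + y*y + z*z);  // hypothetical scalar helper
x *= inv;  y *= inv;  z *= inv;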

So if you do have a loop over an array that mostly just does sqrt, not mixed with other math operations, that's a use-case for _mm_rsqrt_ps (and a Newton iteration) as a higher-throughput approximation than _mm_sqrt_ps.
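
A sketch of such a loop, reusing the rsqrt_nr helper from above (n is assumed to be a multiple of 4 for brevity):

#include <cstddef>
#include <xmmintrin.h>

// Approximate out[i] = 1/sqrt(in[i]) for n floats, assuming n % 4 == 0.
void inv_sqrt_array(float* out, const float* in, std::size_t n) {
  for (std::size_t i = 0; i < n; i += 4) {
    __m128 v = _mm_loadu_ps(in + i);
    _mm_storeu_ps(out + i, rsqrt_nr(v));  // rsqrt_nr: see the sketch above
  }
}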

But if you can combine that pass with something else to increase computational intensity, getting other work done overlapped with the div/sqrt unit being busy, it's often better to use a real sqrt instruction on its own, since that's still just 1 uop for the front-end to issue and for the back-end to track and execute. Compare a Newton iteration taking something like 5 uops if FMA is available for reciprocal square root, else more (and more again if a non-reciprocal sqrt is needed).

With Skylake for example having 1 per 3 cycle sqrtps xmm throughput (128-bit vectors), sqrt costs the same as a mul/add/sub/fma operation if you don't do more than one per 6 math operations: ports 0 and 1 can each start one mul/add/FMA every cycle, so 3 cycles of sqrt throughput overlap with 6 of those. (Throughput is worse for 256-bit YMM vectors: 6 cycles.) A Newton iteration would cost more uops, so if uops for ports 0/1 are the bottleneck, it's a win to just use sqrt directly. (This assumes out-of-order exec can hide the latency, typically when each loop iteration is independent.) This kind of situation is common if you're using a polynomial approximation as part of something like log or exp in a loop.

See also Fast vectorized rsqrt and reciprocal with SSE/AVX depending on precision re: performance on modern OoO exec CPUs.
