Is it still worth using the Quake fast inverse square root algorithm on x86-64?

Posted 2025-01-16 16:23:06

Specifically, this is the code I'm talking about:

float InvSqrt(float x) {
  float xhalf = 0.5f*x;
  int i = *(int*)&x;         // warning: strict-aliasing UB, use memcpy instead
  i = 0x5f375a86 - (i >> 1);
  x = *(float*)&i;           // same
  x = x*(1.5f - xhalf*x*x);
  return x;
}

I forget where I got this from, but it's apparently better (more efficient or more precise) than the original Quake III version, which uses a slightly different magic constant. It's been more than two decades since this algorithm was created, and I just want to know whether it's still worth using in terms of performance, or whether there's an instruction that already implements it in modern x86-64 CPUs.

Comments (1)

无法回应 · 2025-01-23 16:23:06

Origins:

See John Carmack's Unusual Fast Inverse Square Root (Quake III)


Modern usefulness on x86: none, obsoleted by SSE1 rsqrtss

Some other ISAs have hardware-supported approximate reciprocal and reciprocal square-root instructions, so much of this also applies outside of x86. Your high-level source code might need to use ISA-specific intrinsics to get compilers to use these instructions, maybe depending on fast-math settings. But for the rest of the answer I'll just discuss x86.

Use _mm_rsqrt_ps (or the scalar _mm_rsqrt_ss) to get a very approximate reciprocal sqrt for 4 floats in parallel, much faster than even a good compiler could do with this bithack (e.g. using SSE2 integer shift/add instructions to keep the FP bit pattern in an XMM register, which is probably not how it would actually compile anyway). The type-pun to integer via pointer casts is strict-aliasing UB in C and C++; use memcpy or C++20 std::bit_cast.
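
As an aside, here is a minimal UB-free rewrite of the question's function with std::bit_cast: same algorithm, just a well-defined type-pun (a sketch, assuming C++20):

#include <bit>      // std::bit_cast (C++20)
#include <cstdint>

float InvSqrt(float x) {
  float xhalf = 0.5f * x;
  std::uint32_t i = std::bit_cast<std::uint32_t>(x);  // well-defined type-pun
  i = 0x5f375a86 - (i >> 1);                          // magic-constant initial guess
  float y = std::bit_cast<float>(i);
  return y * (1.5f - xhalf * y * y);                  // one Newton-Raphson step
}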

https://www.felixcloutier.com/x86/rsqrtss documents the scalar version of the asm instruction, including the |relative error| ≤ 1.5 * 2^-12 guarantee. (i.e. about half the mantissa bits are correct.) One Newton-Raphson iteration can refine it to within 1 ulp of correct, although still not the 0.5 ulp you'd get from an actual sqrt. See Fast vectorized rsqrt and reciprocal with SSE/AVX depending on precision.
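
A sketch of that with SSE1 intrinsics (the helper name rsqrt_nr is mine; the refinement is the standard Newton-Raphson step y = 0.5*y*(3 - x*y*y)):

#include <xmmintrin.h>  // SSE1

// ~12-bit _mm_rsqrt_ps approximation refined with one Newton-Raphson step.
__m128 rsqrt_nr(__m128 x) {
  __m128 y = _mm_rsqrt_ps(x);               // |rel err| <= 1.5 * 2^-12
  __m128 half  = _mm_set1_ps(0.5f);
  __m128 three = _mm_set1_ps(3.0f);
  __m128 xyy = _mm_mul_ps(x, _mm_mul_ps(y, y));
  return _mm_mul_ps(_mm_mul_ps(half, y),    // y = 0.5*y * (3 - x*y*y)
                    _mm_sub_ps(three, xyy));
}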

rsqrtps performs only slightly slower than a mulps / mulss instruction on most CPUs, e.g. 5 cycle latency, 1/clock throughput. (With a Newton iteration to refine it, more uops.) Latency varies by microarchitecture, as low as 3 cycles in Zen 3, but Intel has run it at about 5c latency since Conroe at least (https://uops.info/).

The integer shift / subtract from the magic number in the Quake InvSqrt similarly provides an even rougher initial guess, and the rest (after type-punning the bit pattern back to a float) is a Newton-Raphson iteration.
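
For reference, that iteration is Newton-Raphson on f(y) = 1/y^2 - x, whose positive root is y = 1/sqrt(x):

y_new = y - f(y)/f'(y) = y * (1.5 - 0.5*x*y*y)

which is exactly the x = x*(1.5f - xhalf*x*x); line, with xhalf holding 0.5f*x.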


Compilers will even use rsqrtss for you when compiling sqrt with -ffast-math, depending on context and tuning options. (e.g. modern clang compiling 1.0f/sqrtf(x) with -O3 -ffast-math -march=skylake https://godbolt.org/z/fT86bKesb uses vrsqrtss and 3x vmulss plus an FMA.) When you're using sqrt as something other than a divisor, it's often not worth using rsqrt, but for reciprocal use-cases rsqrt + refinement plus a multiply avoids a division as well as a sqrt.
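
A source pattern as simple as this is enough to trigger it (this is what the Godbolt link above compiles):

#include <math.h>

// With clang -O3 -ffast-math -march=skylake this becomes
// vrsqrtss plus a mul/FMA refinement, with no divss or sqrtss.
float recip_sqrt(float x) {
  return 1.0f / sqrtf(x);
}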


Full-precision square root and division themselves are not as slow as they used to be, at least if you use them infrequently compared to mul/add/sub. (e.g. if you can hide the latency, one sqrt every 12 or so other operations might cost about the same, and it's still a single uop instead of the multiple uops of rsqrt + a Newton iteration.) See Floating point division vs floating point multiplication.
But sqrt and div do compete with each other for throughput, so needing to divide by a square root is a nasty case.
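
The classic instance is vector normalization, dividing by a length that is itself a sqrt. A sketch of the two approaches (rsqrt_nr_scalar is a hypothetical scalar analogue of the rsqrt_nr helper sketched above):

// Divide-by-sqrt: the sqrt and the three divides all compete
// for the same div/sqrt execution unit.
float len = sqrtf(x*x + y*y + z*z);
x /= len;  y /= len;  z /= len;

// vs. one approximate rsqrt + refinement, then three cheap multiplies:
float inv = rsqrt_nr_scalar(x*x + y*y + z*z);  // hypothetical scalar helper
x *= inv;  y *= inv;  z *= inv;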

So if you do have a loop over an array that mostly just does sqrt, not mixed with other math operations, that's a use-case for _mm_rsqrt_ps (and a Newton iteration) as a higher-throughput approximation than _mm_sqrt_ps.
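
A sketch of such a loop, reusing the rsqrt_nr helper from above (n is assumed to be a multiple of 4 for brevity):

#include <cstddef>
#include <xmmintrin.h>

// Approximate out[i] = 1/sqrt(in[i]) for n floats, assuming n % 4 == 0.
void inv_sqrt_array(float* out, const float* in, std::size_t n) {
  for (std::size_t i = 0; i < n; i += 4) {
    __m128 v = _mm_loadu_ps(in + i);
    _mm_storeu_ps(out + i, rsqrt_nr(v));  // rsqrt_nr: see the sketch above
  }
}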

But if you can combine that pass with something else to increase computational intensity, getting other work done overlapped with the div/sqrt unit being busy, it's often better to use a real sqrt instruction on its own, since that's still just 1 uop for the front-end to issue and for the back-end to track and execute. Compare a Newton iteration taking something like 5 uops if FMA is available for reciprocal square root, else more (and more again if a non-reciprocal sqrt is needed).

With Skylake for example having 1 per 3 cycle sqrtps xmm throughput (128-bit vectors), sqrt costs the same as a mul/add/sub/fma operation if you don't do more than one per 6 math operations: ports 0 and 1 can each start one mul/add/FMA every cycle, so 3 cycles of sqrt throughput overlap with 6 of those. (Throughput is worse for 256-bit YMM vectors: 6 cycles.) A Newton iteration would cost more uops, so if uops for ports 0/1 are the bottleneck, it's a win to just use sqrt directly. (This assumes out-of-order exec can hide the latency, typically when each loop iteration is independent.) This kind of situation is common if you're using a polynomial approximation as part of something like log or exp in a loop.

See also Fast vectorized rsqrt and reciprocal with SSE/AVX depending on precision re: performance on modern OoO exec CPUs.
