Is it still worth using the Quake fast inverse square root algorithm nowadays on x86-64?
Specifically, this is the code I'm talking about:
float InvSqrt(float x) {
    float xhalf = 0.5f*x;
    int i = *(int*)&x;            // warning: strict-aliasing UB, use memcpy instead
    i = 0x5f375a86 - (i >> 1);    // magic constant minus (bit pattern >> 1): rough 1/sqrt guess
    x = *(float*)&i;              // same strict-aliasing warning
    x = x*(1.5f - xhalf*x*x);     // one Newton-Raphson refinement step
    return x;
}
I forgot where I got this from, but it's apparently better, more efficient, or more precise than the original Quake III version (the magic constant is slightly different). It's been more than two decades since this algorithm was created, though, and I just want to know whether it's still worth using for performance, or whether there's an instruction that already implements it on modern x86-64 CPUs.
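For reference, here is a sketch of the same function with the type-puns done through C++20 std::bit_cast (or memcpy on older standards), avoiding the strict-aliasing UB flagged in the comments; the InvSqrt_safe name is just for illustration:

#include <bit>
#include <cstdint>

float InvSqrt_safe(float x) {
    float xhalf = 0.5f * x;
    std::int32_t i = std::bit_cast<std::int32_t>(x);  // well-defined alternative to *(int*)&x
    i = 0x5f375a86 - (i >> 1);
    x = std::bit_cast<float>(i);
    x = x * (1.5f - xhalf * x * x);                   // one Newton-Raphson step
    return x;
}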
Origins:
See John Carmack's Unusual Fast Inverse Square Root (Quake III)
Modern usefulness on x86: none; it's obsoleted by SSE1 rsqrtss.
Some other ISAs have hardware-supported approximate reciprocal and reciprocal square-root instructions, so much of this also applies outside of x86. Your high-level source code might need to use ISA-specific intrinsics to get compilers to use these instructions, maybe depending on fast-math settings. But for the rest of the answer I'll just discuss x86.
Use _mm_rsqrt_ps (or the _mm_rsqrt_ss scalar version) to get a very approximate reciprocal-sqrt for 4 floats in parallel, much faster than even a good compiler could do with this code (using SSE2 integer shift/add instructions to keep the FP bit pattern in an XMM register, which is probably not how it would actually compile anyway; the type-pun to integer is strict-aliasing UB in C or C++, so use memcpy or C++20 std::bit_cast). https://www.felixcloutier.com/x86/rsqrtss documents the scalar version of the asm instruction, including the |Relative Error| ≤ 1.5 * 2^-12 guarantee (i.e. about half the mantissa bits are correct). One Newton-Raphson iteration can refine it to within 1 ulp of correct, although still not the 0.5 ulp you'd get from an actual sqrt; see Fast vectorized rsqrt and reciprocal with SSE/AVX depending on precision.

rsqrtps performs only slightly slower than a mulps/mulss instruction on most CPUs, with something like 5 cycle latency and 1/clock throughput. (With a Newton iteration to refine it, more uops.) Latency varies by microarchitecture, as low as 3 cycles on Zen 3, but Intel has run it at about 5c latency since Conroe at least (https://uops.info/). The integer shift / subtract from the magic number in the Quake InvSqrt similarly provides an even rougher initial guess, and the rest (after type-punning the bit pattern back to a float) is a Newton-Raphson iteration.
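As a rough sketch of what the intrinsic version looks like (illustrative only, not tuned code; the rsqrt_nr name is mine), with the hardware estimate refined by the same Newton-Raphson formula the Quake code uses:

#include <xmmintrin.h>

// Approximate 1.0f/sqrtf(x): SSE1 rsqrtss estimate (~12 correct bits),
// then one Newton-Raphson step to get within about 1 ulp.
float rsqrt_nr(float x) {
    float y = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));  // hardware estimate
    return y * (1.5f - 0.5f * x * y * y);                  // Newton-Raphson refinement
}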
Compilers will even use rsqrtss for you when compiling sqrt with -ffast-math, depending on context and tuning options. (e.g. modern clang compiling 1.0f/sqrtf(x) with -O3 -ffast-math -march=skylake uses vrsqrtss and 3x vmulss plus an FMA: https://godbolt.org/z/fT86bKesb.) When you're using sqrt as something other than a divisor it's often not worth using rsqrt, but for reciprocal use-cases rsqrt + refinement plus a multiply avoids a division as well as a sqrt.

Full-precision square root and division themselves are not as slow as they used to be, at least if you use them infrequently compared to mul/add/sub. (e.g. if you can hide the latency, one sqrt every 12 or so other operations might cost about the same, and it's still a single uop instead of multiple uops for rsqrt + a Newton iteration.) See Floating point division vs floating point multiplication.
But sqrt and div do compete with each other for throughput, so needing to divide by a square root is a nasty case.
So if you have a loop over an array that mostly just does sqrt, not mixed with other math operations, that's a use-case for _mm_rsqrt_ps (and a Newton iteration) as a higher-throughput approximation than _mm_sqrt_ps.

But if you can combine that pass with something else to increase computational intensity and get more work done while the div/sqrt unit is kept busy, it's often better to use a real sqrt instruction on its own, since that's still just 1 uop for the front-end to issue and for the back-end to track and execute, vs. a Newton iteration taking something like 5 uops if FMA is available for reciprocal square root, else more (and more again if a non-reciprocal sqrt is needed).
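A sketch of that array use-case (illustrative only; the rsqrt_array name and the assumption that the element count is a multiple of 4 are mine):

#include <cstddef>
#include <xmmintrin.h>

// Approximate out[i] = 1.0f/sqrtf(in[i]) with rsqrtps plus one Newton-Raphson
// step per vector of 4 floats (n assumed to be a multiple of 4 for brevity).
void rsqrt_array(const float* in, float* out, std::size_t n) {
    const __m128 half  = _mm_set1_ps(0.5f);
    const __m128 three = _mm_set1_ps(3.0f);
    for (std::size_t i = 0; i < n; i += 4) {
        __m128 x = _mm_loadu_ps(in + i);
        __m128 y = _mm_rsqrt_ps(x);                            // ~12-bit estimate
        // Newton-Raphson refinement: y = 0.5*y*(3 - x*y*y)
        __m128 t = _mm_sub_ps(three, _mm_mul_ps(x, _mm_mul_ps(y, y)));
        y = _mm_mul_ps(_mm_mul_ps(half, y), t);
        _mm_storeu_ps(out + i, y);
    }
}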
With Skylake for example having 1 per 3 cycle sqrtps xmm throughput (128-bit vectors), a sqrt costs the same as a mul/add/sub/fma operation if you don't do more than one sqrt per 6 other math operations. (Skylake can start two mul/add/FMA uops per clock on ports 0 and 1, so that's 6 of them in the 3 cycles one sqrtps occupies the divide/sqrt unit. Throughput is worse for 256-bit YMM vectors: one per 6 cycles.) A Newton iteration would cost more uops, so if uops for ports 0/1 are the bottleneck, it's a win to just use sqrt directly. (This is assuming that out-of-order exec can hide the latency, typically when each loop iteration is independent.) This kind of situation is common if you're using a polynomial approximation as part of something like log or exp in a loop.

See also Fast vectorized rsqrt and reciprocal with SSE/AVX depending on precision re: performance on modern OoO exec CPUs.