定点的反平方根
我正在寻找定点 16.16 数字的最佳反平方根算法。下面的代码是我到目前为止所拥有的(但基本上它取平方根并除以原始数字,我想得到不除法的倒数平方根)。如果它发生任何改变,代码将为armv5te编译。
uint32_t INVSQRT(uint32_t n)
{
uint64_t op, res, one;
op = ((uint64_t)n<<16);
res = 0;
one = (uint64_t)1 << 46;
while (one > op) one >>= 2;
while (one != 0)
{
if (op >= res + one)
{
op -= (res + one);
res += (one<<1);
}
res >>= 1;
one >>= 2;
}
res<<=16;
res /= n;
return(res);
}
I am looking for the best inverse square root algorithm for fixed point 16.16 numbers. The code below is what I have so far(but basically it takes the square root and divides by the original number, and I would like to get the inverse square root without a division). If it changes anything, the code will be compiled for armv5te.
uint32_t INVSQRT(uint32_t n)
{
uint64_t op, res, one;
op = ((uint64_t)n<<16);
res = 0;
one = (uint64_t)1 << 46;
while (one > op) one >>= 2;
while (one != 0)
{
if (op >= res + one)
{
op -= (res + one);
res += (one<<1);
}
res >>= 1;
one >>= 2;
}
res<<=16;
res /= n;
return(res);
}
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
诀窍是将牛顿法应用于问题 x - 1/y^2 = 0。因此,给定 x,使用迭代方案求解 y。
除以 2 只是位移一位,或者最坏的情况是乘以 0.5。该方案完全按照要求收敛到 y=1/sqrt(x),并且根本没有任何真正的除法。
唯一的问题是你需要一个合适的 y 起始值。我记得迭代收敛的估计 y 是有限制的。
The trick is to apply Newton's method to the problem x - 1/y^2 = 0. So, given x, solve for y using an iterative scheme.
The divide by 2 is just a bit shift, or at worst, a multiply by 0.5. This scheme converges to y=1/sqrt(x), exactly as requested, and without any true divides at all.
The only problem is that you need a decent starting value for y. As I recall there are limits on the estimate y for the iterations to converge.
ARMv5TE 处理器提供快速整数乘法器和“计数前导零”指令。它们通常还配备中等大小的缓存。基于此,最适合高性能实现的方法似乎是查找初始近似值,然后进行两次牛顿-拉夫森迭代以获得完全准确的结果。我们可以通过将额外的预计算纳入表中,进一步加快第一次迭代的速度,这是 Cray 计算机四十年前使用的技术。
下面的函数
fxrsqrt()
实现了这种方法。它以参数a
的倒数平方根的 8 位近似值r
开始,但不是存储r
,而是每个表元素存储3r(在32位条目的低十位中)和r3(在32位条目的高22位中)。这允许快速计算第一次迭代:r1 = 0.5 * (3 * r - a * r3)。然后以传统方式计算第二次迭代: r2 = 0.5 * r1 * (3 - r1 * (r1 * a))。
为了能够准确地执行这些计算,无论输入的大小如何,参数
a
在计算开始时都会被标准化,本质上将其表示为2.32
定点数乘以比例因子 2scal。计算结束时,根据公式 1/sqrt(22n) = 2-n 对结果进行非规格化。通过对最高有效丢弃位为 1 的结果进行舍入,可以提高准确性,从而导致几乎所有结果都被正确舍入。详尽的测试报告:结果太低:639 太高:1454 未正确舍入:2093
该代码使用两个辅助函数:
__clz()
确定前导数非零 32 位参数中的零位。__umulhi()
计算两个无符号 32 位整数的完整 64 位乘积的 32 个最高有效位。这两个函数都应该通过编译器内部函数或使用一些内联汇编来实现。在下面的代码中,我展示了非常适合 ARM CPU 的可移植实现以及 x86 平台的内联汇编版本。在 ARMv5TE 平台上,__clz()
应映射到CLZ
指令,__umulhi()
应映射到UMULL
代码>.ARMv5TE processors provide a fast integer multiplier, and a "count leading zeros" instruction. They also typically come with moderately sized caches. Based on this, the most suitable approach for a high-performance implementation appears to be a table lookup for an initial approximation, followed by two Newton-Raphson iterations to achieve fully accurate results. We can speed up the first of these iterations further with additional pre-computation that is incorporated into the table, a technique used by Cray computers forty years ago.
The function
fxrsqrt()
below implements this approach. It starts out with an 8-bit approximationr
to the reciprocal square root of the argumenta
, but instead of storingr
, each table element stores 3r (in the lower ten bits of the 32-bit entry) and r3 (in the upper 22 bits of the 32-bit entry). This allows the quick computation of the first iteration asr1 = 0.5 * (3 * r - a * r3). The second iteration is then computed in the conventional way as r2 = 0.5 * r1 * (3 - r1 * (r1 * a)).
To be able to perform these computations accurately, regardless of the magnitude of the input, the argument
a
is normalized at the start of the computation, in essence representing it as a2.32
fixed-point number multiplied with a scale factor of 2scal. At the end of the computation the result is denormalized according to formula 1/sqrt(22n) = 2-n. By rounding up results whose most significant discarded bit is 1, accuracy is improved, resulting in almost all results being correctly rounded. The exhaustive test reports:results too low: 639 too high: 1454 not correctly rounded: 2093
The code makes use of two helper functions:
__clz()
determines the number of leading zero bits in a non-zero 32-bit argument.__umulhi()
computes the 32 most significant bits of a full 64-bit product of two unsigned 32-bit integers. Both functions should be implemented either via compiler intrinsics, or by using a bit of inline assembly. In the code below I am showing portable implementations well suited to ARM CPUs along with inline assembly versions for x86 platforms. On ARMv5TE platforms__clz()
should be mapped map to theCLZ
instruction, and__umulhi()
should be mapped toUMULL
.我有一个解决方案,我将其描述为“快速反平方根,但针对 32 位定点”。没有表格,没有参考资料,只是直接开门见山地猜测。
如果需要,请跳转到下面的源代码,但要注意一些事情。
(x * y)>>16
可以替换为您想要的任何定点乘法方案。I have a solution that I characterize as "fast inverse sqrt, but for 32bit fixed points". No table, no reference, just straight to the point with a good guess.
If you want, jump to the source code below, but beware of a few things.
(x * y)>>16
can be replaced with any fixed-point multiplication scheme you want.