定点的反平方根

发布于 2024-11-14 13:44:20 字数 568 浏览 3 评论 0原文

我正在寻找定点 16.16 数字的最佳反平方根算法。下面的代码是我到目前为止所拥有的（但基本上它取平方根并除以原始数字，我想得到不除法的倒数平方根）。如果它发生任何改变，代码将为armv5te编译。

uint32_t INVSQRT(uint32_t n)
{
    uint64_t op, res, one;
    op = ((uint64_t)n<<16);
    res = 0;
    one = (uint64_t)1 << 46;
    while (one > op) one >>= 2;
    while (one != 0)
    {
        if (op >= res + one)
        {
            op -= (res + one);
            res +=  (one<<1);
        }
        res >>= 1;
        one >>= 2;
    }
    res<<=16;
    res /= n;
    return(res);
}

原文

I am looking for the best inverse square root algorithm for fixed point 16.16 numbers. The code below is what I have so far(but basically it takes the square root and divides by the original number, and I would like to get the inverse square root without a division). If it changes anything, the code will be compiled for armv5te.

uint32_t INVSQRT(uint32_t n)
{
    uint64_t op, res, one;
    op = ((uint64_t)n<<16);
    res = 0;
    one = (uint64_t)1 << 46;
    while (one > op) one >>= 2;
    while (one != 0)
    {
        if (op >= res + one)
        {
            op -= (res + one);
            res +=  (one<<1);
        }
        res >>= 1;
        one >>= 2;
    }
    res<<=16;
    res /= n;
    return(res);
}

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

踏雪无痕 2024-11-21 13:44:20

诀窍是将牛顿法应用于问题 x - 1/y^2 = 0。因此，给定 x，使用迭代方案求解 y。

Y_(n+1) = y_n * (3 - x*y_n^2)/2

除以 2 只是位移一位，或者最坏的情况是乘以 0.5。该方案完全按照要求收敛到 y=1/sqrt(x)，并且根本没有任何真正的除法。

唯一的问题是你需要一个合适的 y 起始值。我记得迭代收敛的估计 y 是有限制的。

The trick is to apply Newton's method to the problem x - 1/y^2 = 0. So, given x, solve for y using an iterative scheme.

Y_(n+1) = y_n * (3 - x*y_n^2)/2

The divide by 2 is just a bit shift, or at worst, a multiply by 0.5. This scheme converges to y=1/sqrt(x), exactly as requested, and without any true divides at all.

The only problem is that you need a decent starting value for y. As I recall there are limits on the estimate y for the iterations to converge.

回复收藏 0 原文

何以心动 2024-11-21 13:44:20

ARMv5TE 处理器提供快速整数乘法器和“计数前导零”指令。它们通常还配备中等大小的缓存。基于此，最适合高性能实现的方法似乎是查找初始近似值，然后进行两次牛顿-拉夫森迭代以获得完全准确的结果。我们可以通过将额外的预计算纳入表中，进一步加快第一次迭代的速度，这是 Cray 计算机四十年前使用的技术。

下面的函数 fxrsqrt() 实现了这种方法。它以参数 a 的倒数平方根的 8 位近似值 r 开始，但不是存储 r，而是每个表元素存储3r（在32位条目的低十位中）和r³（在32位条目的高22位中）。这允许快速计算第一次迭代：
r₁ = 0.5 * (3 * r - a * r³)。然后以传统方式计算第二次迭代： r₂ = 0.5 * r₁ * (3 - r₁ * (r₁ * a))。

为了能够准确地执行这些计算，无论输入的大小如何，参数 a 在计算开始时都会被标准化，本质上将其表示为 2.32定点数乘以比例因子 2^scal。计算结束时，根据公式 1/sqrt(2²ⁿ) = 2^-n 对结果进行非规格化。通过对最高有效丢弃位为 1 的结果进行舍入，可以提高准确性，从而导致几乎所有结果都被正确舍入。详尽的测试报告：结果太低：639 太高：1454 未正确舍入：2093

该代码使用两个辅助函数：__clz() 确定前导数非零 32 位参数中的零位。 __umulhi() 计算两个无符号 32 位整数的完整 64 位乘积的 32 个最高有效位。这两个函数都应该通过编译器内部函数或使用一些内联汇编来实现。在下面的代码中，我展示了非常适合 ARM CPU 的可移植实现以及 x86 平台的内联汇编版本。在 ARMv5TE 平台上，__clz() 应映射到 CLZ 指令，__umulhi() 应映射到 UMULL代码>.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>

#define USE_OWN_INTRINSICS 1

#if USE_OWN_INTRINSICS
__forceinline int __clz (uint32_t a)
{
    int r;
    __asm__ ("bsrl %1,%0\n\t" : "=r"(r): "r"(a));
    return 31 - r;
}

uint32_t __umulhi (uint32_t a, uint32_t b)
{
    uint32_t r;
    __asm__ ("movl %1,%%eax\n\tmull %2\n\tmovl %%edx,%0\n\t"
             : "=r"(r) : "r"(a), "r"(b) : "eax", "edx");
    return r;
}
#else // USE_OWN_INTRINSICS
int __clz (uint32_t a)
{
    uint32_t r = 32;
    if (a >= 0x00010000) { a >>= 16; r -= 16; }
    if (a >= 0x00000100) { a >>=  8; r -=  8; }
    if (a >= 0x00000010) { a >>=  4; r -=  4; }
    if (a >= 0x00000004) { a >>=  2; r -=  2; }
    r -= a - (a & (a >> 1));
    return r;
}

uint32_t __umulhi (uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) >> 32);
}
#endif // USE_OWN_INTRINSICS

/*
 * For each sub-interval in [1, 4), use an 8-bit approximation r to reciprocal
 * square root. To speed up subsequent Newton-Raphson iterations, each entry in
 * the table combines two pieces of information: The least-significant 10 bits
 * store 3*r, the most-significant 22 bits store r**3, rounded from 24 down to
 * 22 bits such that accuracy is optimized.
 */
uint32_t rsqrt_tab [96] = 
{
    0xfa0bdefa, 0xee6af6ee, 0xe5effae5, 0xdaf27ad9,
    0xd2eff6d0, 0xc890aec4, 0xc10366bb, 0xb9a71ab2,
    0xb4da2eac, 0xadce7ea3, 0xa6f2b29a, 0xa279a694,
    0x9beb568b, 0x97a5c685, 0x9163027c, 0x8d4fd276,
    0x89501e70, 0x8563da6a, 0x818ac664, 0x7dc4fe5e,
    0x7a122258, 0x7671be52, 0x72e44a4c, 0x6f68fa46,
    0x6db22a43, 0x6a52623d, 0x67041a37, 0x65639634,
    0x622ffe2e, 0x609cba2b, 0x5d837e25, 0x5bfcfe22,
    0x58fd461c, 0x57838619, 0x560e1216, 0x53300a10,
    0x51c72e0d, 0x50621a0a, 0x4da48204, 0x4c4c2e01,
    0x4af789fe, 0x49a689fb, 0x485a11f8, 0x4710f9f5,
    0x45cc2df2, 0x448b4def, 0x421505e9, 0x40df5de6,
    0x3fadc5e3, 0x3e7fe1e0, 0x3d55c9dd, 0x3d55d9dd,
    0x3c2f41da, 0x39edd9d4, 0x39edc1d4, 0x38d281d1,
    0x37bae1ce, 0x36a6c1cb, 0x3595d5c8, 0x3488f1c5,
    0x3488fdc5, 0x337fbdc2, 0x3279ddbf, 0x317749bc,
    0x307831b9, 0x307879b9, 0x2f7d01b6, 0x2e84ddb3,
    0x2d9005b0, 0x2d9015b0, 0x2c9ec1ad, 0x2bb0a1aa,
    0x2bb0f5aa, 0x2ac615a7, 0x29ded1a4, 0x29dec9a4,
    0x28fabda1, 0x2819e99e, 0x2819ed9e, 0x273c3d9b,
    0x273c359b, 0x2661dd98, 0x258ad195, 0x258af195,
    0x24b71192, 0x24b6b192, 0x23e6058f, 0x2318118c,
    0x2318718c, 0x224da189, 0x224dd989, 0x21860d86,
    0x21862586, 0x20c19183, 0x20c1b183, 0x20001580
};

/* This function computes the reciprocal square root of its 16.16 fixed-point 
 * argument. After normalization of the argument if uses the most significant
 * bits of the argument for a table lookup to obtain an initial approximation 
 * accurate to 8 bits. This is followed by two Newton-Raphson iterations with
 * quadratic convergence. Finally, the result is denormalized and some simple
 * rounding is applied to maximize accuracy.
 *
 * To speed up the first NR iteration, for the initial 8-bit approximation r0
 * the lookup table supplies 3*r0 along with r0**3. A first iteration computes
 * a refined estimate r1 = 1.5 * r0 - x * r0**3. The second iteration computes
 * the final result as r2 = 0.5 * r1 * (3 - r1 * (r1 * x)).
 *
 * The accuracy for all arguments in [0x00000001, 0xffffffff] is as follows: 
 * 639 results are too small by one ulp, 1454 results are too big by one ulp.
 * A total of 2093 results deviate from the correctly rounded result.
 */
uint32_t fxrsqrt (uint32_t a)
{
    uint32_t s, r, t, scal;

    /* handle special case of zero input */
    if (a == 0) return ~a;
    /* normalize argument */
    scal = __clz (a) & 0xfffffffe;
    a = a << scal;
    /* initial approximation */
    t = rsqrt_tab [(a >> 25) - 32];
    /* first NR iteration */
    r = (t << 22) - __umulhi (t, a);
    /* second NR iteration */
    s = __umulhi (r, a);
    s = 0x30000000 - __umulhi (r, s);
    r = __umulhi (r, s);
    /* denormalize and round result */
    r = ((r >> (18 - (scal >> 1))) + 1) >> 1;
    return r;
}

/* reference implementation, 16.16 reciprocal square root of non-zero argment */
uint32_t ref_fxrsqrt (uint32_t a)
{
    double arg = a / 65536.0;
    double rsq = sqrt (1.0 / arg);
    uint32_t r = (uint32_t)(rsq * 65536.0 + 0.5);
    return r;
}

int main (void)
{
    uint32_t arg = 0x00000001;
    uint32_t res, ref;
    uint32_t err, lo = 0, hi = 0;

    do {
        res = fxrsqrt (arg);
        ref = ref_fxrsqrt (arg);

        err = 0;
        if (res < ref) {
            err = ref - res;
            lo++;
        }
        if (res > ref) {
            err = res - ref;
            hi++;
        }
        if (err > 1) {
            printf ("!!!! arg=%08x  res=%08x  ref=%08x\n", arg, res, ref);
            return EXIT_FAILURE;
        }
        arg++;
    } while (arg);
    printf ("results too low: %u  too high: %u  not correctly rounded: %u\n", 
            lo, hi, lo + hi);
    return EXIT_SUCCESS;
}

ARMv5TE processors provide a fast integer multiplier, and a "count leading zeros" instruction. They also typically come with moderately sized caches. Based on this, the most suitable approach for a high-performance implementation appears to be a table lookup for an initial approximation, followed by two Newton-Raphson iterations to achieve fully accurate results. We can speed up the first of these iterations further with additional pre-computation that is incorporated into the table, a technique used by Cray computers forty years ago.

The function fxrsqrt() below implements this approach. It starts out with an 8-bit approximation r to the reciprocal square root of the argument a, but instead of storing r, each table element stores 3r (in the lower ten bits of the 32-bit entry) and r³ (in the upper 22 bits of the 32-bit entry). This allows the quick computation of the first iteration as
r₁ = 0.5 * (3 * r - a * r³). The second iteration is then computed in the conventional way as r₂ = 0.5 * r₁ * (3 - r₁ * (r₁ * a)).

To be able to perform these computations accurately, regardless of the magnitude of the input, the argument a is normalized at the start of the computation, in essence representing it as a 2.32 fixed-point number multiplied with a scale factor of 2^scal. At the end of the computation the result is denormalized according to formula 1/sqrt(2²ⁿ) = 2^-n. By rounding up results whose most significant discarded bit is 1, accuracy is improved, resulting in almost all results being correctly rounded. The exhaustive test reports: results too low: 639 too high: 1454 not correctly rounded: 2093

The code makes use of two helper functions: __clz() determines the number of leading zero bits in a non-zero 32-bit argument. __umulhi() computes the 32 most significant bits of a full 64-bit product of two unsigned 32-bit integers. Both functions should be implemented either via compiler intrinsics, or by using a bit of inline assembly. In the code below I am showing portable implementations well suited to ARM CPUs along with inline assembly versions for x86 platforms. On ARMv5TE platforms __clz() should be mapped map to the CLZ instruction, and __umulhi() should be mapped to UMULL.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>

#define USE_OWN_INTRINSICS 1

#if USE_OWN_INTRINSICS
__forceinline int __clz (uint32_t a)
{
    int r;
    __asm__ ("bsrl %1,%0\n\t" : "=r"(r): "r"(a));
    return 31 - r;
}

uint32_t __umulhi (uint32_t a, uint32_t b)
{
    uint32_t r;
    __asm__ ("movl %1,%%eax\n\tmull %2\n\tmovl %%edx,%0\n\t"
             : "=r"(r) : "r"(a), "r"(b) : "eax", "edx");
    return r;
}
#else // USE_OWN_INTRINSICS
int __clz (uint32_t a)
{
    uint32_t r = 32;
    if (a >= 0x00010000) { a >>= 16; r -= 16; }
    if (a >= 0x00000100) { a >>=  8; r -=  8; }
    if (a >= 0x00000010) { a >>=  4; r -=  4; }
    if (a >= 0x00000004) { a >>=  2; r -=  2; }
    r -= a - (a & (a >> 1));
    return r;
}

uint32_t __umulhi (uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) >> 32);
}
#endif // USE_OWN_INTRINSICS

/*
 * For each sub-interval in [1, 4), use an 8-bit approximation r to reciprocal
 * square root. To speed up subsequent Newton-Raphson iterations, each entry in
 * the table combines two pieces of information: The least-significant 10 bits
 * store 3*r, the most-significant 22 bits store r**3, rounded from 24 down to
 * 22 bits such that accuracy is optimized.
 */
uint32_t rsqrt_tab [96] = 
{
    0xfa0bdefa, 0xee6af6ee, 0xe5effae5, 0xdaf27ad9,
    0xd2eff6d0, 0xc890aec4, 0xc10366bb, 0xb9a71ab2,
    0xb4da2eac, 0xadce7ea3, 0xa6f2b29a, 0xa279a694,
    0x9beb568b, 0x97a5c685, 0x9163027c, 0x8d4fd276,
    0x89501e70, 0x8563da6a, 0x818ac664, 0x7dc4fe5e,
    0x7a122258, 0x7671be52, 0x72e44a4c, 0x6f68fa46,
    0x6db22a43, 0x6a52623d, 0x67041a37, 0x65639634,
    0x622ffe2e, 0x609cba2b, 0x5d837e25, 0x5bfcfe22,
    0x58fd461c, 0x57838619, 0x560e1216, 0x53300a10,
    0x51c72e0d, 0x50621a0a, 0x4da48204, 0x4c4c2e01,
    0x4af789fe, 0x49a689fb, 0x485a11f8, 0x4710f9f5,
    0x45cc2df2, 0x448b4def, 0x421505e9, 0x40df5de6,
    0x3fadc5e3, 0x3e7fe1e0, 0x3d55c9dd, 0x3d55d9dd,
    0x3c2f41da, 0x39edd9d4, 0x39edc1d4, 0x38d281d1,
    0x37bae1ce, 0x36a6c1cb, 0x3595d5c8, 0x3488f1c5,
    0x3488fdc5, 0x337fbdc2, 0x3279ddbf, 0x317749bc,
    0x307831b9, 0x307879b9, 0x2f7d01b6, 0x2e84ddb3,
    0x2d9005b0, 0x2d9015b0, 0x2c9ec1ad, 0x2bb0a1aa,
    0x2bb0f5aa, 0x2ac615a7, 0x29ded1a4, 0x29dec9a4,
    0x28fabda1, 0x2819e99e, 0x2819ed9e, 0x273c3d9b,
    0x273c359b, 0x2661dd98, 0x258ad195, 0x258af195,
    0x24b71192, 0x24b6b192, 0x23e6058f, 0x2318118c,
    0x2318718c, 0x224da189, 0x224dd989, 0x21860d86,
    0x21862586, 0x20c19183, 0x20c1b183, 0x20001580
};

/* This function computes the reciprocal square root of its 16.16 fixed-point 
 * argument. After normalization of the argument if uses the most significant
 * bits of the argument for a table lookup to obtain an initial approximation 
 * accurate to 8 bits. This is followed by two Newton-Raphson iterations with
 * quadratic convergence. Finally, the result is denormalized and some simple
 * rounding is applied to maximize accuracy.
 *
 * To speed up the first NR iteration, for the initial 8-bit approximation r0
 * the lookup table supplies 3*r0 along with r0**3. A first iteration computes
 * a refined estimate r1 = 1.5 * r0 - x * r0**3. The second iteration computes
 * the final result as r2 = 0.5 * r1 * (3 - r1 * (r1 * x)).
 *
 * The accuracy for all arguments in [0x00000001, 0xffffffff] is as follows: 
 * 639 results are too small by one ulp, 1454 results are too big by one ulp.
 * A total of 2093 results deviate from the correctly rounded result.
 */
uint32_t fxrsqrt (uint32_t a)
{
    uint32_t s, r, t, scal;

    /* handle special case of zero input */
    if (a == 0) return ~a;
    /* normalize argument */
    scal = __clz (a) & 0xfffffffe;
    a = a << scal;
    /* initial approximation */
    t = rsqrt_tab [(a >> 25) - 32];
    /* first NR iteration */
    r = (t << 22) - __umulhi (t, a);
    /* second NR iteration */
    s = __umulhi (r, a);
    s = 0x30000000 - __umulhi (r, s);
    r = __umulhi (r, s);
    /* denormalize and round result */
    r = ((r >> (18 - (scal >> 1))) + 1) >> 1;
    return r;
}

/* reference implementation, 16.16 reciprocal square root of non-zero argment */
uint32_t ref_fxrsqrt (uint32_t a)
{
    double arg = a / 65536.0;
    double rsq = sqrt (1.0 / arg);
    uint32_t r = (uint32_t)(rsq * 65536.0 + 0.5);
    return r;
}

int main (void)
{
    uint32_t arg = 0x00000001;
    uint32_t res, ref;
    uint32_t err, lo = 0, hi = 0;

    do {
        res = fxrsqrt (arg);
        ref = ref_fxrsqrt (arg);

        err = 0;
        if (res < ref) {
            err = ref - res;
            lo++;
        }
        if (res > ref) {
            err = res - ref;
            hi++;
        }
        if (err > 1) {
            printf ("!!!! arg=%08x  res=%08x  ref=%08x\n", arg, res, ref);
            return EXIT_FAILURE;
        }
        arg++;
    } while (arg);
    printf ("results too low: %u  too high: %u  not correctly rounded: %u\n", 
            lo, hi, lo + hi);
    return EXIT_SUCCESS;
}

回复收藏 0 原文

伴随着你 2024-11-21 13:44:20

我有一个解决方案，我将其描述为“快速反平方根，但针对 32 位定点”。没有表格，没有参考资料，只是直接开门见山地猜测。

如果需要，请跳转到下面的源代码，但要注意一些事情。

(x * y)>>16 可以替换为您想要的任何定点乘法方案。
这不需要 64 位[长字]，我只是为了便于演示而使用它。长字用于防止乘法溢出。定点数学库将具有定点乘法函数，可以更好地处理这个问题。
最初的猜测相当不错，所以在第一个咒语中你就得到了相对准确的结果。
该代码比演示所需的更详细。
不能使用小于 65536 (<1) 和大于 32767<<16 的值。
如果您的硬件具有除法功能，这通常不会比使用平方根表和除法更快。如果没有，这可以避免分裂。

int fxisqrt(int input){

    if(input <= 65536){
        return 1;
    }

    long xSR = input>>1;
    long pushRight = input;
    long msb = 0;
    long shoffset = 0;
    long yIsqr = 0;
    long ysqr = 0;
    long fctrl = 0;
    long subthreehalf = 0;

    while(pushRight >= 65536){
        pushRight >>=1;
        msb++;
    }

    shoffset = (16 - ((msb)>>1));
    yIsqr = 1<<shoffset;
    //y = (y * (98304 - ( ( (x>>1) * ((y * y)>>16 ) )>>16 ) ) )>>16;   x2
    //Incantation 1
    ysqr = (yIsqr * yIsqr)>>16;
    fctrl = (xSR * ysqr)>>16;
    subthreehalf = 98304 - fctrl;
    yIsqr = (yIsqr * subthreehalf)>>16;
    //Incantation 2 - Increases precision greatly, but may not be neccessary
    ysqr = (yIsqr * yIsqr)>>16;
    fctrl = (xSR * ysqr)>>16;
    subthreehalf = 98304 - fctrl;
    yIsqr = (yIsqr * subthreehalf)>>16;
    return yIsqr;
}

I have a solution that I characterize as "fast inverse sqrt, but for 32bit fixed points". No table, no reference, just straight to the point with a good guess.

If you want, jump to the source code below, but beware of a few things.

(x * y)>>16 can be replaced with any fixed-point multiplication scheme you want.
This does not require 64-bit [long-words], I just use that for the ease of demonstration. Long words are used to prevent overflow in multiplication. A fixed-point math library will have fixed-point multiplication functions that handle this better.
The initial guess is pretty good, so you get relatively precise results in the first incantation.
The code is more verbose than needed for demonstration.
Values less than 65536 (<1) and greater than 32767<<16 cannot be used.
This is generally not faster than using a square root table and division if your hardware has a division function. If it does not, this avoids divisions.

int fxisqrt(int input){

    if(input <= 65536){
        return 1;
    }

    long xSR = input>>1;
    long pushRight = input;
    long msb = 0;
    long shoffset = 0;
    long yIsqr = 0;
    long ysqr = 0;
    long fctrl = 0;
    long subthreehalf = 0;

    while(pushRight >= 65536){
        pushRight >>=1;
        msb++;
    }

    shoffset = (16 - ((msb)>>1));
    yIsqr = 1<<shoffset;
    //y = (y * (98304 - ( ( (x>>1) * ((y * y)>>16 ) )>>16 ) ) )>>16;   x2
    //Incantation 1
    ysqr = (yIsqr * yIsqr)>>16;
    fctrl = (xSR * ysqr)>>16;
    subthreehalf = 98304 - fctrl;
    yIsqr = (yIsqr * subthreehalf)>>16;
    //Incantation 2 - Increases precision greatly, but may not be neccessary
    ysqr = (yIsqr * yIsqr)>>16;
    fctrl = (xSR * ysqr)>>16;
    subthreehalf = 98304 - fctrl;
    yIsqr = (yIsqr * subthreehalf)>>16;
    return yIsqr;
}

回复收藏 0 原文

~没有更多了~