当前位置：文江博客话题详情

如何改进小值的定点平方根

发布于 2024-12-24 02:54:20 字数 4583 浏览 6 评论 0 原文

我正在使用 Dobb 博士的文章“使用定点算术优化数学密集型应用”使用恒向线法。

当点之间的距离很大（大于几公里）时，这种方法效果很好，但在距离较小时，效果很差。最坏的情况是当两点相等或接近相等时，结果是 194 米的距离，而我需要在距离 >= 1 米时至少达到 1 米的精度。

通过与双精度浮点实现进行比较，我将问题定位到了 fixed::sqrt() 函数，该函数在小值时表现不佳：

x       std::sqrt(x)    fixed::sqrt(x)  error
----------------------------------------------------
0       0               3.05176e-005    3.05176e-005
1e-005  0.00316228      0.00316334      1.06005e-006
2e-005  0.00447214      0.00447226      1.19752e-007
3e-005  0.00547723      0.0054779       6.72248e-007
4e-005  0.00632456      0.00632477      2.12746e-007
5e-005  0.00707107      0.0070715       4.27244e-007
6e-005  0.00774597      0.0077467       7.2978e-007
7e-005  0.0083666       0.00836658      1.54875e-008
8e-005  0.00894427      0.00894427      1.085e-009

更正 fixed:: 的结果sqrt(0) 通过将其视为特殊情况而显得微不足道，但这无法解决较小的非零距离的问题，其中误差从 194 米开始，并随着距离的增加而收敛于零。我可能需要将精度至少提高一个数量级以达到零。

上面链接的文章的第 4 页简要解释了 fixed::sqrt() 算法，但我很难遵循它，更不用说确定是否可以改进它了。该函数的代码复制如下：

fixed fixed::sqrt() const
{
    unsigned const max_shift=62;
    uint64_t a_squared=1LL<<max_shift;
    unsigned b_shift=(max_shift+fixed_resolution_shift)/2;
    uint64_t a=1LL<<b_shift;

    uint64_t x=m_nVal;

    while(b_shift && a_squared>x)
    {
        a>>=1;
        a_squared>>=2;
        --b_shift;
    }

    uint64_t remainder=x-a_squared;
    --b_shift;

    while(remainder && b_shift)
    {
        uint64_t b_squared=1LL<<(2*b_shift-fixed_resolution_shift);
        int const two_a_b_shift=b_shift+1-fixed_resolution_shift;
        uint64_t two_a_b=(two_a_b_shift>0)?(a<<two_a_b_shift):(a>>-two_a_b_shift);

        while(b_shift && remainder<(b_squared+two_a_b))
        {
            b_squared>>=2;
            two_a_b>>=1;
            --b_shift;
        }
        uint64_t const delta=b_squared+two_a_b;
        if((2*remainder)>delta)
        {
            a+=(1LL<<b_shift);
            remainder-=delta;
            if(b_shift)
            {
                --b_shift;
            }
        }
    }
    return fixed(internal(),a);
}

请注意，m_nVal 是内部定点表示值，它是一个 int64_t 并且表示使用 Q36.28 格式 (fixed_resolution_shift = 28)。该表示本身具有至少 8 位小数的足够精度，并且赤道弧的分数对于 0.14 米左右的距离来说是好的，因此限制不是定点表示。

使用恒向线方法是标准机构对此应用程序的建议，因此无法更改，并且在任何情况下，该应用程序或未来应用程序的其他地方可能需要更准确的平方根函数。

问题：是否可以提高 fixed::sqrt() 算法对于小非零值的精度，同时仍然保持其有界和确定性收敛？

其他信息 用于生成上表的测试代码：

#include <cmath>
#include <iostream>
#include "fixed.hpp"

int main()
{
    double error = 1.0 ;
    for( double x = 0.0; error > 1e-8; x += 1e-5 )
    {
        double fixed_root = sqrt(fixed(x)).as_double() ;
        double std_root = std::sqrt(x) ;
        error = std::fabs(fixed_root - std_root) ;
        std::cout << x << '\t' << std_root << '\t' << fixed_root << '\t' << error << std::endl ;
    }
}

结论根据Justin Peel的解决方案和分析，并与“被忽视的定点算术艺术”，我对后者进行了如下调整：

fixed fixed::sqrt() const
{
    uint64_t a = 0 ;            // root accumulator
    uint64_t remHi = 0 ;        // high part of partial remainder
    uint64_t remLo = m_nVal ;   // low part of partial remainder
    uint64_t testDiv ;
    int count = 31 + (fixed_resolution_shift >> 1); // Loop counter
    do 
    {
        // get 2 bits of arg
        remHi = (remHi << 2) | (remLo >> 62); remLo <<= 2 ;

        // Get ready for the next bit in the root
        a <<= 1;   

        // Test radical
        testDiv = (a << 1) + 1;    
        if (remHi >= testDiv) 
        {
            remHi -= testDiv;
            a += 1;
        }

    } while (count-- != 0);

    return fixed(internal(),a);
}

虽然这提供了更高的精度，但我需要的改进并没有实现。仅 Q36.28 格式就可以提供我需要的精度，但不可能在不损失几位精度的情况下执行 sqrt()。然而，一些横向思维提供了更好的解决方案。我的应用程序根据某个距离限制测试计算出的距离。事后看来，相当明显的解决方案是测试距离的平方与极限的平方！

原文

I am using Anthony Williams' fixed point library described in the Dr Dobb's article "Optimizing Math-Intensive Applications with Fixed-Point Arithmetic" to calculate the distance between two geographical points using the Rhumb Line method.

This works well enough when the distance between the points is significant (greater than a few kilometers), but is very poor at smaller distances. The worst case being when the two points are equal or near equal, the result is a distance of 194 meters, while I need precision of at least 1 metre at distances >= 1 metre.

By comparison with a double precision floating-point implementation, I have located the problem to the fixed::sqrt() function, which performs poorly at small values:

x       std::sqrt(x)    fixed::sqrt(x)  error
----------------------------------------------------
0       0               3.05176e-005    3.05176e-005
1e-005  0.00316228      0.00316334      1.06005e-006
2e-005  0.00447214      0.00447226      1.19752e-007
3e-005  0.00547723      0.0054779       6.72248e-007
4e-005  0.00632456      0.00632477      2.12746e-007
5e-005  0.00707107      0.0070715       4.27244e-007
6e-005  0.00774597      0.0077467       7.2978e-007
7e-005  0.0083666       0.00836658      1.54875e-008
8e-005  0.00894427      0.00894427      1.085e-009

Correcting the result for fixed::sqrt(0) is trivial by treating it as a special case, but that will not solve the problem for small non-zero distances, where the error starts at 194 metres and converges toward zero with increasing distance. I probably need at least an order of maginitude improvement in precision toward zero.

The fixed::sqrt() algorithim is briefly explained on page 4 of the article linked above, but I am struggling to follow it let alone determine whether it is possible to improve it. The code for the function is reproduced below:

fixed fixed::sqrt() const
{
    unsigned const max_shift=62;
    uint64_t a_squared=1LL<<max_shift;
    unsigned b_shift=(max_shift+fixed_resolution_shift)/2;
    uint64_t a=1LL<<b_shift;

    uint64_t x=m_nVal;

    while(b_shift && a_squared>x)
    {
        a>>=1;
        a_squared>>=2;
        --b_shift;
    }

    uint64_t remainder=x-a_squared;
    --b_shift;

    while(remainder && b_shift)
    {
        uint64_t b_squared=1LL<<(2*b_shift-fixed_resolution_shift);
        int const two_a_b_shift=b_shift+1-fixed_resolution_shift;
        uint64_t two_a_b=(two_a_b_shift>0)?(a<<two_a_b_shift):(a>>-two_a_b_shift);

        while(b_shift && remainder<(b_squared+two_a_b))
        {
            b_squared>>=2;
            two_a_b>>=1;
            --b_shift;
        }
        uint64_t const delta=b_squared+two_a_b;
        if((2*remainder)>delta)
        {
            a+=(1LL<<b_shift);
            remainder-=delta;
            if(b_shift)
            {
                --b_shift;
            }
        }
    }
    return fixed(internal(),a);
}

Note that m_nVal is the internal fixed point representation value, it is an int64_t and the representation uses Q36.28 format (fixed_resolution_shift = 28). The representation itself has enough precision for at least 8 decimal places, and as a fraction of equatorial arc is good for distances of around 0.14 metres, so the limitation is not the fixed-point representation.

Use of the rhumb line method is a standards body recommendation for this application so cannot be changed, and in any case a more accurate square-root function is likely to be required elsewhere in the application or in future applications.

Question: Is it possible to improve the accuracy of the fixed::sqrt() algorithm for small non-zero values while still maintaining its bounded and deterministic convergence?

Additional Information
The test code used to generate the table above:

#include <cmath>
#include <iostream>
#include "fixed.hpp"

int main()
{
    double error = 1.0 ;
    for( double x = 0.0; error > 1e-8; x += 1e-5 )
    {
        double fixed_root = sqrt(fixed(x)).as_double() ;
        double std_root = std::sqrt(x) ;
        error = std::fabs(fixed_root - std_root) ;
        std::cout << x << '\t' << std_root << '\t' << fixed_root << '\t' << error << std::endl ;
    }
}

Conclusion
In the light of Justin Peel's solution and analysis, and comparison with the algorithm in "The Neglected Art of Fixed Point Arithmetic", I have adapted the latter as follows:

fixed fixed::sqrt() const
{
    uint64_t a = 0 ;            // root accumulator
    uint64_t remHi = 0 ;        // high part of partial remainder
    uint64_t remLo = m_nVal ;   // low part of partial remainder
    uint64_t testDiv ;
    int count = 31 + (fixed_resolution_shift >> 1); // Loop counter
    do 
    {
        // get 2 bits of arg
        remHi = (remHi << 2) | (remLo >> 62); remLo <<= 2 ;

        // Get ready for the next bit in the root
        a <<= 1;   

        // Test radical
        testDiv = (a << 1) + 1;    
        if (remHi >= testDiv) 
        {
            remHi -= testDiv;
            a += 1;
        }

    } while (count-- != 0);

    return fixed(internal(),a);
}

While this gives far greater precision, the improvement I needed is not to be achieved. The Q36.28 format alone just about provides the precision I need, but it is not possible to perform a sqrt() without loss of a few bits of precision. However some lateral thinking provides a better solution. My application tests the calculated distance against some distance limit. The rather obvious solution in hindsight is to test the square of the distance against the square of the limit!

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

新一帅帅 2024-12-31 02:54:20

给定 sqrt(ab) = sqrt(a)sqrt(b) ，那么你不能只捕获数字较小的情况并将其向上移动给定的位数，计算开根并将其向下移动一半的位数以获得结果？

即

 sqrt(n) = sqrt(n.2^k)/sqrt(2^k)
         = sqrt(n.2^k).2^(-k/2)

，对于任何小于 2^8 的 n，选择 k = 28。

Given that sqrt(ab) = sqrt(a)sqrt(b), then can't you just trap the case where your number is small and shift it up by a given number of bits, compute the root and shift that back down by half the number of bits to get the result?

I.e.

 sqrt(n) = sqrt(n.2^k)/sqrt(2^k)
         = sqrt(n.2^k).2^(-k/2)

E.g. Choose k = 28 for any n less than 2^8.

回复收藏 0 原文

还如梦归 2024-12-31 02:54:20

原来的实现显然存在一些问题。我对尝试用当前代码完成的方式修复所有这些问题感到沮丧，最终采用了不同的方法。我现在可能可以修复原来的版本，但无论如何我更喜欢我的方式。

我将输入数字视为在 Q64 中开始，这与先移 28，然后再移回 14（开方将其减半）相同。但是，如果您这样做，则精度将限制为 1/2^14 = 6.1035e-5，因为最后 14 位将为 0。为了解决这个问题，我将 a 和 < code>remainder 正确并继续填写数字，我再次执行循环。代码可以变得更高效、更简洁，但我会将其留给其他人。下面显示的精度几乎与 Q36.28 一样好。如果将输入数被定点截断后的定点 sqrt 与浮点 sqrt 进行比较（将其转换为定点并返回），则错误约为 2e-9（我没有在下面的代码，但需要修改一行）。这与 Q36.28 的最佳精度一致，即 1/2^28 = 3.7529e-9。

顺便说一句，原始代码中的一个大错误是从未考虑 m = 0 的项，因此永远无法设置该位。无论如何，这是代码。享受！

#include <iostream>
#include <cmath>

typedef unsigned long uint64_t;

uint64_t sqrt(uint64_t in_val)
{
    const uint64_t fixed_resolution_shift = 28;
    const unsigned max_shift=62;
    uint64_t a_squared=1ULL<<max_shift;
    unsigned b_shift=(max_shift>>1) + 1;
    uint64_t a=1ULL<<(b_shift - 1);

    uint64_t x=in_val;

    while(b_shift && a_squared>x)
    {
        a>>=1;
        a_squared>>=2;
        --b_shift;
    }

    uint64_t remainder=x-a_squared;
    --b_shift;

    while(remainder && b_shift)
    {
        uint64_t b_squared=1ULL<<(2*(b_shift - 1));
        uint64_t two_a_b=(a<<b_shift);

        while(b_shift && remainder<(b_squared+two_a_b))
        {
            b_squared>>=2;
            two_a_b>>=1;
            --b_shift;
        }
        uint64_t const delta=b_squared+two_a_b;
        if((remainder)>=delta && b_shift)
        {
            a+=(1ULL<<(b_shift - 1));
            remainder-=delta;
            --b_shift;
        }
    }
    a <<= (fixed_resolution_shift/2);
    b_shift = (fixed_resolution_shift/2) + 1;
    remainder <<= (fixed_resolution_shift);

    while(remainder && b_shift)
    {
        uint64_t b_squared=1ULL<<(2*(b_shift - 1));
        uint64_t two_a_b=(a<<b_shift);

        while(b_shift && remainder<(b_squared+two_a_b))
        {
            b_squared>>=2;
            two_a_b>>=1;
            --b_shift;
        }
        uint64_t const delta=b_squared+two_a_b;
        if((remainder)>=delta && b_shift)
        {
            a+=(1ULL<<(b_shift - 1));
            remainder-=delta;
            --b_shift;
        }
    }

    return a;
}

double fixed2float(uint64_t x)
{
    return static_cast<double>(x) * pow(2.0, -28.0);
}

uint64_t float2fixed(double f)
{
    return static_cast<uint64_t>(f * pow(2, 28.0));
}

void finderror(double num)
{
    double root1 = fixed2float(sqrt(float2fixed(num)));
    double root2 = pow(num, 0.5);
    std::cout << "input: " << num << ", fixed sqrt: " << root1 << " " << ", float sqrt: " << root2 << ", finderror: " << root2 - root1 << std::endl;
}

main()
{
    finderror(0);
    finderror(1e-5);
    finderror(2e-5);
    finderror(3e-5);
    finderror(4e-5);
    finderror(5e-5);
    finderror(pow(2.0,1));
    finderror(1ULL<<35);
}

程序的输出是

input: 0, fixed sqrt: 0 , float sqrt: 0, finderror: 0
input: 1e-05, fixed sqrt: 0.00316207 , float sqrt: 0.00316228, finderror: 2.10277e-07
input: 2e-05, fixed sqrt: 0.00447184 , float sqrt: 0.00447214, finderror: 2.97481e-07
input: 3e-05, fixed sqrt: 0.0054772 , float sqrt: 0.00547723, finderror: 2.43815e-08
input: 4e-05, fixed sqrt: 0.00632443 , float sqrt: 0.00632456, finderror: 1.26255e-07
input: 5e-05, fixed sqrt: 0.00707086 , float sqrt: 0.00707107, finderror: 2.06055e-07
input: 2, fixed sqrt: 1.41421 , float sqrt: 1.41421, finderror: 1.85149e-09
input: 3.43597e+10, fixed sqrt: 185364 , float sqrt: 185364, finderror: 2.24099e-09

The original implementation obviously has some problems. I became frustrated with trying to fix them all with the way the code is currently done and ended up going at it with a different approach. I could probably fix the original now, but I like my way better anyway.

I treat the input number as being in Q64 to start which is the same as shifting by 28 and then shifting back by 14 afterwards (the sqrt halves it). However, if you just do that, then the accuracy is limited to 1/2^14 = 6.1035e-5 because the last 14 bits will be 0. To remedy this, I then shift a and remainder correctly and to keep filling in digits I do the loop again. The code can be made more efficient and cleaner, but I'll leave that to someone else. The accuracy shown below is pretty much as good as you can get with Q36.28. If you compare the fixed point sqrt with the floating point sqrt of the input number after it has been truncated by fixed point(convert it to fixed point and back), then the errors are around 2e-9(I didn't do this in the code below, but it requires one line of change). This is right in line with the best accuracy for Q36.28 which is 1/2^28 = 3.7529e-9.

By the way, one big mistake in the original code is that the term where m = 0 is never considered so that bit can never be set. Anyway, here is the code. Enjoy!

#include <iostream>
#include <cmath>

typedef unsigned long uint64_t;

uint64_t sqrt(uint64_t in_val)
{
    const uint64_t fixed_resolution_shift = 28;
    const unsigned max_shift=62;
    uint64_t a_squared=1ULL<<max_shift;
    unsigned b_shift=(max_shift>>1) + 1;
    uint64_t a=1ULL<<(b_shift - 1);

    uint64_t x=in_val;

    while(b_shift && a_squared>x)
    {
        a>>=1;
        a_squared>>=2;
        --b_shift;
    }

    uint64_t remainder=x-a_squared;
    --b_shift;

    while(remainder && b_shift)
    {
        uint64_t b_squared=1ULL<<(2*(b_shift - 1));
        uint64_t two_a_b=(a<<b_shift);

        while(b_shift && remainder<(b_squared+two_a_b))
        {
            b_squared>>=2;
            two_a_b>>=1;
            --b_shift;
        }
        uint64_t const delta=b_squared+two_a_b;
        if((remainder)>=delta && b_shift)
        {
            a+=(1ULL<<(b_shift - 1));
            remainder-=delta;
            --b_shift;
        }
    }
    a <<= (fixed_resolution_shift/2);
    b_shift = (fixed_resolution_shift/2) + 1;
    remainder <<= (fixed_resolution_shift);

    while(remainder && b_shift)
    {
        uint64_t b_squared=1ULL<<(2*(b_shift - 1));
        uint64_t two_a_b=(a<<b_shift);

        while(b_shift && remainder<(b_squared+two_a_b))
        {
            b_squared>>=2;
            two_a_b>>=1;
            --b_shift;
        }
        uint64_t const delta=b_squared+two_a_b;
        if((remainder)>=delta && b_shift)
        {
            a+=(1ULL<<(b_shift - 1));
            remainder-=delta;
            --b_shift;
        }
    }

    return a;
}

double fixed2float(uint64_t x)
{
    return static_cast<double>(x) * pow(2.0, -28.0);
}

uint64_t float2fixed(double f)
{
    return static_cast<uint64_t>(f * pow(2, 28.0));
}

void finderror(double num)
{
    double root1 = fixed2float(sqrt(float2fixed(num)));
    double root2 = pow(num, 0.5);
    std::cout << "input: " << num << ", fixed sqrt: " << root1 << " " << ", float sqrt: " << root2 << ", finderror: " << root2 - root1 << std::endl;
}

main()
{
    finderror(0);
    finderror(1e-5);
    finderror(2e-5);
    finderror(3e-5);
    finderror(4e-5);
    finderror(5e-5);
    finderror(pow(2.0,1));
    finderror(1ULL<<35);
}

with the output of the program being

input: 0, fixed sqrt: 0 , float sqrt: 0, finderror: 0
input: 1e-05, fixed sqrt: 0.00316207 , float sqrt: 0.00316228, finderror: 2.10277e-07
input: 2e-05, fixed sqrt: 0.00447184 , float sqrt: 0.00447214, finderror: 2.97481e-07
input: 3e-05, fixed sqrt: 0.0054772 , float sqrt: 0.00547723, finderror: 2.43815e-08
input: 4e-05, fixed sqrt: 0.00632443 , float sqrt: 0.00632456, finderror: 1.26255e-07
input: 5e-05, fixed sqrt: 0.00707086 , float sqrt: 0.00707107, finderror: 2.06055e-07
input: 2, fixed sqrt: 1.41421 , float sqrt: 1.41421, finderror: 1.85149e-09
input: 3.43597e+10, fixed sqrt: 185364 , float sqrt: 185364, finderror: 2.24099e-09

回复收藏 0 原文

嘦怹 2024-12-31 02:54:20

我不确定您如何从表中显示的 fixed::sqrt() 获取数字。

这就是我所做的：

#include <stdio.h>
#include <math.h>

#define __int64 long long // gcc doesn't know __int64
typedef __int64 fixed;

#define FRACT 28

#define DBL2FIX(x) ((fixed)((double)(x) * (1LL << FRACT)))
#define FIX2DBL(x) ((double)(x) / (1LL << FRACT))

// De-++-ified code from
// http://www.justsoftwaresolutions.co.uk/news/optimizing-applications-with-fixed-point-arithmetic.html
fixed sqrtfix0(fixed num)
{
    static unsigned const fixed_resolution_shift=FRACT;

    unsigned const max_shift=62;
    unsigned __int64 a_squared=1LL<<max_shift;
    unsigned b_shift=(max_shift+fixed_resolution_shift)/2;
    unsigned __int64 a=1LL<<b_shift;

    unsigned __int64 x=num;

    unsigned __int64 remainder;

    while(b_shift && a_squared>x)
    {
        a>>=1;
        a_squared>>=2;
        --b_shift;
    }

    remainder=x-a_squared;
    --b_shift;

    while(remainder && b_shift)
    {
        unsigned __int64 b_squared=1LL<<(2*b_shift-fixed_resolution_shift);
        int const two_a_b_shift=b_shift+1-fixed_resolution_shift;
        unsigned __int64 two_a_b=(two_a_b_shift>0)?(a<<two_a_b_shift):(a>>-two_a_b_shift);
        unsigned __int64 delta;

        while(b_shift && remainder<(b_squared+two_a_b))
        {
            b_squared>>=2;
            two_a_b>>=1;
            --b_shift;
        }
        delta=b_squared+two_a_b;
        if((2*remainder)>delta)
        {
            a+=(1LL<<b_shift);
            remainder-=delta;
            if(b_shift)
            {
                --b_shift;
            }
        }
    }
    return (fixed)a;
}

// Adapted code from
// http://en.wikipedia.org/wiki/Methods_of_computing_square_roots#Digit-by-digit_calculation
fixed sqrtfix1(fixed num)
{
    fixed res = 0;
    fixed bit = (fixed)1 << 62; // The second-to-top bit is set
    int s = 0;

    // Scale num up to get more significant digits

    while (num && num < bit)
    {
        num <<= 1;
        s++;
    }

    if (s & 1)
    {
        num >>= 1;
        s--;
    }

    s = 14 - (s >> 1);

    while (bit != 0)
    {
        if (num >= res + bit)
        {
            num -= res + bit;
            res = (res >> 1) + bit;
        }
        else
        {
            res >>= 1;
        }

        bit >>= 2;
    }

    if (s >= 0) res <<= s;
    else res >>= -s;

    return res;
}

int main(void)
{
    double testData[] =
    {
        0,
        1e-005,
        2e-005,
        3e-005,
        4e-005,
        5e-005,
        6e-005,
        7e-005,
        8e-005,
    };
    int i;

    for (i = 0; i < sizeof(testData) / sizeof(testData[0]); i++)
    {
        double x = testData[i];
        fixed xf = DBL2FIX(x);

        fixed sqf0 = sqrtfix0(xf);
        fixed sqf1 = sqrtfix1(xf);

        double sq0 = FIX2DBL(sqf0);
        double sq1 = FIX2DBL(sqf1);

        printf("%10.8f:  "
               "sqrtfix0()=%10.8f / err=%e  "
               "sqrt()=%10.8f  "
               "sqrtfix1()=%10.8f / err=%e\n",
               x,
               sq0, fabs(sq0 - sqrt(x)),
               sqrt(x),
               sq1, fabs(sq1 - sqrt(x)));
    }

    printf("sizeof(double)=%d\n", (int)sizeof(double));

    return 0;
}

这是我得到的（使用 gcc 和 Open Watcom）：

0.00000000:  sqrtfix0()=0.00003052 / err=3.051758e-05  sqrt()=0.00000000  sqrtfix1()=0.00000000 / err=0.000000e+00
0.00001000:  sqrtfix0()=0.00311279 / err=4.948469e-05  sqrt()=0.00316228  sqrtfix1()=0.00316207 / err=2.102766e-07
0.00002000:  sqrtfix0()=0.00445557 / err=1.656955e-05  sqrt()=0.00447214  sqrtfix1()=0.00447184 / err=2.974807e-07
0.00003000:  sqrtfix0()=0.00543213 / err=4.509667e-05  sqrt()=0.00547723  sqrtfix1()=0.00547720 / err=2.438148e-08
0.00004000:  sqrtfix0()=0.00628662 / err=3.793423e-05  sqrt()=0.00632456  sqrtfix1()=0.00632443 / err=1.262553e-07
0.00005000:  sqrtfix0()=0.00701904 / err=5.202484e-05  sqrt()=0.00707107  sqrtfix1()=0.00707086 / err=2.060551e-07
0.00006000:  sqrtfix0()=0.00772095 / err=2.501943e-05  sqrt()=0.00774597  sqrtfix1()=0.00774593 / err=3.390476e-08
0.00007000:  sqrtfix0()=0.00836182 / err=4.783859e-06  sqrt()=0.00836660  sqrtfix1()=0.00836649 / err=1.086198e-07
0.00008000:  sqrtfix0()=0.00894165 / err=2.621519e-06  sqrt()=0.00894427  sqrtfix1()=0.00894409 / err=1.777289e-07
sizeof(double)=8

编辑：

我错过了上面的 sqrtfix1() 不起作用的事实很好的大论点。它可以通过向参数附加 28 个零并本质上计算其精确的整数平方根来修复。这是以 128 位算术进行内部计算为代价的，但它非常简单：

fixed sqrtfix2(fixed num)
{
    unsigned __int64 numl, numh;
    unsigned __int64 resl = 0, resh = 0;
    unsigned __int64 bitl = 0, bith = (unsigned __int64)1 << 26;

    numl = num << 28;
    numh = num >> (64 - 28);

    while (bitl | bith)
    {
        unsigned __int64 tmpl = resl + bitl;
        unsigned __int64 tmph = resh + bith + (tmpl < resl);

        tmph = numh - tmph - (numl < tmpl);
        tmpl = numl - tmpl;

        if (tmph & 0x8000000000000000ULL)
        {
            resl >>= 1;
            if (resh & 1) resl |= 0x8000000000000000ULL;
            resh >>= 1;
        }
        else
        {
            numl = tmpl;
            numh = tmph;

            resl >>= 1;
            if (resh & 1) resl |= 0x8000000000000000ULL;
            resh >>= 1;

            resh += bith + (resl + bitl < resl);
            resl += bitl;
        }

        bitl >>= 2;
        if (bith & 1) bitl |= 0x4000000000000000ULL;
        if (bith & 2) bitl |= 0x8000000000000000ULL;
        bith >>= 2;
    }

    return resl;
}

并且它给出的结果几乎与这个答案：

0.00000000:  sqrtfix0()=0.00003052 / err=3.051758e-05  sqrt()=0.00000000  sqrtfix2()=0.00000000 / err=0.000000e+00
0.00001000:  sqrtfix0()=0.00311279 / err=4.948469e-05  sqrt()=0.00316228  sqrtfix2()=0.00316207 / err=2.102766e-07
0.00002000:  sqrtfix0()=0.00445557 / err=1.656955e-05  sqrt()=0.00447214  sqrtfix2()=0.00447184 / err=2.974807e-07
0.00003000:  sqrtfix0()=0.00543213 / err=4.509667e-05  sqrt()=0.00547723  sqrtfix2()=0.00547720 / err=2.438148e-08
0.00004000:  sqrtfix0()=0.00628662 / err=3.793423e-05  sqrt()=0.00632456  sqrtfix2()=0.00632443 / err=1.262553e-07
0.00005000:  sqrtfix0()=0.00701904 / err=5.202484e-05  sqrt()=0.00707107  sqrtfix2()=0.00707086 / err=2.060551e-07
0.00006000:  sqrtfix0()=0.00772095 / err=2.501943e-05  sqrt()=0.00774597  sqrtfix2()=0.00774593 / err=3.390476e-08
0.00007000:  sqrtfix0()=0.00836182 / err=4.783859e-06  sqrt()=0.00836660  sqrtfix2()=0.00836649 / err=1.086198e-07
0.00008000:  sqrtfix0()=0.00894165 / err=2.621519e-06  sqrt()=0.00894427  sqrtfix2()=0.00894409 / err=1.777289e-07
2.00000000:  sqrtfix0()=1.41419983 / err=1.373327e-05  sqrt()=1.41421356  sqrtfix2()=1.41421356 / err=1.851493e-09
34359700000.00000000:  sqrtfix0()=185363.69654846 / err=5.097361e-06  sqrt()=185363.69655356  sqrtfix2()=185363.69655356 / err=1
.164153e-09

I'm not sure how you're getting the numbers from fixed::sqrt() shown in the table.

Here's what I do:

#include <stdio.h>
#include <math.h>

#define __int64 long long // gcc doesn't know __int64
typedef __int64 fixed;

#define FRACT 28

#define DBL2FIX(x) ((fixed)((double)(x) * (1LL << FRACT)))
#define FIX2DBL(x) ((double)(x) / (1LL << FRACT))

// De-++-ified code from
// http://www.justsoftwaresolutions.co.uk/news/optimizing-applications-with-fixed-point-arithmetic.html
fixed sqrtfix0(fixed num)
{
    static unsigned const fixed_resolution_shift=FRACT;

    unsigned const max_shift=62;
    unsigned __int64 a_squared=1LL<<max_shift;
    unsigned b_shift=(max_shift+fixed_resolution_shift)/2;
    unsigned __int64 a=1LL<<b_shift;

    unsigned __int64 x=num;

    unsigned __int64 remainder;

    while(b_shift && a_squared>x)
    {
        a>>=1;
        a_squared>>=2;
        --b_shift;
    }

    remainder=x-a_squared;
    --b_shift;

    while(remainder && b_shift)
    {
        unsigned __int64 b_squared=1LL<<(2*b_shift-fixed_resolution_shift);
        int const two_a_b_shift=b_shift+1-fixed_resolution_shift;
        unsigned __int64 two_a_b=(two_a_b_shift>0)?(a<<two_a_b_shift):(a>>-two_a_b_shift);
        unsigned __int64 delta;

        while(b_shift && remainder<(b_squared+two_a_b))
        {
            b_squared>>=2;
            two_a_b>>=1;
            --b_shift;
        }
        delta=b_squared+two_a_b;
        if((2*remainder)>delta)
        {
            a+=(1LL<<b_shift);
            remainder-=delta;
            if(b_shift)
            {
                --b_shift;
            }
        }
    }
    return (fixed)a;
}

// Adapted code from
// http://en.wikipedia.org/wiki/Methods_of_computing_square_roots#Digit-by-digit_calculation
fixed sqrtfix1(fixed num)
{
    fixed res = 0;
    fixed bit = (fixed)1 << 62; // The second-to-top bit is set
    int s = 0;

    // Scale num up to get more significant digits

    while (num && num < bit)
    {
        num <<= 1;
        s++;
    }

    if (s & 1)
    {
        num >>= 1;
        s--;
    }

    s = 14 - (s >> 1);

    while (bit != 0)
    {
        if (num >= res + bit)
        {
            num -= res + bit;
            res = (res >> 1) + bit;
        }
        else
        {
            res >>= 1;
        }

        bit >>= 2;
    }

    if (s >= 0) res <<= s;
    else res >>= -s;

    return res;
}

int main(void)
{
    double testData[] =
    {
        0,
        1e-005,
        2e-005,
        3e-005,
        4e-005,
        5e-005,
        6e-005,
        7e-005,
        8e-005,
    };
    int i;

    for (i = 0; i < sizeof(testData) / sizeof(testData[0]); i++)
    {
        double x = testData[i];
        fixed xf = DBL2FIX(x);

        fixed sqf0 = sqrtfix0(xf);
        fixed sqf1 = sqrtfix1(xf);

        double sq0 = FIX2DBL(sqf0);
        double sq1 = FIX2DBL(sqf1);

        printf("%10.8f:  "
               "sqrtfix0()=%10.8f / err=%e  "
               "sqrt()=%10.8f  "
               "sqrtfix1()=%10.8f / err=%e\n",
               x,
               sq0, fabs(sq0 - sqrt(x)),
               sqrt(x),
               sq1, fabs(sq1 - sqrt(x)));
    }

    printf("sizeof(double)=%d\n", (int)sizeof(double));

    return 0;
}

And here's what I get (with gcc and Open Watcom):

0.00000000:  sqrtfix0()=0.00003052 / err=3.051758e-05  sqrt()=0.00000000  sqrtfix1()=0.00000000 / err=0.000000e+00
0.00001000:  sqrtfix0()=0.00311279 / err=4.948469e-05  sqrt()=0.00316228  sqrtfix1()=0.00316207 / err=2.102766e-07
0.00002000:  sqrtfix0()=0.00445557 / err=1.656955e-05  sqrt()=0.00447214  sqrtfix1()=0.00447184 / err=2.974807e-07
0.00003000:  sqrtfix0()=0.00543213 / err=4.509667e-05  sqrt()=0.00547723  sqrtfix1()=0.00547720 / err=2.438148e-08
0.00004000:  sqrtfix0()=0.00628662 / err=3.793423e-05  sqrt()=0.00632456  sqrtfix1()=0.00632443 / err=1.262553e-07
0.00005000:  sqrtfix0()=0.00701904 / err=5.202484e-05  sqrt()=0.00707107  sqrtfix1()=0.00707086 / err=2.060551e-07
0.00006000:  sqrtfix0()=0.00772095 / err=2.501943e-05  sqrt()=0.00774597  sqrtfix1()=0.00774593 / err=3.390476e-08
0.00007000:  sqrtfix0()=0.00836182 / err=4.783859e-06  sqrt()=0.00836660  sqrtfix1()=0.00836649 / err=1.086198e-07
0.00008000:  sqrtfix0()=0.00894165 / err=2.621519e-06  sqrt()=0.00894427  sqrtfix1()=0.00894409 / err=1.777289e-07
sizeof(double)=8

EDIT:

I've missed the fact that the above sqrtfix1() won't work well with large arguments. It can be fixed by appending 28 zeroes to the argument and essentially calculating the exact integer square root of that. This comes at the expense of doing internal calculations in 128-bit arithmetic, but it's pretty straightforward:

fixed sqrtfix2(fixed num)
{
    unsigned __int64 numl, numh;
    unsigned __int64 resl = 0, resh = 0;
    unsigned __int64 bitl = 0, bith = (unsigned __int64)1 << 26;

    numl = num << 28;
    numh = num >> (64 - 28);

    while (bitl | bith)
    {
        unsigned __int64 tmpl = resl + bitl;
        unsigned __int64 tmph = resh + bith + (tmpl < resl);

        tmph = numh - tmph - (numl < tmpl);
        tmpl = numl - tmpl;

        if (tmph & 0x8000000000000000ULL)
        {
            resl >>= 1;
            if (resh & 1) resl |= 0x8000000000000000ULL;
            resh >>= 1;
        }
        else
        {
            numl = tmpl;
            numh = tmph;

            resl >>= 1;
            if (resh & 1) resl |= 0x8000000000000000ULL;
            resh >>= 1;

            resh += bith + (resl + bitl < resl);
            resl += bitl;
        }

        bitl >>= 2;
        if (bith & 1) bitl |= 0x4000000000000000ULL;
        if (bith & 2) bitl |= 0x8000000000000000ULL;
        bith >>= 2;
    }

    return resl;
}

And it gives pretty much the same results (slightly better for 3.43597e+10) than this answer:

0.00000000:  sqrtfix0()=0.00003052 / err=3.051758e-05  sqrt()=0.00000000  sqrtfix2()=0.00000000 / err=0.000000e+00
0.00001000:  sqrtfix0()=0.00311279 / err=4.948469e-05  sqrt()=0.00316228  sqrtfix2()=0.00316207 / err=2.102766e-07
0.00002000:  sqrtfix0()=0.00445557 / err=1.656955e-05  sqrt()=0.00447214  sqrtfix2()=0.00447184 / err=2.974807e-07
0.00003000:  sqrtfix0()=0.00543213 / err=4.509667e-05  sqrt()=0.00547723  sqrtfix2()=0.00547720 / err=2.438148e-08
0.00004000:  sqrtfix0()=0.00628662 / err=3.793423e-05  sqrt()=0.00632456  sqrtfix2()=0.00632443 / err=1.262553e-07
0.00005000:  sqrtfix0()=0.00701904 / err=5.202484e-05  sqrt()=0.00707107  sqrtfix2()=0.00707086 / err=2.060551e-07
0.00006000:  sqrtfix0()=0.00772095 / err=2.501943e-05  sqrt()=0.00774597  sqrtfix2()=0.00774593 / err=3.390476e-08
0.00007000:  sqrtfix0()=0.00836182 / err=4.783859e-06  sqrt()=0.00836660  sqrtfix2()=0.00836649 / err=1.086198e-07
0.00008000:  sqrtfix0()=0.00894165 / err=2.621519e-06  sqrt()=0.00894427  sqrtfix2()=0.00894409 / err=1.777289e-07
2.00000000:  sqrtfix0()=1.41419983 / err=1.373327e-05  sqrt()=1.41421356  sqrtfix2()=1.41421356 / err=1.851493e-09
34359700000.00000000:  sqrtfix0()=185363.69654846 / err=5.097361e-06  sqrt()=185363.69655356  sqrtfix2()=185363.69655356 / err=1
.164153e-09

回复收藏 0 原文

温柔女人霸气范 2024-12-31 02:54:20

许多年前，我为我们公司建造的一台小型计算机开发了一个演示程序。计算机有一个内置的平方根指令，我们构建了一个简单的程序来演示计算机在 TTY 上执行 16 位加/减/乘/除/平方根。唉，事实证明平方根指令有一个严重的错误，但我们已经承诺演示该功能。因此，我们创建了一个由 1-255 值的平方组成的数组，然后使用简单的查找将输入的值与其中一个数组值相匹配。该指数是平方根。