当前位置：文江博客话题详情

为 ARM Thumb2 寻找高效的整数平方根算法

发布于 2024-07-26 09:05:26 字数 112 浏览 8 评论 0原文

我正在寻找一种快速的纯整数算法来查找无符号整数的平方根（其整数部分）。该代码必须在 ARM Thumb 2 处理器上具有出色的性能。它可以是汇编语言或C 代码。

欢迎任何提示。

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

ぃ弥猫深巷。 2024-08-02 09:05:27

整数平方根作者：Jack W.克伦肖可以作为另一个参考。

C 代码片段存档也有一个整数平方根实现。这不仅仅是整数结果，还计算答案的额外小数（定点）位。（更新：不幸的是，C 代码片段存档现已失效。该链接指向该页面的 Web 存档。）这是来自 C 代码片段存档的代码：

#define BITSPERLONG 32
#define TOP2BITS(x) ((x & (3L << (BITSPERLONG-2))) >> (BITSPERLONG-2))

struct int_sqrt {
    unsigned sqrt, frac;
};

/* usqrt:
    ENTRY x: unsigned long
    EXIT  returns floor(sqrt(x) * pow(2, BITSPERLONG/2))

    Since the square root never uses more than half the bits
    of the input, we use the other half of the bits to contain
    extra bits of precision after the binary point.

    EXAMPLE
        suppose BITSPERLONG = 32
        then    usqrt(144) = 786432 = 12 * 65536
                usqrt(32) = 370727 = 5.66 * 65536

    NOTES
        (1) change BITSPERLONG to BITSPERLONG/2 if you do not want
            the answer scaled.  Indeed, if you want n bits of
            precision after the binary point, use BITSPERLONG/2+n.
            The code assumes that BITSPERLONG is even.
        (2) This is really better off being written in assembly.
            The line marked below is really a "arithmetic shift left"
            on the double-long value with r in the upper half
            and x in the lower half.  This operation is typically
            expressible in only one or two assembly instructions.
        (3) Unrolling this loop is probably not a bad idea.

    ALGORITHM
        The calculations are the base-two analogue of the square
        root algorithm we all learned in grammar school.  Since we're
        in base 2, there is only one nontrivial trial multiplier.

        Notice that absolutely no multiplications or divisions are performed.
        This means it'll be fast on a wide range of processors.
*/

void usqrt(unsigned long x, struct int_sqrt *q)
{
      unsigned long a = 0L;                   /* accumulator      */
      unsigned long r = 0L;                   /* remainder        */
      unsigned long e = 0L;                   /* trial product    */

      int i;

      for (i = 0; i < BITSPERLONG; i++)   /* NOTE 1 */
      {
            r = (r << 2) + TOP2BITS(x); x <<= 2; /* NOTE 2 */
            a <<= 1;
            e = (a << 1) + 1;
            if (r >= e)
            {
                  r -= e;
                  a++;
            }
      }
      memcpy(q, &a, sizeof(long));
}

我选择了以下代码。它本质上来自有关平方根计算方法的维基百科文章。但它已更改为使用 stdint.h 类型 uint32_t 等。严格来说，返回类型可以更改为 uint16_t。

/**
 * \brief    Fast Square root algorithm
 *
 * Fractional parts of the answer are discarded. That is:
 *      - SquareRoot(3) --> 1
 *      - SquareRoot(4) --> 2
 *      - SquareRoot(5) --> 2
 *      - SquareRoot(8) --> 2
 *      - SquareRoot(9) --> 3
 *
 * \param[in] a_nInput - unsigned integer for which to find the square root
 *
 * \return Integer square root of the input value.
 */
uint32_t SquareRoot(uint32_t a_nInput)
{
    uint32_t op  = a_nInput;
    uint32_t res = 0;
    uint32_t one = 1uL << 30; // The second-to-top bit is set: use 1u << 14 for uint16_t type; use 1uL<<30 for uint32_t type


    // "one" starts at the highest power of four <= than the argument.
    while (one > op)
    {
        one >>= 2;
    }

    while (one != 0)
    {
        if (op >= res + one)
        {
            op = op - (res + one);
            res = res +  2 * one;
        }
        res >>= 1;
        one >>= 2;
    }
    return res;
}

我发现，好的一点是，相当简单的修改就可以返回“四舍五入”的答案。我发现这在某些应用程序中很有用，可以提高准确性。请注意，在本例中，返回类型必须为 uint32_t，因为 2³² - 1 的舍入平方根为 2¹⁶ 。

/**
 * \brief    Fast Square root algorithm, with rounding
 *
 * This does arithmetic rounding of the result. That is, if the real answer
 * would have a fractional part of 0.5 or greater, the result is rounded up to
 * the next integer.
 *      - SquareRootRounded(2) --> 1
 *      - SquareRootRounded(3) --> 2
 *      - SquareRootRounded(4) --> 2
 *      - SquareRootRounded(6) --> 2
 *      - SquareRootRounded(7) --> 3
 *      - SquareRootRounded(8) --> 3
 *      - SquareRootRounded(9) --> 3
 *
 * \param[in] a_nInput - unsigned integer for which to find the square root
 *
 * \return Integer square root of the input value.
 */
uint32_t SquareRootRounded(uint32_t a_nInput)
{
    uint32_t op  = a_nInput;
    uint32_t res = 0;
    uint32_t one = 1uL << 30; // The second-to-top bit is set: use 1u << 14 for uint16_t type; use 1uL<<30 for uint32_t type


    // "one" starts at the highest power of four <= than the argument.
    while (one > op)
    {
        one >>= 2;
    }

    while (one != 0)
    {
        if (op >= res + one)
        {
            op = op - (res + one);
            res = res +  2 * one;
        }
        res >>= 1;
        one >>= 2;
    }

    /* Do arithmetic rounding to nearest integer */
    if (op > res)
    {
        res++;
    }

    return res;
}

Integer Square Roots by Jack W. Crenshaw could be useful as another reference.

The C Snippets Archive also has an integer square root implementation. This one goes beyond just the integer result, and calculates extra fractional (fixed-point) bits of the answer. (Update: unfortunately, the C snippets archive is now defunct. The link points to the web archive of the page.) Here is the code from the C Snippets Archive:

#define BITSPERLONG 32
#define TOP2BITS(x) ((x & (3L << (BITSPERLONG-2))) >> (BITSPERLONG-2))

struct int_sqrt {
    unsigned sqrt, frac;
};

/* usqrt:
    ENTRY x: unsigned long
    EXIT  returns floor(sqrt(x) * pow(2, BITSPERLONG/2))

    Since the square root never uses more than half the bits
    of the input, we use the other half of the bits to contain
    extra bits of precision after the binary point.

    EXAMPLE
        suppose BITSPERLONG = 32
        then    usqrt(144) = 786432 = 12 * 65536
                usqrt(32) = 370727 = 5.66 * 65536

    NOTES
        (1) change BITSPERLONG to BITSPERLONG/2 if you do not want
            the answer scaled.  Indeed, if you want n bits of
            precision after the binary point, use BITSPERLONG/2+n.
            The code assumes that BITSPERLONG is even.
        (2) This is really better off being written in assembly.
            The line marked below is really a "arithmetic shift left"
            on the double-long value with r in the upper half
            and x in the lower half.  This operation is typically
            expressible in only one or two assembly instructions.
        (3) Unrolling this loop is probably not a bad idea.

    ALGORITHM
        The calculations are the base-two analogue of the square
        root algorithm we all learned in grammar school.  Since we're
        in base 2, there is only one nontrivial trial multiplier.

        Notice that absolutely no multiplications or divisions are performed.
        This means it'll be fast on a wide range of processors.
*/

void usqrt(unsigned long x, struct int_sqrt *q)
{
      unsigned long a = 0L;                   /* accumulator      */
      unsigned long r = 0L;                   /* remainder        */
      unsigned long e = 0L;                   /* trial product    */

      int i;

      for (i = 0; i < BITSPERLONG; i++)   /* NOTE 1 */
      {
            r = (r << 2) + TOP2BITS(x); x <<= 2; /* NOTE 2 */
            a <<= 1;
            e = (a << 1) + 1;
            if (r >= e)
            {
                  r -= e;
                  a++;
            }
      }
      memcpy(q, &a, sizeof(long));
}

I settled on the following code. It's essentially from the Wikipedia article on square-root computing methods. But it has been changed to use stdint.h types uint32_t etc. Strictly speaking, the return type could be changed to uint16_t.

/**
 * \brief    Fast Square root algorithm
 *
 * Fractional parts of the answer are discarded. That is:
 *      - SquareRoot(3) --> 1
 *      - SquareRoot(4) --> 2
 *      - SquareRoot(5) --> 2
 *      - SquareRoot(8) --> 2
 *      - SquareRoot(9) --> 3
 *
 * \param[in] a_nInput - unsigned integer for which to find the square root
 *
 * \return Integer square root of the input value.
 */
uint32_t SquareRoot(uint32_t a_nInput)
{
    uint32_t op  = a_nInput;
    uint32_t res = 0;
    uint32_t one = 1uL << 30; // The second-to-top bit is set: use 1u << 14 for uint16_t type; use 1uL<<30 for uint32_t type


    // "one" starts at the highest power of four <= than the argument.
    while (one > op)
    {
        one >>= 2;
    }

    while (one != 0)
    {
        if (op >= res + one)
        {
            op = op - (res + one);
            res = res +  2 * one;
        }
        res >>= 1;
        one >>= 2;
    }
    return res;
}

The nice thing, I discovered, is that a fairly simple modification can return the "rounded" answer. I found this useful in a certain application for greater accuracy. Note that in this case, the return type must be uint32_t because the rounded square root of 2³² - 1 is 2¹⁶.

/**
 * \brief    Fast Square root algorithm, with rounding
 *
 * This does arithmetic rounding of the result. That is, if the real answer
 * would have a fractional part of 0.5 or greater, the result is rounded up to
 * the next integer.
 *      - SquareRootRounded(2) --> 1
 *      - SquareRootRounded(3) --> 2
 *      - SquareRootRounded(4) --> 2
 *      - SquareRootRounded(6) --> 2
 *      - SquareRootRounded(7) --> 3
 *      - SquareRootRounded(8) --> 3
 *      - SquareRootRounded(9) --> 3
 *
 * \param[in] a_nInput - unsigned integer for which to find the square root
 *
 * \return Integer square root of the input value.
 */
uint32_t SquareRootRounded(uint32_t a_nInput)
{
    uint32_t op  = a_nInput;
    uint32_t res = 0;
    uint32_t one = 1uL << 30; // The second-to-top bit is set: use 1u << 14 for uint16_t type; use 1uL<<30 for uint32_t type


    // "one" starts at the highest power of four <= than the argument.
    while (one > op)
    {
        one >>= 2;
    }

    while (one != 0)
    {
        if (op >= res + one)
        {
            op = op - (res + one);
            res = res +  2 * one;
        }
        res >>= 1;
        one >>= 2;
    }

    /* Do arithmetic rounding to nearest integer */
    if (op > res)
    {
        res++;
    }

    return res;
}

回复收藏 0 原文

顾北清歌寒 2024-08-02 09:05:27

如果不需要精确的精度，我可以为您提供一个快速近似值，它使用 260 字节的 RAM（您可以将其减半，但不要这样做）。

int ftbl[33]={0,1,1,2,2,4,5,8,11,16,22,32,45,64,90,128,181,256,362,512,724,1024,1448,2048,2896,4096,5792,8192,11585,16384,23170,32768,46340};
int ftbl2[32]={ 32768,33276,33776,34269,34755,35235,35708,36174,36635,37090,37540,37984,38423,38858,39287,39712,40132,40548,40960,41367,41771,42170,42566,42959,43347,43733,44115,44493,44869,45241,45611,45977};

int fisqrt(int val)
{
    int cnt=0;
    int t=val;
    while (t) {cnt++;t>>=1;}
    if (6>=cnt)    t=(val<<(6-cnt));
    else           t=(val>>(cnt-6));

    return (ftbl[cnt]*ftbl2[t&31])>>15;
}

以下是生成表格的代码：

ftbl[0]=0;
for (int i=0;i<32;i++) ftbl[i+1]=sqrt(pow(2.0,i));
printf("int ftbl[33]={0");
for (int i=0;i<32;i++) printf(",%d",ftbl[i+1]);
printf("};\n");

for (int i=0;i<32;i++) ftbl2[i]=sqrt(1.0+i/32.0)*32768;
printf("int ftbl2[32]={");
for (int i=0;i<32;i++) printf("%c%d",(i)?',':' ',ftbl2[i]);
printf("};\n");

在 1 → 2²⁰ 范围内，最大误差为 11，在 1 → 2³⁰ 范围内，大约为 256。可以使用更大的表并尽量减少这种情况。值得一提的是，错误总是负数——即当错误时，该值将小于正确值。

您最好在这之后进行精炼阶段。

这个想法很简单：(ab)^0.5 = a^0.b × b^0.5。

因此，我们采用输入 X = A×B，其中 A = 2^N 且 1 ≤ B < 2

然后我们有一个 sqrt(2^N) 的查找表和一个 sqrt(1 ≤ B < 2) 的查找表。我们将 sqrt(2^N) 的查找表存储为整数，这可能是一个错误（测试显示没有不良影响），并且我们将 sqrt(1 ≤ B < 2) 的查找表存储作为 15 位定点。

我们知道 1 ≤ sqrt(2^N) < 65536，所以这是16位，我们知道我们实际上只能在ARM上乘以16位×15位，而不用担心报复，所以这就是我们所做的。

在实现方面，while(t) {cnt++;t>>=1;} 实际上是一条计数前导位指令 (CLB)，因此如果您的芯片组版本具有该指令，你赢了！另外，移位指令很容易用双向移位器来实现，如果你有的话？

这里有一个 Lg[N] 算法用于计算最高设置位。

就幻数而言，对于更改表大小，ftbl2 的幻数是 32，但请注意 6 (Lg[32]+1) 用于移位。

If exact accuracy isn't required, I have a fast approximation for you, that uses 260 bytes of RAM (you could halve that, but don't).

int ftbl[33]={0,1,1,2,2,4,5,8,11,16,22,32,45,64,90,128,181,256,362,512,724,1024,1448,2048,2896,4096,5792,8192,11585,16384,23170,32768,46340};
int ftbl2[32]={ 32768,33276,33776,34269,34755,35235,35708,36174,36635,37090,37540,37984,38423,38858,39287,39712,40132,40548,40960,41367,41771,42170,42566,42959,43347,43733,44115,44493,44869,45241,45611,45977};

int fisqrt(int val)
{
    int cnt=0;
    int t=val;
    while (t) {cnt++;t>>=1;}
    if (6>=cnt)    t=(val<<(6-cnt));
    else           t=(val>>(cnt-6));

    return (ftbl[cnt]*ftbl2[t&31])>>15;
}

Here's the code to generate the tables:

ftbl[0]=0;
for (int i=0;i<32;i++) ftbl[i+1]=sqrt(pow(2.0,i));
printf("int ftbl[33]={0");
for (int i=0;i<32;i++) printf(",%d",ftbl[i+1]);
printf("};\n");

for (int i=0;i<32;i++) ftbl2[i]=sqrt(1.0+i/32.0)*32768;
printf("int ftbl2[32]={");
for (int i=0;i<32;i++) printf("%c%d",(i)?',':' ',ftbl2[i]);
printf("};\n");

Over the range 1 → 2²⁰, the maximum error is 11, and over the range 1 → 2³⁰, it's about 256. You could use larger tables and minimise this. It's worth mentioning that the error will always be negative - i.e. when it's wrong, the value will be LESS than the correct value.

You might do well to follow this with a refining stage.

The idea is simple enough: (ab)^0.5 = a^0.b × b^0.5.

So, we take the input X = A×B where A = 2^N and 1 ≤ B < 2

Then we have a lookup table for sqrt(2^N), and a lookup table for sqrt(1 ≤ B < 2). We store the lookup table for sqrt(2^N) as integer, which might be a mistake (testing shows no ill effects), and we store the lookup table for sqrt(1 ≤ B < 2) as 15-bit fixed-point.

We know that 1 ≤ sqrt(2^N) < 65536, so that's 16-bit, and we know that we can really only multiply 16-bit × 15-bit on an ARM, without fear of reprisal, so that's what we do.

In terms of implementation, the while(t) {cnt++;t>>=1;} is effectively a count-leading-bits instruction (CLB), so if your version of the chipset has that, you're winning! Also, the shift instruction would be easy to implement with a bidirectional shifter, if you have one?

There's a Lg[N] algorithm for counting the highest set bit here.

In terms of magic numbers, for changing table sizes, THE magic number for ftbl2 is 32, though note that 6 (Lg[32]+1) is used for the shifting.

回复收藏 0 原文

空袭的梦i 2024-08-02 09:05:27

一种常见的方法是二分法。

hi = number
lo = 0
mid = ( hi + lo ) / 2
mid2 = mid*mid
while( lo < hi-1 and mid2 != number ) {
    if( mid2 < number ) {
        lo = mid
    else
        hi = mid
    mid = ( hi + lo ) / 2
    mid2 = mid*mid

类似的东西应该工作得相当好。它进行 log2(number) 测试，做
log2(number) 乘法和除法。由于除法是除以 2，因此您可以将其替换为 >>。

终止条件可能不准确，因此请务必测试各种整数，以确保除以 2 不会在两个偶数之间错误地振荡；它们的差异将超过 1。

One common approach is bisection.

hi = number
lo = 0
mid = ( hi + lo ) / 2
mid2 = mid*mid
while( lo < hi-1 and mid2 != number ) {
    if( mid2 < number ) {
        lo = mid
    else
        hi = mid
    mid = ( hi + lo ) / 2
    mid2 = mid*mid

Something like that should work reasonably well. It makes log2(number) tests, doing
log2(number) multiplies and divides. Since the divide is a divide by 2, you can replace it with a >>.

The terminating condition may not be spot on, so be sure to test a variety of integers to be sure that the division by 2 doesn't incorrectly oscillate between two even values; they would differ by more than 1.

回复收藏 0 原文

太阳男子 2024-08-02 09:05:27

我发现大多数算法都基于简单的想法，但以比必要的更复杂的方式实现。我从这里得到了这个想法： http://ww1.microchip.com /downloads/en/AppNotes/91040a.pdf（作者：Ross M. Fosler）并将其制作成一个非常短的 C 函数：

uint16_t int_sqrt32(uint32_t x)
{
    uint16_t res= 0;
    uint16_t add= 0x8000;   
    int i;
    for(i=0;i<16;i++)
    {
        uint16_t temp=res | add;
        uint32_t g2= (uint32_t)temp * temp;     
        if (x>=g2)
        {
            res=temp;           
        }
        add>>=1;
    }
    return res;
}

这在我的 blackfin 上编译为 5 个周期/位。我相信，如果您使用 for 循环而不是 while 循环，您编译的代码通常会更快，并且您会获得确定性时间的额外好处（尽管这在某种程度上取决于编译器如何优化 if 语句。）

I find that most algorithms are based on simple ideas, but are implemented in a way more complicated manner than necessary. I've taken the idea from here: http://ww1.microchip.com/downloads/en/AppNotes/91040a.pdf (by Ross M. Fosler) and made it into a very short C-function:

uint16_t int_sqrt32(uint32_t x)
{
    uint16_t res= 0;
    uint16_t add= 0x8000;   
    int i;
    for(i=0;i<16;i++)
    {
        uint16_t temp=res | add;
        uint32_t g2= (uint32_t)temp * temp;     
        if (x>=g2)
        {
            res=temp;           
        }
        add>>=1;
    }
    return res;
}

This compiles to 5 cycles/bit on my blackfin. I believe your compiled code will in general be faster if you use for loops instead of while loops, and you get the added benefit of deterministic time (although that to some extent depends on how your compiler optimizes the if statement.)

回复收藏 0 原文

似梦非梦 2024-08-02 09:05:27

这取决于 sqrt 函数的用法。我经常使用一些近似值来制作快速版本。例如，当我需要计算向量 : 的模时，

Module = SQRT( x^2 + y^2)

我使用 : ，

Module = MAX( x,y) + Min(x,y)/2

它可以用 3 或 4 条指令编码为：

If (x > y )
  Module  = x + y >> 1;
Else
   Module  = y + x >> 1;

It depends about the usage of the sqrt function. I often use some approx to make fast versions. For example, when I need to compute the module of vector :

Module = SQRT( x^2 + y^2)

I use :

Module = MAX( x,y) + Min(x,y)/2

Which can be coded in 3 or 4 instructions as:

If (x > y )
  Module  = x + y >> 1;
Else
   Module  = y + x >> 1;

回复收藏 0 原文

失退 2024-08-02 09:05:27

它速度不快，但又小又简单：

int isqrt(int n)
{
  int b = 0;

  while(n >= 0)
  {
    n = n - b;
    b = b + 1;
    n = n - b;
  }

  return b - 1;
}

It's not fast but it's small and simple:

int isqrt(int n)
{
  int b = 0;

  while(n >= 0)
  {
    n = n - b;
    b = b + 1;
    n = n - b;
  }

  return b - 1;
}

回复收藏 0 原文

浅紫色的梦幻 2024-08-02 09:05:27

我已经解决了类似于这篇维基百科文章中描述的二进制逐位算法的问题。

回复收藏 0 原文

戴着白色围巾的女孩 2024-08-02 09:05:27

我最近在 ARM Cortex-M3 (STM32F103CBT6) 上遇到了同样的任务，在搜索互联网后提出了以下解决方案。与此处提供的解决方案相比，它不是最快的，但它具有良好的精度（最大误差为 1，即整个 UI32 输入范围上的 LSB）和相对较好的速度（在 72 MHz ARM Cortex 上每秒约 1.3M 平方根） M3 或每个单根大约 55 个周期（包括函数调用）。

// FastIntSqrt is based on Wikipedia article:
// https://en.wikipedia.org/wiki/Methods_of_computing_square_roots
// Which involves Newton's method which gives the following iterative formula:
//
// X(n+1) = (X(n) + S/X(n))/2
//
// Thanks to ARM CLZ instruction (which counts how many bits in a number are
// zeros starting from the most significant one) we can very successfully
// choose the starting value, so just three iterations are enough to achieve
// maximum possible error of 1. The algorithm uses division, but fortunately
// it is fast enough here, so square root computation takes only about 50-55
// cycles with maximum compiler optimization.
uint32_t FastIntSqrt (uint32_t value)
{
    if (!value)
        return 0;

    uint32_t xn = 1 << ((32 - __CLZ (value))/2);
    xn = (xn + value/xn)/2;
    xn = (xn + value/xn)/2;
    xn = (xn + value/xn)/2;
    return xn;
}

我正在使用 IAR，它生成以下汇编代码：

        SECTION `.text`:CODE:NOROOT(1)
        THUMB
_Z11FastIntSqrtj:
        MOVS     R1,R0
        BNE.N    ??FastIntSqrt_0
        MOVS     R0,#+0
        BX       LR
??FastIntSqrt_0:
        CLZ      R0,R1
        RSB      R0,R0,#+32
        MOVS     R2,#+1
        LSRS     R0,R0,#+1
        LSL      R0,R2,R0
        UDIV     R3,R1,R0
        ADDS     R0,R3,R0
        LSRS     R0,R0,#+1
        UDIV     R2,R1,R0
        ADDS     R0,R2,R0
        LSRS     R0,R0,#+1
        UDIV     R1,R1,R0
        ADDS     R0,R1,R0
        LSRS     R0,R0,#+1
        BX       LR               ;; return

I recently encountered the same task on ARM Cortex-M3 (STM32F103CBT6) and after searching the Internet came up with the following solution. It's not the fastest comparing with solutions offered here, but it has good accuracy (maximum error is 1, i.e. LSB on the entire UI32 input range) and relatively good speed (about 1.3M square roots per second on a 72-MHz ARM Cortex-M3 or about 55 cycles per single root including the function call).

// FastIntSqrt is based on Wikipedia article:
// https://en.wikipedia.org/wiki/Methods_of_computing_square_roots
// Which involves Newton's method which gives the following iterative formula:
//
// X(n+1) = (X(n) + S/X(n))/2
//
// Thanks to ARM CLZ instruction (which counts how many bits in a number are
// zeros starting from the most significant one) we can very successfully
// choose the starting value, so just three iterations are enough to achieve
// maximum possible error of 1. The algorithm uses division, but fortunately
// it is fast enough here, so square root computation takes only about 50-55
// cycles with maximum compiler optimization.
uint32_t FastIntSqrt (uint32_t value)
{
    if (!value)
        return 0;

    uint32_t xn = 1 << ((32 - __CLZ (value))/2);
    xn = (xn + value/xn)/2;
    xn = (xn + value/xn)/2;
    xn = (xn + value/xn)/2;
    return xn;
}

I'm using IAR and it produces the following assembler code:

        SECTION `.text`:CODE:NOROOT(1)
        THUMB
_Z11FastIntSqrtj:
        MOVS     R1,R0
        BNE.N    ??FastIntSqrt_0
        MOVS     R0,#+0
        BX       LR
??FastIntSqrt_0:
        CLZ      R0,R1
        RSB      R0,R0,#+32
        MOVS     R2,#+1
        LSRS     R0,R0,#+1
        LSL      R0,R2,R0
        UDIV     R3,R1,R0
        ADDS     R0,R3,R0
        LSRS     R0,R0,#+1
        UDIV     R2,R1,R0
        ADDS     R0,R2,R0
        LSRS     R0,R0,#+1
        UDIV     R1,R1,R0
        ADDS     R0,R1,R0
        LSRS     R0,R0,#+1
        BX       LR               ;; return

回复收藏 0 原文

眼泪淡了忧伤 2024-08-02 09:05:27

ARM 编码最巧妙的按位整数平方根实现实现了每个结果位 3 个周期，对于 32 位无符号整数的平方根来说，其下限为 50 个周期。 Andrew N. Sloss、Dominic Symes、Chris Wright 的“ARM 系统开发人员指南”、Morgan Kaufman 2004 中给出了一个示例。

由于大多数 ARM 处理器还具有非常快的整数乘法器，并且大多数处理器甚至提供非常快速的宽乘法实现指令 UMULL 是一种可实现 35 至 45 个周期左右的执行时间的替代方法，它是使用定点计算通过倒数平方根 1/√x 进行计算。为此，有必要借助计数前导零指令对输入进行标准化，该指令在大多数 ARM 处理器上可用作指令 CLZ。

计算从由归一化参数的一些最高有效位索引的查找表中的初始低精度倒数平方根近似开始。通过二次收敛来精化数字 a 的倒数平方根 r 的牛顿-拉夫森迭代为 r_n+1 = r_n + r_n* (1 - a * r_n²) / 2. 这可以用代数形式重新排列方便的等效形式。在下面的示例性 C99 代码中，从 96 项查找表中读取 8 位近似值 r₀。该近似值精确到大约 7 位。第一次 Newton-Raphson 迭代计算 r₁ = (3 * r₀ - a * r₀³) / 2 潜在地利用小操作数乘法指令。第二次 Newton-Raphson 迭代计算 r₂ = (r₁ * (3 - r₁ * (r₁ > * a))) / 2.

然后通过反向乘法计算归一化平方根 s₂ = a * r₂ 并通过基于反归一化实现最终近似值原始参数 a 的前导零的计数。重要的是，期望结果 ⌊√a⌋ 是低估的近似值。通过保证余数 ⌊√a⌋ - s₂ * s₂ 为正，简化了对是否达到预期结果的检查。如果发现最终的近似值太小，则结果加一。通过针对“黄金”参考对所有可能的 2³² 输入进行详尽测试，可以轻松证明该算法的正确操作，只需几分钟。

通过预先计算 3 * r₀ 和 r₀³，可以加快这一实现速度，但会增加查找表的存储空间。简化第一次牛顿-拉夫森迭代。前者需要 10 位存储空间，后者需要 24 位存储空间。为了将每一对组合成一个 32 位数据项，立方体被舍入为 22 位，这在计算中引入的误差可以忽略不计。这会产生 96 * 4 = 384 字节的查找表。

另一种方法使用观察结果，即所有起始近似值都具有相同的最高有效位集，因此可以隐式假设并且不必存储。这允许将 9 位近似值 r₀ 压缩为 8 位数据项，并在查表后恢复前导位。对于 384 个条目的查找表，全部低估，可以达到约 7.5 位的精度。将反向乘法与倒数平方根的牛顿-拉夫森迭代相结合，计算出 s₀ = a * r₀, s₁ = s₀ + r₀ * (a - s₀ * s₀) / 2. 因为准确度起始近似值对于非常准确的最终平方根近似值而言不够高，最多可能相差三倍，并且必须根据余数下限 (sqrt (a)) - s₁ * s₁。

另一种方法的优点是所需的乘法次数减半，特别是仅需要一次宽乘法 UMULL。特别是宽乘法相当慢的处理器，这是值得尝试的替代方案。

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <math.h>
#if defined(_MSC_VER) && defined(_WIN64)
#include <intrin.h>
#endif // defined(_MSC_VER) && defined(_WIN64)

#define CLZ_BUILTIN (1)      // use compiler's built-in count-leading-zeros
#define CLZ_FPU     (2)      // emulate count-leading-zeros via FPU
#define CLZ_CPU     (3)      // emulate count-leading-zeros via CPU

#define ALT_IMPL    (0)      // alternative implementation with fewer multiplies
#define LARGE_TABLE (0)      // ALT_IMPL=0: incorporate 1st NR-iter into table
#define CLZ_IMPL    (CLZ_CPU)// choose count-leading-zeros implementation
#define GEN_TAB     (0)      // generate tables

uint32_t umul32_hi (uint32_t a, uint32_t b);
uint32_t float_as_uint32 (float a);
int clz32 (uint32_t a);

#if ALT_IMPL
uint8_t rsqrt_tab [384] = 
{
    0xfe, 0xfc, 0xfa, 0xf8, 0xf6, 0xf4, 0xf2, 0xf0, 0xee, 0xed, 0xeb, 0xe9,
    0xe7, 0xe6, 0xe4, 0xe2, 0xe1, 0xdf, 0xdd, 0xdc, 0xda, 0xd8, 0xd7, 0xd5,
    0xd4, 0xd2, 0xd1, 0xcf, 0xce, 0xcc, 0xcb, 0xc9, 0xc8, 0xc7, 0xc5, 0xc4,
    0xc2, 0xc1, 0xc0, 0xbe, 0xbd, 0xbc, 0xba, 0xb9, 0xb8, 0xb7, 0xb5, 0xb4,
    0xb3, 0xb2, 0xb0, 0xaf, 0xae, 0xad, 0xac, 0xab, 0xa9, 0xa8, 0xa7, 0xa6,
    0xa5, 0xa4, 0xa3, 0xa2, 0xa0, 0x9f, 0x9e, 0x9d, 0x9c, 0x9b, 0x9a, 0x99,
    0x98, 0x97, 0x96, 0x95, 0x94, 0x93, 0x92, 0x91, 0x90, 0x8f, 0x8e, 0x8d,
    0x8c, 0x8b, 0x8b, 0x8a, 0x89, 0x88, 0x87, 0x86, 0x85, 0x84, 0x83, 0x83,
    0x82, 0x81, 0x80, 0x7f, 0x7e, 0x7d, 0x7d, 0x7c, 0x7b, 0x7a, 0x79, 0x79,
    0x78, 0x77, 0x76, 0x75, 0x75, 0x74, 0x73, 0x72, 0x72, 0x71, 0x70, 0x6f,
    0x6f, 0x6e, 0x6d, 0x6c, 0x6c, 0x6b, 0x6a, 0x6a, 0x69, 0x68, 0x67, 0x67,
    0x66, 0x65, 0x65, 0x64, 0x63, 0x63, 0x62, 0x61, 0x61, 0x60, 0x5f, 0x5f,
    0x5e, 0x5d, 0x5d, 0x5c, 0x5c, 0x5b, 0x5a, 0x5a, 0x59, 0x58, 0x58, 0x57,
    0x57, 0x56, 0x55, 0x55, 0x54, 0x54, 0x53, 0x52, 0x52, 0x51, 0x51, 0x50,
    0x50, 0x4f, 0x4e, 0x4e, 0x4d, 0x4d, 0x4c, 0x4c, 0x4b, 0x4b, 0x4a, 0x4a,
    0x49, 0x48, 0x48, 0x47, 0x47, 0x46, 0x46, 0x45, 0x45, 0x44, 0x44, 0x43,
    0x43, 0x42, 0x42, 0x41, 0x41, 0x40, 0x40, 0x3f, 0x3f, 0x3e, 0x3e, 0x3d,
    0x3d, 0x3c, 0x3c, 0x3c, 0x3b, 0x3b, 0x3a, 0x3a, 0x39, 0x39, 0x38, 0x38,
    0x37, 0x37, 0x36, 0x36, 0x36, 0x35, 0x35, 0x34, 0x34, 0x33, 0x33, 0x33,
    0x32, 0x32, 0x31, 0x31, 0x30, 0x30, 0x30, 0x2f, 0x2f, 0x2e, 0x2e, 0x2d,
    0x2d, 0x2d, 0x2c, 0x2c, 0x2b, 0x2b, 0x2b, 0x2a, 0x2a, 0x29, 0x29, 0x29,
    0x28, 0x28, 0x27, 0x27, 0x27, 0x26, 0x26, 0x26, 0x25, 0x25, 0x24, 0x24,
    0x24, 0x23, 0x23, 0x23, 0x22, 0x22, 0x21, 0x21, 0x21, 0x20, 0x20, 0x20,
    0x1f, 0x1f, 0x1f, 0x1e, 0x1e, 0x1e, 0x1d, 0x1d, 0x1d, 0x1c, 0x1c, 0x1c,
    0x1b, 0x1b, 0x1a, 0x1a, 0x1a, 0x19, 0x19, 0x19, 0x18, 0x18, 0x18, 0x17,
    0x17, 0x17, 0x17, 0x16, 0x16, 0x16, 0x15, 0x15, 0x15, 0x14, 0x14, 0x14,
    0x13, 0x13, 0x13, 0x12, 0x12, 0x12, 0x11, 0x11, 0x11, 0x11, 0x10, 0x10,
    0x10, 0x0f, 0x0f, 0x0f, 0x0e, 0x0e, 0x0e, 0x0e, 0x0d, 0x0d, 0x0d, 0x0c,
    0x0c, 0x0c, 0x0c, 0x0b, 0x0b, 0x0b, 0x0a, 0x0a, 0x0a, 0x0a, 0x09, 0x09,
    0x09, 0x08, 0x08, 0x08, 0x08, 0x07, 0x07, 0x07, 0x07, 0x06, 0x06, 0x06,
    0x05, 0x05, 0x05, 0x05, 0x04, 0x04, 0x04, 0x04, 0x03, 0x03, 0x03, 0x03,
    0x02, 0x02, 0x02, 0x02, 0x01, 0x01, 0x01, 0x01, 0x00, 0x00, 0x00, 0x00,
};

/* compute floor (sqrt (a)) */
uint32_t my_isqrt32 (uint32_t a)
{
    uint32_t b, r, s, scal, rem;

    if (a == 0) return a;
    /* Normalize argument */
    scal = clz32 (a) & ~1;
    b = a << scal;
    /* Compute initial approximation to 1/sqrt(a) */
    r = rsqrt_tab [(b >> 23) - 128] | 0x100;
    /* Compute initial approximation to sqrt(a) */
    s = umul32_hi (b, r << 8); 
    /* Refine sqrt approximation */
    b = b - s * s;
    s = s + ((r * (b >> 2)) >> 23);
    /* Denormalize result*/
    s = s >> (scal >> 1);
    /* Ensure we got the floor correct */
    rem = a - s * s;
    if      (rem < (2 * s + 1)) s = s + 0;
    else if (rem < (4 * s + 4)) s = s + 1;
    else if (rem < (6 * s + 9)) s = s + 2;
    else                        s = s + 3;
    return s;
}

#else // ALT_IMPL

#if LARGE_TABLE
uint32_t rsqrt_tab [96] = 
{
    0xfa0bfafa, 0xee6b2aee, 0xe5f02ae5, 0xdaf26ed9, 0xd2f002d0, 0xc890c2c4,
    0xc1037abb, 0xb9a75ab2, 0xb4da42ac, 0xadcea2a3, 0xa6f27a9a, 0xa279c294,
    0x9beb4a8b, 0x97a5ca85, 0x9163427c, 0x8d4fca76, 0x89500270, 0x8563ba6a,
    0x818ac264, 0x7dc4ea5e, 0x7a120258, 0x7671da52, 0x72e4424c, 0x6f690a46,
    0x6db24243, 0x6a52423d, 0x67042637, 0x6563c234, 0x62302a2e, 0x609cea2b,
    0x5d836a25, 0x5bfd1a22, 0x58fd421c, 0x5783ae19, 0x560e4a16, 0x53300210,
    0x51c7120d, 0x50623a0a, 0x4da4c204, 0x4c4c1601, 0x4af769fe, 0x49a6b9fb,
    0x485a01f8, 0x471139f5, 0x45cc59f2, 0x448b5def, 0x4214fde9, 0x40df89e6,
    0x3fade1e3, 0x3e8001e0, 0x3d55e1dd, 0x3c2f79da, 0x3c2f79da, 0x3b0cc5d7,
    0x39edc1d4, 0x38d265d1, 0x37baa9ce, 0x36a689cb, 0x359601c8, 0x348909c5,
    0x348909c5, 0x337f99c2, 0x3279adbf, 0x317741bc, 0x30784db9, 0x30784db9,
    0x2f7cc9b6, 0x2e84b1b3, 0x2d9001b0, 0x2d9001b0, 0x2c9eb1ad, 0x2bb0b9aa,
    0x2bb0b9aa, 0x2ac615a7, 0x29dec1a4, 0x29dec1a4, 0x28fab5a1, 0x2819e99e,
    0x2819e99e, 0x273c599b, 0x273c599b, 0x26620198, 0x258ad995, 0x258ad995,
    0x24b6d992, 0x24b6d992, 0x23e5fd8f, 0x2318418c, 0x2318418c, 0x224d9d89,
    0x224d9d89, 0x21860986, 0x21860986, 0x20c18183, 0x20c18183, 0x20000180,
};
#else // LARGE_TABLE
uint8_t rsqrt_tab [96] = 
{
    0xfe, 0xfa, 0xf7, 0xf3, 0xf0, 0xec, 0xe9, 0xe6, 0xe4, 0xe1, 0xde, 0xdc,
    0xd9, 0xd7, 0xd4, 0xd2, 0xd0, 0xce, 0xcc, 0xca, 0xc8, 0xc6, 0xc4, 0xc2,
    0xc1, 0xbf, 0xbd, 0xbc, 0xba, 0xb9, 0xb7, 0xb6, 0xb4, 0xb3, 0xb2, 0xb0,
    0xaf, 0xae, 0xac, 0xab, 0xaa, 0xa9, 0xa8, 0xa7, 0xa6, 0xa5, 0xa3, 0xa2,
    0xa1, 0xa0, 0x9f, 0x9e, 0x9e, 0x9d, 0x9c, 0x9b, 0x9a, 0x99, 0x98, 0x97,
    0x97, 0x96, 0x95, 0x94, 0x93, 0x93, 0x92, 0x91, 0x90, 0x90, 0x8f, 0x8e,
    0x8e, 0x8d, 0x8c, 0x8c, 0x8b, 0x8a, 0x8a, 0x89, 0x89, 0x88, 0x87, 0x87,
    0x86, 0x86, 0x85, 0x84, 0x84, 0x83, 0x83, 0x82, 0x82, 0x81, 0x81, 0x80,
};
#endif //LARGE_TABLE 

/* compute floor (sqrt (a)) */
uint32_t my_isqrt32 (uint32_t a)
{
    uint32_t b, r, s, t, scal, rem;

    if (a == 0) return a;
    /* Normalize argument */
    scal = clz32 (a) & ~1;
    b = a << scal;
    /* Initial approximation to 1/sqrt(a)*/
    t = rsqrt_tab [(b >> 25) - 32];
    /* First NR iteration */
#if LARGE_TABLE
    r = (t << 22) - umul32_hi (b, t);
#else // LARGE_TABLE
    r = ((3 * t) << 22) - umul32_hi (b, (t * t * t) << 8);
#endif // LARGE_TABLE
    /* Second NR iteration */
    s = umul32_hi (r, b);
    s = 0x30000000 - umul32_hi (r, s);
    r = umul32_hi (r, s);
    /* Compute sqrt(a) = a * 1/sqrt(a). Adjust to ensure it's an underestimate*/
    r = umul32_hi (r, b) - 1;
    /* Denormalize result */
    r = r >> ((scal >> 1) + 11);
    /* Make sure we got the floor correct */
    rem = a - r * r;
    if (rem >= (2 * r + 1)) r++;
    return r;
}
#endif // ALT_IMPL

uint32_t umul32_hi (uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

uint32_t float_as_uint32 (float a)
{
    uint32_t r;
    memcpy (&r, &a, sizeof r);
    return r;
}

int clz32 (uint32_t a)
{
#if (CLZ_IMPL == CLZ_FPU)
    // Henry S. Warren, Jr, " Hacker's Delight 2nd ed", p. 105
    int n = 158 - (float_as_uint32 ((float)(int32_t)(a & ~(a >> 1))+.5f) >> 23);
    return (n < 0) ? 0 : n;
#elif (CLZ_IMPL == CLZ_CPU)
    static const uint8_t clz_tab[32] = {
        31, 22, 30, 21, 18, 10, 29,  2, 20, 17, 15, 13, 9,  6, 28, 1,
        23, 19, 11,  3, 16, 14,  7, 24, 12,  4,  8, 25, 5, 26, 27, 0
    };
    a |= a >> 16;
    a |= a >> 8;
    a |= a >> 4;
    a |= a >> 2;
    a |= a >> 1;
    return clz_tab [0x07c4acddu * a >> 27] + (!a);
#else // CLZ_IMPL == CLZ_BUILTIN
#if defined(_MSC_VER) && defined(_WIN64)
    return __lzcnt (a);
#else // defined(_MSC_VER) && defined(_WIN64)
    return __builtin_clz (a);
#endif // defined(_MSC_VER) && defined(_WIN64)
#endif // CLZ_IMPL
}

/* Henry S. Warren, Jr.,  "Hacker's Delight, 2nd e.d", p. 286 */
uint32_t ref_isqrt32 (uint32_t x)
{
    uint32_t m, y, b;
    m = 0x40000000U;
    y = 0U;
    while (m != 0) {
        b = y | m;
        y = y >> 1;
        if (x >= b) {
            x = x - b;
            y = y | m;
        }
        m = m >> 2;
    }
    return y;
}

#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
    LARGE_INTEGER t;
    static double oofreq;
    static int checkedForHighResTimer;
    static BOOL hasHighResTimer;

    if (!checkedForHighResTimer) {
        hasHighResTimer = QueryPerformanceFrequency (&t);
        oofreq = 1.0 / (double)t.QuadPart;
        checkedForHighResTimer = 1;
    }
    if (hasHighResTimer) {
        QueryPerformanceCounter (&t);
        return (double)t.QuadPart * oofreq;
    } else {
        return (double)GetTickCount() * 1.0e-3;
    }
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif

int main (void)
{
#if ALT_IMPL
    printf ("Alternate integer square root implementation\n");
#else // ALT_IMPL
#if LARGE_TABLE
    printf ("Integer square root implementation w/ large table\n");
#else // LARGE_TAB
    printf ("Integer square root implementation w/ small table\n");
#endif
#endif // ALT_IMPL

#if GEN_TAB
    printf ("Generating lookup table ...\n");
#if ALT_IMPL
      for (int i = 0; i < 384; i++) {
        double x = 1.0 + (i + 1) * 1.0 / 128;
        double y = 1.0 / sqrt (x);
        uint8_t val = (uint8_t)((y * 512) - 256);
        rsqrt_tab[i] = val;
        printf ("0x%02x, ", rsqrt_tab[i]);
        if (i % 12 == 11) printf("\n");
    }
#else // ALT_IMPL
    for (int i = 0; i < 96; i++ ) {
        double x1 = 1.0 + i * 1.0 / 32;
        double x2 = 1.0 + (i + 1) * 1.0 / 32;
        double y = (1.0 / sqrt(x1) + 1.0 / sqrt(x2)) * 0.5;
        uint32_t val = (uint32_t)(y * 256 + 0.5);
#if LARGE_TABLE
        uint32_t cube = val * val * val;
        rsqrt_tab[i] = (((cube + 1) / 4) << 10) + (3 * val);
        printf ("0x%08x, ", rsqrt_tab[i]);
        if (i % 6 == 5) printf ("\n");
#else // LARGE_TABLE
        rsqrt_tab[i] = val;
        printf ("0x%02x, ", rsqrt_tab[i]);
        if (i % 12 == 11) printf ("\n");
#endif // LARGE_TABLE
    }
#endif // ALT_IMPL
#endif // GEN_TAB
    printf ("Running exhaustive test ... ");
    uint32_t i = 0;
    do {
        uint32_t ref = ref_isqrt32 (i);
        uint32_t res = my_isqrt32 (i);
        if (res != ref) {
            printf ("error: arg=%08x  res=%08x  ref=%08x\n", i, res, ref);
            return EXIT_FAILURE;
        }
        i++;
    } while (i);

    printf ("PASSED\n");
    printf ("Running benchmark ...\n");
    i = 0;
    uint32_t sum[8] = {0, 0, 0, 0, 0, 0, 0, 0};
    double start = second();
    do {
        sum [0] += my_isqrt32 (i + 0);
        sum [1] += my_isqrt32 (i + 1);
        sum [2] += my_isqrt32 (i + 2);
        sum [3] += my_isqrt32 (i + 3);
        sum [4] += my_isqrt32 (i + 4);
        sum [5] += my_isqrt32 (i + 5);
        sum [6] += my_isqrt32 (i + 6);
        sum [7] += my_isqrt32 (i + 7);
        i += 8;
    } while (i);
    double stop = second();
    printf ("%08x \relapsed=%.5f sec\n", 
            sum[0]+sum[1]+sum[2]+sum[3]+sum[4]+sum[5]+sum[6]+sum[7],
            stop - start);
    return EXIT_SUCCESS;
}

The most cleverly coded bit-wise integer square root implementations for ARM achieve 3 cycles per result bit, which comes out to a lower bound of 50 cycles for the square root of a 32-bit unsigned integer. An example is shown in Andrew N. Sloss, Dominic Symes, Chris Wright, "ARM System Developer's Guide", Morgan Kaufman 2004.

Since most ARM processors also have very fast integer multipliers, and most even provide a very fast implementation of the wide multiply instruction UMULL, an alternative approach that can achieve execution times on the order of 35 to 45 cycles is computation via the reciprocal square root 1/√x using fixed-point computation. For this it is necessary to normalize the input with the help of a count-leading-zeros instruction, which on most ARM processors is available as an instruction CLZ.

The computation starts with an initial low-accuracy reciprocal square root approximation from a lookup table indexed by some most significant bit of the normalized argument. The Newton-Raphson iteration to refine the reciprocal square root r of a number a with quadratic convergence is r_n+1 = r_n + r_n* (1 - a * r_n²) / 2. This can be re-arranged into algebraically equivalent forms as is convenient. In the exemplary C99 code below, an 8-bit approximation r₀ is read from a 96-entry lookup table. This approximation is accurate to about 7 bits. The first Newton-Raphson iteration computes r₁ = (3 * r₀ - a * r₀³) / 2 to potentially take advantage of small operand multiplication instructions. The second Newton-Raphson iteration computes r₂ = (r₁ * (3 - r₁ * (r₁ * a))) / 2.

The normalized square root is then computed by a back multiplication s₂ = a * r₂ and the final approximation is achieved by denormalizing based on the count of leading zeros of the original argument a. It is important that the desired result ⌊√a⌋ is approximated with underestimation. This simplifies the checking whether the desired result has been achieved by guaranteeing that the remainder ⌊√a⌋ - s₂ * s₂ is positive. If the final approximation is found to be too small, the result is increased by one. Correct operation of this algorithm can easily be demonstrated by exhaustive test of all possible 2³² inputs against a "golden" reference, which takes only a few minutes.

One can speed up this implementation at the expense of additional storage for the lookup table, by pre-computing 3 * r₀ and r₀³ to simplify the first Newton-Raphson iteration. The former requires 10 bit of storage and the latter 24 bits. In order to combine each pair into a 32-bit data item, the cube is rounded to 22 bits, which introduces negligible error into the computation. This results in a lookup table of 96 * 4 = 384 bytes.

An alternative approach uses the observation that all starting approximations have the same most significant bit set, which therefore can be assumed implicitly and does not have to be stored. This allows to squeeze a 9-bit approximation r₀ into an 8-bit data item, with the leading bit restored after table lookup. With a lookup table of 384 entries, all underestimates, one can achieve an accuracy of about 7.5 bits. Combining the back multiply with the Newton-Raphson iteration for the reciprocal square root, one computes s₀ = a * r₀, s₁ = s₀ + r₀ * (a - s₀ * s₀) / 2. Because the accuracy of the starting approximation is not high enough for a very accurate final square root approximation, it can be off by up to three, and an appropriate adjustment must be made based on the magnitude of the remainder floor (sqrt (a)) - s₁ * s₁.

One advantage of the alternative approach is that is halves the number of multiplies required, and in particular requires only a single wide multiplication UMULL. Especially processors where wide multiplies are fairly slow, this is an alternative worth trying.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <math.h>
#if defined(_MSC_VER) && defined(_WIN64)
#include <intrin.h>
#endif // defined(_MSC_VER) && defined(_WIN64)

#define CLZ_BUILTIN (1)      // use compiler's built-in count-leading-zeros
#define CLZ_FPU     (2)      // emulate count-leading-zeros via FPU
#define CLZ_CPU     (3)      // emulate count-leading-zeros via CPU

#define ALT_IMPL    (0)      // alternative implementation with fewer multiplies
#define LARGE_TABLE (0)      // ALT_IMPL=0: incorporate 1st NR-iter into table
#define CLZ_IMPL    (CLZ_CPU)// choose count-leading-zeros implementation
#define GEN_TAB     (0)      // generate tables

uint32_t umul32_hi (uint32_t a, uint32_t b);
uint32_t float_as_uint32 (float a);
int clz32 (uint32_t a);

#if ALT_IMPL
uint8_t rsqrt_tab [384] = 
{
    0xfe, 0xfc, 0xfa, 0xf8, 0xf6, 0xf4, 0xf2, 0xf0, 0xee, 0xed, 0xeb, 0xe9,
    0xe7, 0xe6, 0xe4, 0xe2, 0xe1, 0xdf, 0xdd, 0xdc, 0xda, 0xd8, 0xd7, 0xd5,
    0xd4, 0xd2, 0xd1, 0xcf, 0xce, 0xcc, 0xcb, 0xc9, 0xc8, 0xc7, 0xc5, 0xc4,
    0xc2, 0xc1, 0xc0, 0xbe, 0xbd, 0xbc, 0xba, 0xb9, 0xb8, 0xb7, 0xb5, 0xb4,
    0xb3, 0xb2, 0xb0, 0xaf, 0xae, 0xad, 0xac, 0xab, 0xa9, 0xa8, 0xa7, 0xa6,
    0xa5, 0xa4, 0xa3, 0xa2, 0xa0, 0x9f, 0x9e, 0x9d, 0x9c, 0x9b, 0x9a, 0x99,
    0x98, 0x97, 0x96, 0x95, 0x94, 0x93, 0x92, 0x91, 0x90, 0x8f, 0x8e, 0x8d,
    0x8c, 0x8b, 0x8b, 0x8a, 0x89, 0x88, 0x87, 0x86, 0x85, 0x84, 0x83, 0x83,
    0x82, 0x81, 0x80, 0x7f, 0x7e, 0x7d, 0x7d, 0x7c, 0x7b, 0x7a, 0x79, 0x79,
    0x78, 0x77, 0x76, 0x75, 0x75, 0x74, 0x73, 0x72, 0x72, 0x71, 0x70, 0x6f,
    0x6f, 0x6e, 0x6d, 0x6c, 0x6c, 0x6b, 0x6a, 0x6a, 0x69, 0x68, 0x67, 0x67,
    0x66, 0x65, 0x65, 0x64, 0x63, 0x63, 0x62, 0x61, 0x61, 0x60, 0x5f, 0x5f,
    0x5e, 0x5d, 0x5d, 0x5c, 0x5c, 0x5b, 0x5a, 0x5a, 0x59, 0x58, 0x58, 0x57,
    0x57, 0x56, 0x55, 0x55, 0x54, 0x54, 0x53, 0x52, 0x52, 0x51, 0x51, 0x50,
    0x50, 0x4f, 0x4e, 0x4e, 0x4d, 0x4d, 0x4c, 0x4c, 0x4b, 0x4b, 0x4a, 0x4a,
    0x49, 0x48, 0x48, 0x47, 0x47, 0x46, 0x46, 0x45, 0x45, 0x44, 0x44, 0x43,
    0x43, 0x42, 0x42, 0x41, 0x41, 0x40, 0x40, 0x3f, 0x3f, 0x3e, 0x3e, 0x3d,
    0x3d, 0x3c, 0x3c, 0x3c, 0x3b, 0x3b, 0x3a, 0x3a, 0x39, 0x39, 0x38, 0x38,
    0x37, 0x37, 0x36, 0x36, 0x36, 0x35, 0x35, 0x34, 0x34, 0x33, 0x33, 0x33,
    0x32, 0x32, 0x31, 0x31, 0x30, 0x30, 0x30, 0x2f, 0x2f, 0x2e, 0x2e, 0x2d,
    0x2d, 0x2d, 0x2c, 0x2c, 0x2b, 0x2b, 0x2b, 0x2a, 0x2a, 0x29, 0x29, 0x29,
    0x28, 0x28, 0x27, 0x27, 0x27, 0x26, 0x26, 0x26, 0x25, 0x25, 0x24, 0x24,
    0x24, 0x23, 0x23, 0x23, 0x22, 0x22, 0x21, 0x21, 0x21, 0x20, 0x20, 0x20,
    0x1f, 0x1f, 0x1f, 0x1e, 0x1e, 0x1e, 0x1d, 0x1d, 0x1d, 0x1c, 0x1c, 0x1c,
    0x1b, 0x1b, 0x1a, 0x1a, 0x1a, 0x19, 0x19, 0x19, 0x18, 0x18, 0x18, 0x17,
    0x17, 0x17, 0x17, 0x16, 0x16, 0x16, 0x15, 0x15, 0x15, 0x14, 0x14, 0x14,
    0x13, 0x13, 0x13, 0x12, 0x12, 0x12, 0x11, 0x11, 0x11, 0x11, 0x10, 0x10,
    0x10, 0x0f, 0x0f, 0x0f, 0x0e, 0x0e, 0x0e, 0x0e, 0x0d, 0x0d, 0x0d, 0x0c,
    0x0c, 0x0c, 0x0c, 0x0b, 0x0b, 0x0b, 0x0a, 0x0a, 0x0a, 0x0a, 0x09, 0x09,
    0x09, 0x08, 0x08, 0x08, 0x08, 0x07, 0x07, 0x07, 0x07, 0x06, 0x06, 0x06,
    0x05, 0x05, 0x05, 0x05, 0x04, 0x04, 0x04, 0x04, 0x03, 0x03, 0x03, 0x03,
    0x02, 0x02, 0x02, 0x02, 0x01, 0x01, 0x01, 0x01, 0x00, 0x00, 0x00, 0x00,
};

/* compute floor (sqrt (a)) */
uint32_t my_isqrt32 (uint32_t a)
{
    uint32_t b, r, s, scal, rem;

    if (a == 0) return a;
    /* Normalize argument */
    scal = clz32 (a) & ~1;
    b = a << scal;
    /* Compute initial approximation to 1/sqrt(a) */
    r = rsqrt_tab [(b >> 23) - 128] | 0x100;
    /* Compute initial approximation to sqrt(a) */
    s = umul32_hi (b, r << 8); 
    /* Refine sqrt approximation */
    b = b - s * s;
    s = s + ((r * (b >> 2)) >> 23);
    /* Denormalize result*/
    s = s >> (scal >> 1);
    /* Ensure we got the floor correct */
    rem = a - s * s;
    if      (rem < (2 * s + 1)) s = s + 0;
    else if (rem < (4 * s + 4)) s = s + 1;
    else if (rem < (6 * s + 9)) s = s + 2;
    else                        s = s + 3;
    return s;
}

#else // ALT_IMPL

#if LARGE_TABLE
uint32_t rsqrt_tab [96] = 
{
    0xfa0bfafa, 0xee6b2aee, 0xe5f02ae5, 0xdaf26ed9, 0xd2f002d0, 0xc890c2c4,
    0xc1037abb, 0xb9a75ab2, 0xb4da42ac, 0xadcea2a3, 0xa6f27a9a, 0xa279c294,
    0x9beb4a8b, 0x97a5ca85, 0x9163427c, 0x8d4fca76, 0x89500270, 0x8563ba6a,
    0x818ac264, 0x7dc4ea5e, 0x7a120258, 0x7671da52, 0x72e4424c, 0x6f690a46,
    0x6db24243, 0x6a52423d, 0x67042637, 0x6563c234, 0x62302a2e, 0x609cea2b,
    0x5d836a25, 0x5bfd1a22, 0x58fd421c, 0x5783ae19, 0x560e4a16, 0x53300210,
    0x51c7120d, 0x50623a0a, 0x4da4c204, 0x4c4c1601, 0x4af769fe, 0x49a6b9fb,
    0x485a01f8, 0x471139f5, 0x45cc59f2, 0x448b5def, 0x4214fde9, 0x40df89e6,
    0x3fade1e3, 0x3e8001e0, 0x3d55e1dd, 0x3c2f79da, 0x3c2f79da, 0x3b0cc5d7,
    0x39edc1d4, 0x38d265d1, 0x37baa9ce, 0x36a689cb, 0x359601c8, 0x348909c5,
    0x348909c5, 0x337f99c2, 0x3279adbf, 0x317741bc, 0x30784db9, 0x30784db9,
    0x2f7cc9b6, 0x2e84b1b3, 0x2d9001b0, 0x2d9001b0, 0x2c9eb1ad, 0x2bb0b9aa,
    0x2bb0b9aa, 0x2ac615a7, 0x29dec1a4, 0x29dec1a4, 0x28fab5a1, 0x2819e99e,
    0x2819e99e, 0x273c599b, 0x273c599b, 0x26620198, 0x258ad995, 0x258ad995,
    0x24b6d992, 0x24b6d992, 0x23e5fd8f, 0x2318418c, 0x2318418c, 0x224d9d89,
    0x224d9d89, 0x21860986, 0x21860986, 0x20c18183, 0x20c18183, 0x20000180,
};
#else // LARGE_TABLE
uint8_t rsqrt_tab [96] = 
{
    0xfe, 0xfa, 0xf7, 0xf3, 0xf0, 0xec, 0xe9, 0xe6, 0xe4, 0xe1, 0xde, 0xdc,
    0xd9, 0xd7, 0xd4, 0xd2, 0xd0, 0xce, 0xcc, 0xca, 0xc8, 0xc6, 0xc4, 0xc2,
    0xc1, 0xbf, 0xbd, 0xbc, 0xba, 0xb9, 0xb7, 0xb6, 0xb4, 0xb3, 0xb2, 0xb0,
    0xaf, 0xae, 0xac, 0xab, 0xaa, 0xa9, 0xa8, 0xa7, 0xa6, 0xa5, 0xa3, 0xa2,
    0xa1, 0xa0, 0x9f, 0x9e, 0x9e, 0x9d, 0x9c, 0x9b, 0x9a, 0x99, 0x98, 0x97,
    0x97, 0x96, 0x95, 0x94, 0x93, 0x93, 0x92, 0x91, 0x90, 0x90, 0x8f, 0x8e,
    0x8e, 0x8d, 0x8c, 0x8c, 0x8b, 0x8a, 0x8a, 0x89, 0x89, 0x88, 0x87, 0x87,
    0x86, 0x86, 0x85, 0x84, 0x84, 0x83, 0x83, 0x82, 0x82, 0x81, 0x81, 0x80,
};
#endif //LARGE_TABLE 

/* compute floor (sqrt (a)) */
uint32_t my_isqrt32 (uint32_t a)
{
    uint32_t b, r, s, t, scal, rem;

    if (a == 0) return a;
    /* Normalize argument */
    scal = clz32 (a) & ~1;
    b = a << scal;
    /* Initial approximation to 1/sqrt(a)*/
    t = rsqrt_tab [(b >> 25) - 32];
    /* First NR iteration */
#if LARGE_TABLE
    r = (t << 22) - umul32_hi (b, t);
#else // LARGE_TABLE
    r = ((3 * t) << 22) - umul32_hi (b, (t * t * t) << 8);
#endif // LARGE_TABLE
    /* Second NR iteration */
    s = umul32_hi (r, b);
    s = 0x30000000 - umul32_hi (r, s);
    r = umul32_hi (r, s);
    /* Compute sqrt(a) = a * 1/sqrt(a). Adjust to ensure it's an underestimate*/
    r = umul32_hi (r, b) - 1;
    /* Denormalize result */
    r = r >> ((scal >> 1) + 11);
    /* Make sure we got the floor correct */
    rem = a - r * r;
    if (rem >= (2 * r + 1)) r++;
    return r;
}
#endif // ALT_IMPL

uint32_t umul32_hi (uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

uint32_t float_as_uint32 (float a)
{
    uint32_t r;
    memcpy (&r, &a, sizeof r);
    return r;
}

int clz32 (uint32_t a)
{
#if (CLZ_IMPL == CLZ_FPU)
    // Henry S. Warren, Jr, " Hacker's Delight 2nd ed", p. 105
    int n = 158 - (float_as_uint32 ((float)(int32_t)(a & ~(a >> 1))+.5f) >> 23);
    return (n < 0) ? 0 : n;
#elif (CLZ_IMPL == CLZ_CPU)
    static const uint8_t clz_tab[32] = {
        31, 22, 30, 21, 18, 10, 29,  2, 20, 17, 15, 13, 9,  6, 28, 1,
        23, 19, 11,  3, 16, 14,  7, 24, 12,  4,  8, 25, 5, 26, 27, 0
    };
    a |= a >> 16;
    a |= a >> 8;
    a |= a >> 4;
    a |= a >> 2;
    a |= a >> 1;
    return clz_tab [0x07c4acddu * a >> 27] + (!a);
#else // CLZ_IMPL == CLZ_BUILTIN
#if defined(_MSC_VER) && defined(_WIN64)
    return __lzcnt (a);
#else // defined(_MSC_VER) && defined(_WIN64)
    return __builtin_clz (a);
#endif // defined(_MSC_VER) && defined(_WIN64)
#endif // CLZ_IMPL
}

/* Henry S. Warren, Jr.,  "Hacker's Delight, 2nd e.d", p. 286 */
uint32_t ref_isqrt32 (uint32_t x)
{
    uint32_t m, y, b;
    m = 0x40000000U;
    y = 0U;
    while (m != 0) {
        b = y | m;
        y = y >> 1;
        if (x >= b) {
            x = x - b;
            y = y | m;
        }
        m = m >> 2;
    }
    return y;
}

#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
    LARGE_INTEGER t;
    static double oofreq;
    static int checkedForHighResTimer;
    static BOOL hasHighResTimer;

    if (!checkedForHighResTimer) {
        hasHighResTimer = QueryPerformanceFrequency (&t);
        oofreq = 1.0 / (double)t.QuadPart;
        checkedForHighResTimer = 1;
    }
    if (hasHighResTimer) {
        QueryPerformanceCounter (&t);
        return (double)t.QuadPart * oofreq;
    } else {
        return (double)GetTickCount() * 1.0e-3;
    }
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif

int main (void)
{
#if ALT_IMPL
    printf ("Alternate integer square root implementation\n");
#else // ALT_IMPL
#if LARGE_TABLE
    printf ("Integer square root implementation w/ large table\n");
#else // LARGE_TAB
    printf ("Integer square root implementation w/ small table\n");
#endif
#endif // ALT_IMPL

#if GEN_TAB
    printf ("Generating lookup table ...\n");
#if ALT_IMPL
      for (int i = 0; i < 384; i++) {
        double x = 1.0 + (i + 1) * 1.0 / 128;
        double y = 1.0 / sqrt (x);
        uint8_t val = (uint8_t)((y * 512) - 256);
        rsqrt_tab[i] = val;
        printf ("0x%02x, ", rsqrt_tab[i]);
        if (i % 12 == 11) printf("\n");
    }
#else // ALT_IMPL
    for (int i = 0; i < 96; i++ ) {
        double x1 = 1.0 + i * 1.0 / 32;
        double x2 = 1.0 + (i + 1) * 1.0 / 32;
        double y = (1.0 / sqrt(x1) + 1.0 / sqrt(x2)) * 0.5;
        uint32_t val = (uint32_t)(y * 256 + 0.5);
#if LARGE_TABLE
        uint32_t cube = val * val * val;
        rsqrt_tab[i] = (((cube + 1) / 4) << 10) + (3 * val);
        printf ("0x%08x, ", rsqrt_tab[i]);
        if (i % 6 == 5) printf ("\n");
#else // LARGE_TABLE
        rsqrt_tab[i] = val;
        printf ("0x%02x, ", rsqrt_tab[i]);
        if (i % 12 == 11) printf ("\n");
#endif // LARGE_TABLE
    }
#endif // ALT_IMPL
#endif // GEN_TAB
    printf ("Running exhaustive test ... ");
    uint32_t i = 0;
    do {
        uint32_t ref = ref_isqrt32 (i);
        uint32_t res = my_isqrt32 (i);
        if (res != ref) {
            printf ("error: arg=%08x  res=%08x  ref=%08x\n", i, res, ref);
            return EXIT_FAILURE;
        }
        i++;
    } while (i);

    printf ("PASSED\n");
    printf ("Running benchmark ...\n");
    i = 0;
    uint32_t sum[8] = {0, 0, 0, 0, 0, 0, 0, 0};
    double start = second();
    do {
        sum [0] += my_isqrt32 (i + 0);
        sum [1] += my_isqrt32 (i + 1);
        sum [2] += my_isqrt32 (i + 2);
        sum [3] += my_isqrt32 (i + 3);
        sum [4] += my_isqrt32 (i + 4);
        sum [5] += my_isqrt32 (i + 5);
        sum [6] += my_isqrt32 (i + 6);
        sum [7] += my_isqrt32 (i + 7);
        i += 8;
    } while (i);
    double stop = second();
    printf ("%08x \relapsed=%.5f sec\n", 
            sum[0]+sum[1]+sum[2]+sum[3]+sum[4]+sum[5]+sum[6]+sum[7],
            stop - start);
    return EXIT_SUCCESS;
}

回复收藏 0 原文

瑶笙 2024-08-02 09:05:27

这是 Java 中的一个解决方案，它结合了整数 log_2 和牛顿法来创建无循环算法。缺点是需要分裂。注释行需要上转换为 64 位算法。

private static final int debruijn= 0x07C4ACDD;
//private static final long debruijn= ( ~0x0218A392CD3D5DBFL)>>>6;

static
{
  for(int x= 0; x<32; ++x)
  {
    final long v= ~( -2L<<(x));
    DeBruijnArray[(int)((v*debruijn)>>>27)]= x; //>>>58
  }
  for(int x= 0; x<32; ++x)
    SQRT[x]= (int) (Math.sqrt((1L<<DeBruijnArray[x])*Math.sqrt(2)));
}

public static int sqrt(final int num)
{
  int y;
  if(num==0)
    return num;
  {
    int v= num;
    v|= v>>>1; // first round up to one less than a power of 2 
    v|= v>>>2;
    v|= v>>>4;
    v|= v>>>8;
    v|= v>>>16;
    //v|= v>>>32;
    y= SQRT[(v*debruijn)>>>27]; //>>>58
  }
  //y= (y+num/y)>>>1;
  y= (y+num/y)>>>1;
  y= (y+num/y)>>>1;
  y= (y+num/y)>>>1;
  return y*y>num?y-1:y;
}

工作原理：第一部分产生精确到大约三位的平方根。 y= (y+num/y)>>1; 行将位精度加倍。最后一行消除了可能生成的屋顶根部。

Here is a solution in Java that combines integer log_2 and Newton's method to create a loop free algorithm. As a downside, it needs division. The commented lines are required to upconvert to a 64-bit algorithm.

private static final int debruijn= 0x07C4ACDD;
//private static final long debruijn= ( ~0x0218A392CD3D5DBFL)>>>6;

static
{
  for(int x= 0; x<32; ++x)
  {
    final long v= ~( -2L<<(x));
    DeBruijnArray[(int)((v*debruijn)>>>27)]= x; //>>>58
  }
  for(int x= 0; x<32; ++x)
    SQRT[x]= (int) (Math.sqrt((1L<<DeBruijnArray[x])*Math.sqrt(2)));
}

public static int sqrt(final int num)
{
  int y;
  if(num==0)
    return num;
  {
    int v= num;
    v|= v>>>1; // first round up to one less than a power of 2 
    v|= v>>>2;
    v|= v>>>4;
    v|= v>>>8;
    v|= v>>>16;
    //v|= v>>>32;
    y= SQRT[(v*debruijn)>>>27]; //>>>58
  }
  //y= (y+num/y)>>>1;
  y= (y+num/y)>>>1;
  y= (y+num/y)>>>1;
  y= (y+num/y)>>>1;
  return y*y>num?y-1:y;
}

How this works: The first part produces a square root accurate to about three bits. The line y= (y+num/y)>>1; doubles the accuracy in bits. The last line eliminates the roof roots that can be generated.

回复收藏 0 原文

-黛色若梦 2024-08-02 09:05:27

如果您仅需要 ARM Thumb 2 处理器，那么 ARM 的 CMSIS DSP 库是您的最佳选择。它是由设计 Thumb 2 处理器的人员制作的。还有谁能打败它？

实际上，您甚至不需要算法，而是专门的平方根硬件指令，例如 VSQRT。 ARM 公司通过尝试使用 VSQRT 等硬件，保持针对 Thumb 2 支持的处理器高度优化的数学和 DSP 算法实现。您可以获取源代码：

<代码>arm_sqrt_f32()
arm_sqrt_q15.c / arm_sqrt_q31.c（q15 和 q31 是专门用于 ARM DSP 内核的定点数据类型，通常随 Thum 2 一起提供兼容处理器。）

请注意，ARM 还维护 CMSIS DSP 的已编译二进制文件，以保证 ARM Thumb 架构特定指令的最佳性能。因此，您在使用该库时应该考虑静态链接它们。您可以在此处获取二进制文件。

回复收藏 0 原文

放血 2024-08-02 09:05:27

此方法类似于长除法：您对根的下一个数字进行猜测，进行减法，如果差值满足特定条件，则输入数字。对于二进制版本，下一个数字的唯一选择是 0 或 1，因此您总是猜测 1，进行减法，然后输入 1，除非差值为负数。

http://www.realitypixels.com/turk/opensource/index.html#FractSqrt

回复收藏 0 原文

禾厶谷欠 2024-08-02 09:05:27

我在 C# 中针对 64 位整数实现了 Warren 的建议和牛顿法。 Isqrt 使用牛顿法，而 Isqrt 使用沃伦法。这是源代码：

using System;

namespace Cluster
{
    public static class IntegerMath
    {


        /// <summary>
        /// Compute the integer square root, the largest whole number less than or equal to the true square root of N.
        /// 
        /// This uses the integer version of Newton's method.
        /// </summary>
        public static long Isqrt(this long n)
        {
            if (n < 0) throw new ArgumentOutOfRangeException("n", "Square root of negative numbers is not defined.");
            if (n <= 1) return n;

            var xPrev2 = -2L;
            var xPrev1 = -1L;
            var x = 2L;
            // From Wikipedia: if N + 1 is a perfect square, then the algorithm enters a two-value cycle, so we have to compare 
            // to two previous values to test for convergence.
            while (x != xPrev2 && x != xPrev1)
            {
                xPrev2 = xPrev1;
                xPrev1 = x;
                x = (x + n/x)/2;
            }
            // The two values x and xPrev1 will be above and below the true square root. Choose the lower one.
            return x < xPrev1 ? x : xPrev1;
        }

        #region Sqrt using Bit-shifting and magic numbers.

        // From http://stackoverflow.com/questions/1100090/looking-for-an-efficient-integer-square-root-algorithm-for-arm-thumb2
        // Converted to C#.
        private static readonly ulong debruijn= ( ~0x0218A392CD3D5DBFUL)>>6;
        private static readonly ulong[] SQRT = new ulong[64];
        private static readonly int[] DeBruijnArray = new int[64];

        static IntegerMath()
        {
          for(int x= 0; x<64; ++x)
          {
            ulong v= (ulong) ~( -2L<<(x));
            DeBruijnArray[(v*debruijn)>>58]= x;
          }
          for(int x= 0; x<64; ++x)
            SQRT[x]= (ulong) (Math.Sqrt((1L<<DeBruijnArray[x])*Math.Sqrt(2)));
        }

        public static long Isqrt2(this long n)
        {
          ulong num = (ulong) n; 
          ulong y;
          if(num==0)
            return (long)num;
          {
            ulong v= num;
            v|= v>>1; // first round up to one less than a power of 2 
            v|= v>>2;
            v|= v>>4;
            v|= v>>8;
            v|= v>>16;
            v|= v>>32;
            y= SQRT[(v*debruijn)>>58];
          }
          y= (y+num/y)>>1;
          y= (y+num/y)>>1;
          y= (y+num/y)>>1;
          y= (y+num/y)>>1;
          // Make sure that our answer is rounded down, not up.
          return (long) (y*y>num?y-1:y);
        }

        #endregion

    }
}

我使用以下内容对代码进行基准测试：

using System;
using System.Diagnostics;
using Cluster;
using Microsoft.VisualStudio.TestTools.UnitTesting;

namespace ClusterTests
{
    [TestClass]
    public class IntegerMathTests
    {
        [TestMethod]
        public void Isqrt_Accuracy()
        {
            for (var n = 0L; n <= 100000L; n++)
            {
                var expectedRoot = (long) Math.Sqrt(n);
                var actualRoot = n.Isqrt();
                Assert.AreEqual(expectedRoot, actualRoot, String.Format("Square root is wrong for N = {0}.", n));
            }
        }

        [TestMethod]
        public void Isqrt2_Accuracy()
        {
            for (var n = 0L; n <= 100000L; n++)
            {
                var expectedRoot = (long)Math.Sqrt(n);
                var actualRoot = n.Isqrt2();
                Assert.AreEqual(expectedRoot, actualRoot, String.Format("Square root is wrong for N = {0}.", n));
            }
        }

        [TestMethod]
        public void Isqrt_Speed()
        {
            var integerTimer = new Stopwatch();
            var libraryTimer = new Stopwatch();

            integerTimer.Start();
            var total = 0L;
            for (var n = 0L; n <= 300000L; n++)
            {
                var root = n.Isqrt();
                total += root;
            }
            integerTimer.Stop();

            libraryTimer.Start();
            total = 0L;
            for (var n = 0L; n <= 300000L; n++)
            {
                var root = (long)Math.Sqrt(n);
                total += root;
            }
            libraryTimer.Stop();

            var isqrtMilliseconds = integerTimer.ElapsedMilliseconds;
            var libraryMilliseconds = libraryTimer.ElapsedMilliseconds;
            var msg = String.Format("Isqrt: {0} ms versus library: {1} ms", isqrtMilliseconds, libraryMilliseconds);
            Debug.WriteLine(msg);
            Assert.IsTrue(libraryMilliseconds > isqrtMilliseconds, "Isqrt2 should be faster than Math.Sqrt! " + msg);
        }

        [TestMethod]
        public void Isqrt2_Speed()
        {
            var integerTimer = new Stopwatch();
            var libraryTimer = new Stopwatch();

            var warmup = (10L).Isqrt2();

            integerTimer.Start();
            var total = 0L;
            for (var n = 0L; n <= 300000L; n++)
            {
                var root = n.Isqrt2();
                total += root;
            }
            integerTimer.Stop();

            libraryTimer.Start();
            total = 0L;
            for (var n = 0L; n <= 300000L; n++)
            {
                var root = (long)Math.Sqrt(n);
                total += root;
            }
            libraryTimer.Stop();

            var isqrtMilliseconds = integerTimer.ElapsedMilliseconds;
            var libraryMilliseconds = libraryTimer.ElapsedMilliseconds;
            var msg = String.Format("isqrt2: {0} ms versus library: {1} ms", isqrtMilliseconds, libraryMilliseconds);
            Debug.WriteLine(msg);
            Assert.IsTrue(libraryMilliseconds > isqrtMilliseconds, "Isqrt2 should be faster than Math.Sqrt! " + msg);
        }

    }
}

我在发布模式下的 Dell Latitude E6540、Visual Studio 2012 上的结果是
图书馆调用 Math.Sqrt 更快。

59 毫秒 - 牛顿 (Isqrt)
12 毫秒 - 位移位 (Isqrt2)
5 毫秒 - Math.Sqrt

我对编译器指令并不擅长，因此可以调整编译器以更快地获得整数数学。显然，位移位方法非常接近库。在没有数学协处理器的系统上，速度会非常快。

I implemented Warren's suggestion and the Newton method in C# for 64-bit integers. Isqrt uses the Newton method, while Isqrt uses Warren's method. Here is the source code:

using System;

namespace Cluster
{
    public static class IntegerMath
    {


        /// <summary>
        /// Compute the integer square root, the largest whole number less than or equal to the true square root of N.
        /// 
        /// This uses the integer version of Newton's method.
        /// </summary>
        public static long Isqrt(this long n)
        {
            if (n < 0) throw new ArgumentOutOfRangeException("n", "Square root of negative numbers is not defined.");
            if (n <= 1) return n;

            var xPrev2 = -2L;
            var xPrev1 = -1L;
            var x = 2L;
            // From Wikipedia: if N + 1 is a perfect square, then the algorithm enters a two-value cycle, so we have to compare 
            // to two previous values to test for convergence.
            while (x != xPrev2 && x != xPrev1)
            {
                xPrev2 = xPrev1;
                xPrev1 = x;
                x = (x + n/x)/2;
            }
            // The two values x and xPrev1 will be above and below the true square root. Choose the lower one.
            return x < xPrev1 ? x : xPrev1;
        }

        #region Sqrt using Bit-shifting and magic numbers.

        // From http://stackoverflow.com/questions/1100090/looking-for-an-efficient-integer-square-root-algorithm-for-arm-thumb2
        // Converted to C#.
        private static readonly ulong debruijn= ( ~0x0218A392CD3D5DBFUL)>>6;
        private static readonly ulong[] SQRT = new ulong[64];
        private static readonly int[] DeBruijnArray = new int[64];

        static IntegerMath()
        {
          for(int x= 0; x<64; ++x)
          {
            ulong v= (ulong) ~( -2L<<(x));
            DeBruijnArray[(v*debruijn)>>58]= x;
          }
          for(int x= 0; x<64; ++x)
            SQRT[x]= (ulong) (Math.Sqrt((1L<<DeBruijnArray[x])*Math.Sqrt(2)));
        }

        public static long Isqrt2(this long n)
        {
          ulong num = (ulong) n; 
          ulong y;
          if(num==0)
            return (long)num;
          {
            ulong v= num;
            v|= v>>1; // first round up to one less than a power of 2 
            v|= v>>2;
            v|= v>>4;
            v|= v>>8;
            v|= v>>16;
            v|= v>>32;
            y= SQRT[(v*debruijn)>>58];
          }
          y= (y+num/y)>>1;
          y= (y+num/y)>>1;
          y= (y+num/y)>>1;
          y= (y+num/y)>>1;
          // Make sure that our answer is rounded down, not up.
          return (long) (y*y>num?y-1:y);
        }

        #endregion

    }
}

I used the following to benchmark the code:

using System;
using System.Diagnostics;
using Cluster;
using Microsoft.VisualStudio.TestTools.UnitTesting;

namespace ClusterTests
{
    [TestClass]
    public class IntegerMathTests
    {
        [TestMethod]
        public void Isqrt_Accuracy()
        {
            for (var n = 0L; n <= 100000L; n++)
            {
                var expectedRoot = (long) Math.Sqrt(n);
                var actualRoot = n.Isqrt();
                Assert.AreEqual(expectedRoot, actualRoot, String.Format("Square root is wrong for N = {0}.", n));
            }
        }

        [TestMethod]
        public void Isqrt2_Accuracy()
        {
            for (var n = 0L; n <= 100000L; n++)
            {
                var expectedRoot = (long)Math.Sqrt(n);
                var actualRoot = n.Isqrt2();
                Assert.AreEqual(expectedRoot, actualRoot, String.Format("Square root is wrong for N = {0}.", n));
            }
        }

        [TestMethod]
        public void Isqrt_Speed()
        {
            var integerTimer = new Stopwatch();
            var libraryTimer = new Stopwatch();

            integerTimer.Start();
            var total = 0L;
            for (var n = 0L; n <= 300000L; n++)
            {
                var root = n.Isqrt();
                total += root;
            }
            integerTimer.Stop();

            libraryTimer.Start();
            total = 0L;
            for (var n = 0L; n <= 300000L; n++)
            {
                var root = (long)Math.Sqrt(n);
                total += root;
            }
            libraryTimer.Stop();

            var isqrtMilliseconds = integerTimer.ElapsedMilliseconds;
            var libraryMilliseconds = libraryTimer.ElapsedMilliseconds;
            var msg = String.Format("Isqrt: {0} ms versus library: {1} ms", isqrtMilliseconds, libraryMilliseconds);
            Debug.WriteLine(msg);
            Assert.IsTrue(libraryMilliseconds > isqrtMilliseconds, "Isqrt2 should be faster than Math.Sqrt! " + msg);
        }

        [TestMethod]
        public void Isqrt2_Speed()
        {
            var integerTimer = new Stopwatch();
            var libraryTimer = new Stopwatch();

            var warmup = (10L).Isqrt2();

            integerTimer.Start();
            var total = 0L;
            for (var n = 0L; n <= 300000L; n++)
            {
                var root = n.Isqrt2();
                total += root;
            }
            integerTimer.Stop();

            libraryTimer.Start();
            total = 0L;
            for (var n = 0L; n <= 300000L; n++)
            {
                var root = (long)Math.Sqrt(n);
                total += root;
            }
            libraryTimer.Stop();

            var isqrtMilliseconds = integerTimer.ElapsedMilliseconds;
            var libraryMilliseconds = libraryTimer.ElapsedMilliseconds;
            var msg = String.Format("isqrt2: {0} ms versus library: {1} ms", isqrtMilliseconds, libraryMilliseconds);
            Debug.WriteLine(msg);
            Assert.IsTrue(libraryMilliseconds > isqrtMilliseconds, "Isqrt2 should be faster than Math.Sqrt! " + msg);
        }

    }
}

My results on a Dell Latitude E6540 in Release mode, Visual Studio 2012 were
that the Library call Math.Sqrt is faster.

59 ms - Newton (Isqrt)
12 ms - Bit shifting (Isqrt2)
5 ms - Math.Sqrt

I am not clever with compiler directives, so it may be possible to tune the compiler to get the integer math faster. Clearly, the bit-shifting approach is very close to the library. On a system with no math coprocessor, it would be very fast.

回复收藏 0 原文

人生百味 2024-08-02 09:05:27

我设计了一个 16 位 sqrt 用于 RGB gamma 压缩。它根据高 8 位分派到 3 个不同的表中。缺点：它使用大约 KB 的表，舍入不可预测，如果精确的 sqrt 是不可能的，并且在最坏的情况下，使用单次乘法（但仅适用于很少的输入值）。

static uint8_t sqrt_50_256[] = {
  114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,
  133,134,135,136,137,138,139,140,141,142,143,143,144,145,146,147,148,149,150,
  150,151,152,153,154,155,155,156,157,158,159,159,160,161,162,163,163,164,165,
  166,167,167,168,169,170,170,171,172,173,173,174,175,175,176,177,178,178,179,
  180,181,181,182,183,183,184,185,185,186,187,187,188,189,189,190,191,191,192,
  193,193,194,195,195,196,197,197,198,199,199,200,201,201,202,203,203,204,204,
  205,206,206,207,207,208,209,209,210,211,211,212,212,213,214,214,215,215,216,
  217,217,218,218,219,219,220,221,221,222,222,223,223,224,225,225,226,226,227,
  227,228,229,229,230,230,231,231,232,232,233,234,234,235,235,236,236,237,237,
  238,238,239,239,240,241,241,242,242,243,243,244,244,245,245,246,246,247,247,
  248,248,249,249,250,250,251,251,252,252,253,253,254,254,255,255
};

static uint8_t sqrt_0_10[] = {
  1,2,3,3,4,4,5,5,5,6,6,6,7,7,7,7,8,8,8,8,9,9,9,9,9,10,10,10,10,10,11,11,11,
  11,11,11,12,12,12,12,12,12,13,13,13,13,13,13,13,14,14,14,14,14,14,14,15,15,
  15,15,15,15,15,15,16,16,16,16,16,16,16,16,17,17,17,17,17,17,17,17,17,18,18,
  18,18,18,18,18,18,18,19,19,19,19,19,19,19,19,19,19,20,20,20,20,20,20,20,20,
  20,20,21,21,21,21,21,21,21,21,21,21,21,22,22,22,22,22,22,22,22,22,22,22,23,
  23,23,23,23,23,23,23,23,23,23,23,24,24,24,24,24,24,24,24,24,24,24,24,25,25,
  25,25,25,25,25,25,25,25,25,25,25,26,26,26,26,26,26,26,26,26,26,26,26,26,27,
  27,27,27,27,27,27,27,27,27,27,27,27,27,28,28,28,28,28,28,28,28,28,28,28,28,
  28,28,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,30,30,30,30,30,30,30,30,
  30,30,30,30,30,30,30,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,32,32,
  32,32,32,32,32,32,32,32,32,32,32,32,32,32,33,33,33,33,33,33,33,33,33,33,33,
  33,33,33,33,33,33,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,35,35,
  35,35,35,35,35,35,35,35,35,35,35,35,35,35,35,35,36,36,36,36,36,36,36,36,36,
  36,36,36,36,36,36,36,36,36,37,37,37,37,37,37,37,37,37,37,37,37,37,37,37,37,
  37,37,37,38,38,38,38,38,38,38,38,38,38,38,38,38,38,38,38,38,38,38,39,39,39,
  39,39,39,39,39,39,39,39,39,39,39,39,39,39,39,39,39,40,40,40,40,40,40,40,40,
  40,40,40,40,40,40,40,40,40,40,40,40,41,41,41,41,41,41,41,41,41,41,41,41,41,
  41,41,41,41,41,41,41,41,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,
  42,42,42,42,43,43,43,43,43,43,43,43,43,43,43,43,43,43,43,43,43,43,43,43,43,
  43,44,44,44,44,44,44,44,44,44,44,44,44,44,44,44,44,44,44,44,44,44,44,45,45,
  45,45,45,45,45,45,45,45,45,45,45,45,45,45,45,45,45,45,45,45,45,46,46,46,46,
  46,46,46,46,46,46,46,46,46,46,46,46,46,46,46,46,46,46,46,47,47,47,47,47,47,
  47,47,47,47,47,47,47,47,47,47,47,47,47,47,47,47,47,47,48,48,48,48,48,48,48,
  48,48,48,48,48,48,48,48,48,48,48,48,48,48,48,48,48,49,49,49,49,49,49,49,49,
  49,49,49,49,49,49,49,49,49,49,49,49,49,49,49,49,49,50,50,50,50,50,50,50,50,
  50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,51,51,51,51,51,51,51,51,
  51,51,51,51,51,51,51,51,51,51,51,51,51,51,51,51,51,51,52,52,52,52,52,52,52,
  52,52,52,52,52,52,52,52,52,52,52,52,52,52,52,52,52,52,52,53,53
};

static uint8_t sqrt_11_49[] = {
  54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,0,76,77,78,
  0,79,80,81,82,83,0,84,85,86,0,87,88,89,0,90,0,91,92,93,0,94,0,95,96,97,0,98,0,
  99,100,101,0,102,0,103,0,104,105,106,0,107,0,108,0,109,0,110,0,111,112,113
};

uint16_t isqrt16(uint16_t v) {
  uint16_t a, b;
  uint16_t h = v>>8;
  if (h <= 10) return v ? sqrt_0_10[v>>2] : 0;
  if (h >= 50) return sqrt_50_256[h-50];
  h = (h-11)<<1;
  a = sqrt_11_49[h];
  b = sqrt_11_49[h+1];
  if (!a) return b;
  return b*b > v ? a : b;
}

我将其与基于 log2 的 sqrt 进行了比较，使用 clang 的 __builtin_clz （它应该扩展为单个程序集操作码），以及库的 sqrtf（使用 ( 调用） int)sqrtf((float)i)。并得到了相当奇怪的结果：

$ gcc -O3 test.c -o test && ./test 
isqrt16: 6.498955
sqrtf: 6.981861
log2_sqrt: 61.755873

Clang将对 sqrtf 的调用编译为 sqrtss 指令，几乎与该表 sqrt 一样快。经验教训：在 x86 上，编译器可以提供足够快的 sqrt，这比你自己能想到的慢不到 10%，浪费了大量时间，或者可以快 10 倍，如果你使用了一些丑陋的按位黑客。而且 sqrtss 仍然比自定义函数慢一点，所以如果你确实需要这 5%，你可以得到它们，而 ARM 例如没有 sqrtss，所以 log2_sqrt 应该不会落后那么严重。

在支持 FPU 的 x86 上，旧的 Quake hack似乎是计算整数开方最快的方法。它比该表或 FPU 的 sqrtss 快 2 倍。

I've designed a 16-bit sqrt for RGB gamma compression. It dispatches into 3 different tables, based on the higher 8 bits. Disadvantages: it uses about a kilobyte for the tables, rounds unpredictable, if exact sqrt is impossible, and, in the worst case, uses single multiplication (but only for a very few input values).

static uint8_t sqrt_50_256[] = {
  114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,
  133,134,135,136,137,138,139,140,141,142,143,143,144,145,146,147,148,149,150,
  150,151,152,153,154,155,155,156,157,158,159,159,160,161,162,163,163,164,165,
  166,167,167,168,169,170,170,171,172,173,173,174,175,175,176,177,178,178,179,
  180,181,181,182,183,183,184,185,185,186,187,187,188,189,189,190,191,191,192,
  193,193,194,195,195,196,197,197,198,199,199,200,201,201,202,203,203,204,204,
  205,206,206,207,207,208,209,209,210,211,211,212,212,213,214,214,215,215,216,
  217,217,218,218,219,219,220,221,221,222,222,223,223,224,225,225,226,226,227,
  227,228,229,229,230,230,231,231,232,232,233,234,234,235,235,236,236,237,237,
  238,238,239,239,240,241,241,242,242,243,243,244,244,245,245,246,246,247,247,
  248,248,249,249,250,250,251,251,252,252,253,253,254,254,255,255
};

static uint8_t sqrt_0_10[] = {
  1,2,3,3,4,4,5,5,5,6,6,6,7,7,7,7,8,8,8,8,9,9,9,9,9,10,10,10,10,10,11,11,11,
  11,11,11,12,12,12,12,12,12,13,13,13,13,13,13,13,14,14,14,14,14,14,14,15,15,
  15,15,15,15,15,15,16,16,16,16,16,16,16,16,17,17,17,17,17,17,17,17,17,18,18,
  18,18,18,18,18,18,18,19,19,19,19,19,19,19,19,19,19,20,20,20,20,20,20,20,20,
  20,20,21,21,21,21,21,21,21,21,21,21,21,22,22,22,22,22,22,22,22,22,22,22,23,
  23,23,23,23,23,23,23,23,23,23,23,24,24,24,24,24,24,24,24,24,24,24,24,25,25,
  25,25,25,25,25,25,25,25,25,25,25,26,26,26,26,26,26,26,26,26,26,26,26,26,27,
  27,27,27,27,27,27,27,27,27,27,27,27,27,28,28,28,28,28,28,28,28,28,28,28,28,
  28,28,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,30,30,30,30,30,30,30,30,
  30,30,30,30,30,30,30,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,32,32,
  32,32,32,32,32,32,32,32,32,32,32,32,32,32,33,33,33,33,33,33,33,33,33,33,33,
  33,33,33,33,33,33,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,35,35,
  35,35,35,35,35,35,35,35,35,35,35,35,35,35,35,35,36,36,36,36,36,36,36,36,36,
  36,36,36,36,36,36,36,36,36,37,37,37,37,37,37,37,37,37,37,37,37,37,37,37,37,
  37,37,37,38,38,38,38,38,38,38,38,38,38,38,38,38,38,38,38,38,38,38,39,39,39,
  39,39,39,39,39,39,39,39,39,39,39,39,39,39,39,39,39,40,40,40,40,40,40,40,40,
  40,40,40,40,40,40,40,40,40,40,40,40,41,41,41,41,41,41,41,41,41,41,41,41,41,
  41,41,41,41,41,41,41,41,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,
  42,42,42,42,43,43,43,43,43,43,43,43,43,43,43,43,43,43,43,43,43,43,43,43,43,
  43,44,44,44,44,44,44,44,44,44,44,44,44,44,44,44,44,44,44,44,44,44,44,45,45,
  45,45,45,45,45,45,45,45,45,45,45,45,45,45,45,45,45,45,45,45,45,46,46,46,46,
  46,46,46,46,46,46,46,46,46,46,46,46,46,46,46,46,46,46,46,47,47,47,47,47,47,
  47,47,47,47,47,47,47,47,47,47,47,47,47,47,47,47,47,47,48,48,48,48,48,48,48,
  48,48,48,48,48,48,48,48,48,48,48,48,48,48,48,48,48,49,49,49,49,49,49,49,49,
  49,49,49,49,49,49,49,49,49,49,49,49,49,49,49,49,49,50,50,50,50,50,50,50,50,
  50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,51,51,51,51,51,51,51,51,
  51,51,51,51,51,51,51,51,51,51,51,51,51,51,51,51,51,51,52,52,52,52,52,52,52,
  52,52,52,52,52,52,52,52,52,52,52,52,52,52,52,52,52,52,52,53,53
};

static uint8_t sqrt_11_49[] = {
  54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,0,76,77,78,
  0,79,80,81,82,83,0,84,85,86,0,87,88,89,0,90,0,91,92,93,0,94,0,95,96,97,0,98,0,
  99,100,101,0,102,0,103,0,104,105,106,0,107,0,108,0,109,0,110,0,111,112,113
};

uint16_t isqrt16(uint16_t v) {
  uint16_t a, b;
  uint16_t h = v>>8;
  if (h <= 10) return v ? sqrt_0_10[v>>2] : 0;
  if (h >= 50) return sqrt_50_256[h-50];
  h = (h-11)<<1;
  a = sqrt_11_49[h];
  b = sqrt_11_49[h+1];
  if (!a) return b;
  return b*b > v ? a : b;
}

I've compared it against the log2 based sqrt, using clang's __builtin_clz (which should expand to a single assembly opcode), and the library's sqrtf, called using (int)sqrtf((float)i). And got rather strange results:

$ gcc -O3 test.c -o test && ./test 
isqrt16: 6.498955
sqrtf: 6.981861
log2_sqrt: 61.755873

Clang compiled the call to sqrtf to a sqrtss instruction, which is nearly as fast as that table sqrt. Lesson learned: on x86 the compiler can provide fast enough sqrt, which is less than 10% slower than what you yourself can come up with, wasting a lot of time, or can be 10 times faster, if you use some ugly bitwise hacks. And still sqrtss is a bit slower than custom function, so if you really need these 5%, you can get them, and ARM for example has no sqrtss, so log2_sqrt shouldn't lag that bad.

On x86, where FPU is available, the old Quake hack appears to be the fastest way to calculate integer sqrt. It is 2 times faster than this table or the FPU's sqrtss.

回复收藏 0 原文

~没有更多了~