为 Goldschmidt 部门挑选良好的初步估计

发布于 2024-08-29 10:45:00 字数 2485 浏览 9 评论 0原文

我正在使用 Goldschmidt 除法计算 Q22.10 中的定点倒数在我的 ARM 上的软件光栅化器中。

只需将分子设置为 1 即可完成此操作，即分子在第一次迭代时变为标量。老实说，我在这里有点盲目地遵循维基百科的算法。文章说，如果分母在半开范围（0.5, 1.0] 内缩放，则可以仅基于分母进行良好的初步估计：设 F 为估计标量，D 为分母，则 F = 2 - D.

但是这样做时，我失去了很多精度。假设我想找到 512.00002f 的倒数，为了缩小数字，我失去了小数部分的 10 位精度，这些精度被移出。所以，我的问题是：

有没有一种方法可以选择不需要标准化的更好的估计？为什么不呢？为什么可以或不可能的数学证明也很好。
另外，是否可以预先计算。第一个估计使得该系列收敛得更快？现在，在 ARM 上，它平均在 50 个周期后收敛，并且没有考虑 clz/bsr 的模拟，也没有考虑内存查找。可能的话，我想知道这样做是否会增加错误，以及增加多少。

这是我的测试用例注意：第 13 行 clz 的软件实现来自我的帖子此处。如果需要，您可以将其替换为内在函数。 clz 应返回前导零的数量，对于值 0，应返回 32。

#include <stdio.h>
#include <stdint.h>

const unsigned int BASE = 22ULL;

static unsigned int divfp(unsigned int val, int* iter)
{
  /* Numerator, denominator, estimate scalar and previous denominator */
  unsigned long long N,D,F, DPREV;
  int bitpos;

  *iter = 1;
  D = val;
  /* Get the shift amount + is right-shift, - is left-shift. */
  bitpos = 31 - clz(val) - BASE;
  /* Normalize into the half-range (0.5, 1.0] */
  if(0 < bitpos)
    D >>= bitpos;
  else
    D <<= (-bitpos);

  /* (FNi / FDi) == (FN(i+1) / FD(i+1)) */
  /* F = 2 - D */
  F = (2ULL<<BASE) - D;
  /* N = F for the first iteration, because the numerator is simply 1.
     So don't waste a 64-bit UMULL on a multiply with 1 */
  N = F;
  D = ((unsigned long long)D*F)>>BASE;

  while(1){
    DPREV = D;
    F = (2<<(BASE)) - D;
    D = ((unsigned long long)D*F)>>BASE;
    /* Bail when we get the same value for two denominators in a row.
      This means that the error is too small to make any further progress. */
    if(D == DPREV)
      break;
    N = ((unsigned long long)N*F)>>BASE;
    *iter = *iter + 1;
  }
  if(0 < bitpos)
    N >>= bitpos;
  else
    N <<= (-bitpos);
  return N;
}


int main(int argc, char* argv[])
{
  double fv, fa;
  int iter;
  unsigned int D, result;

  sscanf(argv[1], "%lf", &fv);

  D = fv*(double)(1<<BASE);
  result = divfp(D, &iter); 

  fa = (double)result / (double)(1UL << BASE);
  printf("Value: %8.8lf 1/value: %8.8lf FP value: 0x%.8X\n", fv, fa, result);
  printf("iteration: %d\n",iter);

  return 0;
}

原文

I'm calculating fixedpoint reciprocals in Q22.10 with Goldschmidt division for use in my software rasterizer on ARM.

This is done by just setting the numerator to 1, i.e the numerator becomes the scalar on the first iteration. To be honest, I'm kind of following the wikipedia algorithm blindly here. The article says that if the denominator is scaled in the half-open range (0.5, 1.0], a good first estimate can be based on the denominator alone: Let F be the estimated scalar and D be the denominator, then F = 2 - D.

But when doing this, I lose a lot of precision. Say if I want to find the reciprocal of 512.00002f. In order to scale the number down, I lose 10 bits of precision in the fraction part, which is shifted out. So, my questions are:

Is there a way to pick a better estimate which does not require normalization? Why? Why not? A mathematical proof of why this is or is not possible would be great.
Also, is it possible to pre-calculate the first estimates so the series converges faster? Right now, it converges after the 4th iteration on average. On ARM this is about ~50 cycles worst case, and that's not taking emulation of clz/bsr into account, nor memory lookups. If it's possible, I'd like to know if doing so increases the error, and by how much.

Here is my testcase. Note: The software implementation of clz on line 13 is from my post here. You can replace it with an intrinsic if you want. clz should return the number of leading zeros, and 32 for the value 0.

#include <stdio.h>
#include <stdint.h>

const unsigned int BASE = 22ULL;

static unsigned int divfp(unsigned int val, int* iter)
{
  /* Numerator, denominator, estimate scalar and previous denominator */
  unsigned long long N,D,F, DPREV;
  int bitpos;

  *iter = 1;
  D = val;
  /* Get the shift amount + is right-shift, - is left-shift. */
  bitpos = 31 - clz(val) - BASE;
  /* Normalize into the half-range (0.5, 1.0] */
  if(0 < bitpos)
    D >>= bitpos;
  else
    D <<= (-bitpos);

  /* (FNi / FDi) == (FN(i+1) / FD(i+1)) */
  /* F = 2 - D */
  F = (2ULL<<BASE) - D;
  /* N = F for the first iteration, because the numerator is simply 1.
     So don't waste a 64-bit UMULL on a multiply with 1 */
  N = F;
  D = ((unsigned long long)D*F)>>BASE;

  while(1){
    DPREV = D;
    F = (2<<(BASE)) - D;
    D = ((unsigned long long)D*F)>>BASE;
    /* Bail when we get the same value for two denominators in a row.
      This means that the error is too small to make any further progress. */
    if(D == DPREV)
      break;
    N = ((unsigned long long)N*F)>>BASE;
    *iter = *iter + 1;
  }
  if(0 < bitpos)
    N >>= bitpos;
  else
    N <<= (-bitpos);
  return N;
}


int main(int argc, char* argv[])
{
  double fv, fa;
  int iter;
  unsigned int D, result;

  sscanf(argv[1], "%lf", &fv);

  D = fv*(double)(1<<BASE);
  result = divfp(D, &iter); 

  fa = (double)result / (double)(1UL << BASE);
  printf("Value: %8.8lf 1/value: %8.8lf FP value: 0x%.8X\n", fv, fa, result);
  printf("iteration: %d\n",iter);

  return 0;
}

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

战皆罪 2024-09-05 10:45:01

我忍不住花一个小时来解决你的问题...

该算法在 Jean-Michel Muller 的“Arithmetique des ordinators”（法语）的第 5.5.2 节中进行了描述。它实际上是以1为起点的牛顿迭代的特例。本书给出了计算 N/D 的算法的简单公式，其中 D 在 [1/2,1[ 范围内标准化：

e = 1 - D
Q = N
repeat K times:
  Q = Q * (1+e)
  e = e*e

每次迭代时正确位数加倍。在 32 位的情况下，4 次迭代就足够了。您还可以迭代，直到 e 变得太小而无法修改 Q。

使用归一化是因为它提供了结果中有效位的最大数量。当输入在已知范围内时，计算所需的误差和迭代次数也更容易。

一旦你的输入值被归一化，你就不需要关心 BASE 的值，直到你得到逆值。您只需将一个 32 位数字 X 归一化在 0x80000000 到 0xFFFFFFFF 范围内，然后计算 Y=2^64/X 的近似值（Y 最多为 2^33）。

这个简化的算法可以为您的 Q22.10 表示实现，如下所示：

// Fixed point inversion
// EB Apr 2010

#include <math.h>
#include <stdio.h>

// Number X is represented by integer I: X = I/2^BASE.
// We have (32-BASE) bits in integral part, and BASE bits in fractional part
#define BASE 22
typedef unsigned int uint32;
typedef unsigned long long int uint64;

// Convert FP to/from double (debug)
double toDouble(uint32 fp) { return fp/(double)(1<<BASE); }
uint32 toFP(double x) { return (int)floor(0.5+x*(1<<BASE)); }

// Return inverse of FP
uint32 inverse(uint32 fp)
{
  if (fp == 0) return (uint32)-1; // invalid

  // Shift FP to have the most significant bit set
  int shl = 0; // normalization shift
  uint32 nfp = fp; // normalized FP
  while ( (nfp & 0x80000000) == 0 ) { nfp <<= 1; shl++; } // use "clz" instead

  uint64 q = 0x100000000ULL; // 2^32
  uint64 e = 0x100000000ULL - (uint64)nfp; // 2^32-NFP
  int i;
  for (i=0;i<4;i++) // iterate
    {
      // Both multiplications are actually
      // 32x32 bits truncated to the 32 high bits
      q += (q*e)>>(uint64)32;
      e = (e*e)>>(uint64)32;
      printf("Q=0x%llx E=0x%llx\n",q,e);
    }
  // Here, (Q/2^32) is the inverse of (NFP/2^32).
  // We have 2^31<=NFP<2^32 and 2^32<Q<=2^33
  return (uint32)(q>>(64-2*BASE-shl));
}

int main()
{
  double x = 1.234567;
  uint32 xx = toFP(x);
  uint32 yy = inverse(xx);
  double y = toDouble(yy);

  printf("X=%f Y=%f X*Y=%f\n",x,y,x*y);
  printf("XX=0x%08x YY=0x%08x XX*YY=0x%016llx\n",xx,yy,(uint64)xx*(uint64)yy);
}

如代码中所述，乘法不是完整的 32x32->64 位。 E 将变得越来越小，最初适合 32 位。 Q 将始终为 34 位。我们只取产品的高 32 位。

64-2*BASE-shl 的推导留给读者作为练习:-)。如果变成0或负数，则结果无法表示（输入值太小）。

编辑。作为我评论的后续内容，这是第二个版本，Q 上有隐式第 32 位。E 和 Q 现在都存储在 32 位上：

uint32 inverse2(uint32 fp)
{
  if (fp == 0) return (uint32)-1; // invalid

  // Shift FP to have the most significant bit set
  int shl = 0; // normalization shift for FP
  uint32 nfp = fp; // normalized FP
  while ( (nfp & 0x80000000) == 0 ) { nfp <<= 1; shl++; } // use "clz" instead
  int shr = 64-2*BASE-shl; // normalization shift for Q
  if (shr <= 0) return (uint32)-1; // overflow

  uint64 e = 1 + (0xFFFFFFFF ^ nfp); // 2^32-NFP, max value is 2^31
  uint64 q = e; // 2^32 implicit bit, and implicit first iteration
  int i;
  for (i=0;i<3;i++) // iterate
    {
      e = (e*e)>>(uint64)32;
      q += e + ((q*e)>>(uint64)32);
    }
  return (uint32)(q>>shr) + (1<<(32-shr)); // insert implicit bit
}

I could not resist spending an hour on your problem...

This algorithm is described in section 5.5.2 of "Arithmetique des ordinateurs" by Jean-Michel Muller (in french). It is actually a special case of Newton iterations with 1 as starting point. The book gives a simple formulation of the algorithm to compute N/D, with D normalized in range [1/2,1[:

e = 1 - D
Q = N
repeat K times:
  Q = Q * (1+e)
  e = e*e

The number of correct bits doubles at each iteration. In the case of 32 bits, 4 iterations will be enough. You can also iterate until e becomes too small to modify Q.

Normalization is used because it provides the max number of significant bits in the result. It is also easier to compute the error and number of iterations needed when the inputs are in a known range.

Once your input value is normalized, you don't need to bother with the value of BASE until you have the inverse. You simply have a 32-bit number X normalized in range 0x80000000 to 0xFFFFFFFF, and compute an approximation of Y=2^64/X (Y is at most 2^33).

This simplified algorithm may be implemented for your Q22.10 representation as follows:

// Fixed point inversion
// EB Apr 2010

#include <math.h>
#include <stdio.h>

// Number X is represented by integer I: X = I/2^BASE.
// We have (32-BASE) bits in integral part, and BASE bits in fractional part
#define BASE 22
typedef unsigned int uint32;
typedef unsigned long long int uint64;

// Convert FP to/from double (debug)
double toDouble(uint32 fp) { return fp/(double)(1<<BASE); }
uint32 toFP(double x) { return (int)floor(0.5+x*(1<<BASE)); }

// Return inverse of FP
uint32 inverse(uint32 fp)
{
  if (fp == 0) return (uint32)-1; // invalid

  // Shift FP to have the most significant bit set
  int shl = 0; // normalization shift
  uint32 nfp = fp; // normalized FP
  while ( (nfp & 0x80000000) == 0 ) { nfp <<= 1; shl++; } // use "clz" instead

  uint64 q = 0x100000000ULL; // 2^32
  uint64 e = 0x100000000ULL - (uint64)nfp; // 2^32-NFP
  int i;
  for (i=0;i<4;i++) // iterate
    {
      // Both multiplications are actually
      // 32x32 bits truncated to the 32 high bits
      q += (q*e)>>(uint64)32;
      e = (e*e)>>(uint64)32;
      printf("Q=0x%llx E=0x%llx\n",q,e);
    }
  // Here, (Q/2^32) is the inverse of (NFP/2^32).
  // We have 2^31<=NFP<2^32 and 2^32<Q<=2^33
  return (uint32)(q>>(64-2*BASE-shl));
}

int main()
{
  double x = 1.234567;
  uint32 xx = toFP(x);
  uint32 yy = inverse(xx);
  double y = toDouble(yy);

  printf("X=%f Y=%f X*Y=%f\n",x,y,x*y);
  printf("XX=0x%08x YY=0x%08x XX*YY=0x%016llx\n",xx,yy,(uint64)xx*(uint64)yy);
}

As noted in the code, the multiplications are not full 32x32->64 bits. E will become smaller and smaller and fits initially on 32 bits. Q will always be on 34 bits. We take only the high 32 bits of the products.

The derivation of 64-2*BASE-shl is left as an exercise for the reader :-). If it becomes 0 or negative, the result is not representable (the input value is too small).

EDIT. As a follow-up to my comment, here is a second version with an implicit 32-th bit on Q. Both E and Q are now stored on 32 bits:

uint32 inverse2(uint32 fp)
{
  if (fp == 0) return (uint32)-1; // invalid

  // Shift FP to have the most significant bit set
  int shl = 0; // normalization shift for FP
  uint32 nfp = fp; // normalized FP
  while ( (nfp & 0x80000000) == 0 ) { nfp <<= 1; shl++; } // use "clz" instead
  int shr = 64-2*BASE-shl; // normalization shift for Q
  if (shr <= 0) return (uint32)-1; // overflow

  uint64 e = 1 + (0xFFFFFFFF ^ nfp); // 2^32-NFP, max value is 2^31
  uint64 q = e; // 2^32 implicit bit, and implicit first iteration
  int i;
  for (i=0;i<3;i++) // iterate
    {
      e = (e*e)>>(uint64)32;
      q += e + ((q*e)>>(uint64)32);
    }
  return (uint32)(q>>shr) + (1<<(32-shr)); // insert implicit bit
}

回复收藏 0 原文

云淡月浅 2024-09-05 10:45:01

为您提供一些想法，但没有一个可以直接解决您的问题。

为什么要使用这种除法算法？我在 ARM 中看到的大多数划分都使用某种变体
<前><代码>
adcs 嗨，den，嗨，lsl #1
subcc 嗨，嗨，书房
adc 洛，洛，洛

通过 clz 的二分搜索重复 n 位以确定从哪里开始。这真是太快了。

如果精度是一个大问题，那么定点表示并不限于 32/64 位。它会慢一点，但您可以执行 add/adc 或 sub/sbc 在寄存器之间移动值。 mul/mla 也是为此类工作而设计的。

再说一次，这不是给你直接的答案，但可能是一些推进这个问题的想法。查看实际的 ARM 代码可能也会对我有所帮助。

A couple of ideas for you, though none that solve your problem directly as stated.

Why this algo for division? Most divides I've seen in ARM use some varient of


      adcs hi, den, hi, lsl #1
      subcc hi, hi, den
      adcs lo, lo, lo

repeated n bits times with a binary search off of the clz to determine where to start. That's pretty dang fast.

If precision is a big problem, you are not limited to 32/64 bits for your fixed point representation. It'll be a bit slower, but you can do add/adc or sub/sbc to move values across registers. mul/mla are also designed for this kind of work.

Again, not direct answers for you, but possibly a few ideas to go forward this. Seeing the actual ARM code would probably help me a bit as well.

回复收藏 0 原文

清晨说晚安 2024-09-05 10:45:01

Mads，你根本没有失去任何精确度。当您将 512.00002f 除以 2^10 时，您只需将浮点数的指数减 10。尾数保持不变。当然，除非指数达到最小值，但这种情况不应该发生，因为您要缩放到 (0.5, 1]。

编辑：好的，所以您使用固定的小数点。在这种情况下，您应该允许使用不同的表示形式D 的值不仅在开始时而且在整个计算过程中都来自 (0.5, 1]（很容易证明 x < 1 时 x * (2-x) < 1）。因此，您应该用小数点表示分母，基数为 32。这样您将始终具有 32 位精度

编辑：要实现此目的，您必须更改代码的以下几行：

  //bitpos = 31 - clz(val) - BASE;
  bitpos = 31 - clz(val) - 31;
...
  //F = (2ULL<<BASE) - D;
  //N = F;
  //D = ((unsigned long long)D*F)>>BASE;
  F = -D;
  N = F >> (31 - BASE);
  D = ((unsigned long long)D*F)>>31;
...
    //F = (2<<(BASE)) - D;
    //D = ((unsigned long long)D*F)>>BASE;
    F = -D;
    D = ((unsigned long long)D*F)>>31;
...
    //N = ((unsigned long long)N*F)>>BASE;
    N = ((unsigned long long)N*F)>>31;

同样，最后您。必须将 N 移位，而不是按 bitpos，而是按一些不同的值，我现在懒得弄清楚:)。

Mads, you are not losing any precision at all. When you divide 512.00002f by 2^10, you merely decrease the exponent of your floating point number by 10. Mantissa remains the same. Of course unless the exponent hits its minimum value but that shouldn't happen since you're scaling to (0.5, 1].

EDIT: Ok so you're using a fixed decimal point. In that case you should allow a different representation of the denominator in your algorithm. The value of D is from (0.5, 1] not only at the beginning but throughout the whole calculation (it's easy to prove that x * (2-x) < 1 for x < 1). So you should represent the denominator with decimal point at base = 32. This way you will have 32 bits of precision all the time.

EDIT: To implement this you'll have to change the following lines of your code:

  //bitpos = 31 - clz(val) - BASE;
  bitpos = 31 - clz(val) - 31;
...
  //F = (2ULL<<BASE) - D;
  //N = F;
  //D = ((unsigned long long)D*F)>>BASE;
  F = -D;
  N = F >> (31 - BASE);
  D = ((unsigned long long)D*F)>>31;
...
    //F = (2<<(BASE)) - D;
    //D = ((unsigned long long)D*F)>>BASE;
    F = -D;
    D = ((unsigned long long)D*F)>>31;
...
    //N = ((unsigned long long)N*F)>>BASE;
    N = ((unsigned long long)N*F)>>31;

Also in the end you'll have to shift N not by bitpos but some different value which I'm too lazy to figure out right now :).

回复收藏 0 原文

~没有更多了~