当前位置：文江博客话题详情

如何在没有除法硬件和没有浮点硬件的情况下实现二进制浮点除法

发布于 2024-12-28 18:30:50 字数 214 浏览 6 评论 0原文

我想知道如何在没有除法硬件和浮点硬件的情况下以二进制实现 IEEE-754 32 位单精度浮点除法？

我有移位硬件、加法、减法和乘法。

我已经使用 16 位字实现了浮点乘法、加法和减法。

我正在专有的多核处理器上实现这些指令，并用汇编语言编写代码。在此之前，我使用 matlab 来验证我的算法。

我知道我需要减去指数，但是如何对尾数执行无符号除法？

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

春夜浅 2025-01-04 18:30:50

取决于你想把它弄得有多复杂。为了保持相当简单，您可以尝试通过倒数近似除法。

不是计算：(n / d)，而是计算：n * (1 / d)。

为此，您需要使用某种方法计算出倒数，例如 Newton-Raphson 使用牛顿法连续计算除数倒数的更准确估计，直到在进行最后的乘法步骤之前，“足够”准确地满足您的目的。

编辑

刚刚看到你的更新。这可能对你有用，也可能没用！

回复收藏 0 原文

请爱~陌生人 2025-01-04 18:30:50

如果您的硬件具有足够快的整数乘法器（例如 4-5 个周期），则使用迭代方法来计算 recip = 1 / 除数可能是最快的方法。然后，您可以将商计算为股息 * 收益。如果硬件提供双宽度结果（即完整乘积）的整数乘法或提供两个 32 位整数乘积的高 32 位的“mulhi”指令，这对于必要的定点计算非常有帮助。

您需要动态重新缩放定点计算，以在中间结果中保留接近 32 位，从而导出精确到舍入后最终结果所需的 24 位的结果。

最快的方法可能是生成一个 9 位起始近似值“近似”为 1 / 除数，然后对倒数进行三次收敛迭代：

e = 1 - divisor * approx
e = e * e + e
recip = e * approx + approx

最简单的方法是预先计算起始近似值并将其存储在 256 字节的数组中，按位索引23:16 的除数（即尾数的 8 个最高有效小数位）。由于所有近似值都是 0x100 ... 0x1FF 范围内的数字（对应于 0.5 到 0.998046875），因此无需存储每个值的最高有效位，因为它将是“1”。只需将 0x100 添加到检索到的表元素即可获得初始近似值的最终值。

如果您无法负担 256 字节的表存储空间，则生成精确到 9 位的起始近似值的另一种方法是近似 1 / (1+f) 的 3 次多项式，其中 f 是除数尾数 m 的小数部分。由于在 IEEE-754 中，已知 m 在 [1.0,2.0) 中，f 在 [0.0,1.0) 中。

正确舍入到最接近或偶数（如果需要）可以通过将初步商与除数进行反乘来确定余数，并选择最终商以使余数最小化。

以下代码演示了上面讨论的近似原理，使用更简单的倒数情况，根据 IEEE-754 最近或偶数模式进行适当舍入，并适当处理特殊情况（零、非正规数、无穷大、NaN）。它假设 32 位 IEEE-754 单精度浮点可以按位从 32 位无符号整型转换而来。然后，所有操作都对 32 位整数执行。

unsigned int frcp_rn_core (unsigned int z)
{
  unsigned int x, y;
  int sign;
  int expo;

  sign = z & 0x80000000;
  expo = (z >> 23) & 0xff;
  x = expo - 1;
  if (x > 0xfd) {
    /* handle special cases */
    x = z << 1;
    /* zero or small denormal returns infinity of like sign */
    if (x <= 0x00400000) {
      return sign | 0x7f800000;
    }
    /* infinity returns zero of like sign */
    else if (x == 0xff000000) {
      return sign;
    }
    /* convert SNaNs to QNaNs */
    else if (x > 0xff000000) {
      return z | 0x00400000;
    } 
    /* large denormal, normalize it */      
    else {
      y = x < 0x00800000;
      z = x << y;
      expo = expo - y;
    }
  }
  y = z & 0x007fffff;
#if USE_TABLE
  z = 256 + rcp_tab[y >> 15];  /* approx */
#else
  x = y << 3;
  z =                    0xe39ad7a0;
  z = mul_32_hi (x, z) + 0x0154bde4;
  z = mul_32_hi (x, z) + 0xfff87521;
  z = mul_32_hi (x, z) + 0x00001ffa;
  z = z >> 4;
#endif /* USE_TABLE */
  y = y | 0x00800000;
  /* cubically convergent approximation to reciprocal */
  x = (unsigned int)-(int)(y * z); /* x = 1 - arg * approx */
  x = mul_32_hi (x, x) + x;        /* x = x * x + x */
  z = z << 15;
  z = mul_32_hi (x, z) + z;        /* approx = x * approx + approx */
  /* compute result exponent */
  expo = 252 - expo;
  if (expo >= 0) {
    /* result is a normal */
    x = y * z + y;
    z = (expo << 23) + z;
    z = z | sign;
    /* round result */
    if ((int)x <= (int)(y >> 1)) {
      z++;
    }
    return z;
  } 
  /* result is a denormal */
  expo = -expo;
  z = z >> expo;
  x = y * z + y;
  z = z | sign;
  /* round result */
  if ((int)x <= (int)(y >> 1)) {
    z++;
  }
  return z;
}

函数 mul_32_high() 是机器特定操作的占位符，该操作返回两个 32 位整数的有符号乘法的高 32 位。代替机器特定版本的半可移植实现是

/* 32-bit int, 64-bit long long int */
int mul_32_hi (int a, int b)
{
  return (int)(unsigned int)(((unsigned long long)(((long long)a)*b)) >> 32);
}

基于表的变体使用的 256 项倒数表可以构造如下：

static unsigned char rcp_tab[256];  
for (int i = 0; i < 256; i++) {
  rcp_tab[i] = (unsigned char)(((1./(1.+((i+.5)/256.)))*512.+.5)-256.);
}

If your hardware has a sufficiently fast integer multiplier (say 4-5 cycles), using an iterative approach to compute recip = 1 / divisor is likely to be the fastest approach. You would then compute the quotient as dividend * recip. It is very helpful for the necessary fixed-point computations if the hardware offers either integer multiplication with double-width result (i.e. the full product) or a "mulhi" instruction that delivers the upper 32 bits of the product of two 32 bit integers.

You will need to re-scale the fixed-point computations on the fly to retain close to 32 bits in intermediate results to derive a result that is accurate to the 24 bits that are required for the final result after rounding.

The fastest approach is likely generating a 9-bit starting approximation "approx" to 1 / divisor followed by a cubically convergent iteration for the reciprocal:

e = 1 - divisor * approx
e = e * e + e
recip = e * approx + approx

It is easiest to precompute the starting approximation and sstore it in an array of 256 bytes, indexed by bits 23:16 of the divisor (i.e. the 8 most significant fractional bits of the mantissa). As all approximation values are numbers in the range 0x100 ... 0x1FF (corresponding to 0.5 to 0.998046875) there is no need to store the most significant bit of each value as it will be '1'. Simply add 0x100 to the table element retrieved to get the final value of the initial approximation.

If you cannot afford 256 bytes of table storage, an alternative way to generate a starting approximation accurate to 9 bits would be a polynomial of degree 3 that approximates 1 / (1+f) where f is the fractional part of the divisor mantissa, m. Since with IEEE-754, m is known to be in [1.0,2.0), f is in [0.0,1.0).

Correct rounding to nearest-or-even (if required) could be implemented by the back-multiplication of the preliminary quotient by the divisor to establish the remainder, and selecting the final quotient such that the remainder is minimized.

The following code demonstrates the approximation principles discussed above, using the simpler case of the reciprocal, with proper rounding according to IEEE-754 nearest-or-even mode, and with appropriate handling of special cases (zero, denormals, infinity, NaNs). It assumes that a 32-bit IEEE-754 single-precision float can be transferred bit-wise from and to a 32-bit unsigned int. All operations are then performed on 32-bit integers.

unsigned int frcp_rn_core (unsigned int z)
{
  unsigned int x, y;
  int sign;
  int expo;

  sign = z & 0x80000000;
  expo = (z >> 23) & 0xff;
  x = expo - 1;
  if (x > 0xfd) {
    /* handle special cases */
    x = z << 1;
    /* zero or small denormal returns infinity of like sign */
    if (x <= 0x00400000) {
      return sign | 0x7f800000;
    }
    /* infinity returns zero of like sign */
    else if (x == 0xff000000) {
      return sign;
    }
    /* convert SNaNs to QNaNs */
    else if (x > 0xff000000) {
      return z | 0x00400000;
    } 
    /* large denormal, normalize it */      
    else {
      y = x < 0x00800000;
      z = x << y;
      expo = expo - y;
    }
  }
  y = z & 0x007fffff;
#if USE_TABLE
  z = 256 + rcp_tab[y >> 15];  /* approx */
#else
  x = y << 3;
  z =                    0xe39ad7a0;
  z = mul_32_hi (x, z) + 0x0154bde4;
  z = mul_32_hi (x, z) + 0xfff87521;
  z = mul_32_hi (x, z) + 0x00001ffa;
  z = z >> 4;
#endif /* USE_TABLE */
  y = y | 0x00800000;
  /* cubically convergent approximation to reciprocal */
  x = (unsigned int)-(int)(y * z); /* x = 1 - arg * approx */
  x = mul_32_hi (x, x) + x;        /* x = x * x + x */
  z = z << 15;
  z = mul_32_hi (x, z) + z;        /* approx = x * approx + approx */
  /* compute result exponent */
  expo = 252 - expo;
  if (expo >= 0) {
    /* result is a normal */
    x = y * z + y;
    z = (expo << 23) + z;
    z = z | sign;
    /* round result */
    if ((int)x <= (int)(y >> 1)) {
      z++;
    }
    return z;
  } 
  /* result is a denormal */
  expo = -expo;
  z = z >> expo;
  x = y * z + y;
  z = z | sign;
  /* round result */
  if ((int)x <= (int)(y >> 1)) {
    z++;
  }
  return z;
}

The function mul_32_high() is a placeholder for a machine-specific operation that returns the upper 32 bits of a signed multiplication of two 32-bit bit integers. A semi-portable implementation in lieu of a machine-specific version is

/* 32-bit int, 64-bit long long int */
int mul_32_hi (int a, int b)
{
  return (int)(unsigned int)(((unsigned long long)(((long long)a)*b)) >> 32);
}

The 256-entry reciprocal table used by the table-based variant can be constructed as follows:

static unsigned char rcp_tab[256];  
for (int i = 0; i < 256; i++) {
  rcp_tab[i] = (unsigned char)(((1./(1.+((i+.5)/256.)))*512.+.5)-256.);
}

回复收藏 0 原文

~没有更多了~

关于作者

墨小沫ゞ

暂无简介

文章

27 人气

关注发私信

友情链接

文江博客

如何在没有除法硬件和没有浮点硬件的情况下实现二进制浮点除法

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（2）

关于作者

相关话题

热门标签

推荐作者

櫻之舞

弥枳

m2429

寻找一个思念的角度

野却迷人

我怀念的。

友情链接

如何在没有除法硬件和没有浮点硬件的情况下实现二进制浮点除法

如果你对这篇内容有疑问，欢迎到本站社区发帖提问 参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（2）

关于作者

相关话题

热门标签

推荐作者

櫻之舞

弥枳

m2429

寻找一个思念的角度

野却迷人

我怀念的。

友情链接

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。