Using floats or doubles instead of integers

Posted on 2024-11-08 14:50:44

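I know that Lua's default implementation uses only floating-point numbers, which sidesteps the problem of dynamically determining a number's subtype before choosing which variant of a math function to use.

My question: if I emulate integers as doubles (or floats) in standard C99, is there a reliable (and simple) way to tell what the largest exactly representable value is?

I mean, if I use 64-bit floats to represent integers, I certainly can't represent every 64-bit integer (the pigeonhole principle applies here). How can I tell the largest integer that is representable?

(Enumerating all values is not a solution: if I use doubles on a 64-bit architecture, for example, I would have to list 2^{64} numbers.)

Thanks!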

请你别敷衍 2024-11-15 14:50:44

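The largest integer N such that every integer from 0 to N is exactly representable is 2⁵³ (9007199254740992) for a 64-bit double and 2²⁴ (16777216) for a 32-bit float. These follow from the significand widths of 53 and 24 bits (counting the implicit leading bit); see the significand sizes listed on the Wikipedia page for IEEE floating-point numbers.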

Verifying this in Lua is pretty simple:

local maxdouble = 2^53

-- one less than the maximum can be represented precisely
print (string.format("%.0f",maxdouble-1)) --> 9007199254740991
-- the maximum itself can be represented precisely
print (string.format("%.0f",maxdouble))   --> 9007199254740992
-- one more than the maximum gets rounded down
print (string.format("%.0f",maxdouble+1)) --> 9007199254740992 again

If we don't have the IEEE-defined field sizes handy, we can use what we know about the design of floating-point numbers and find these values with a simple loop over powers of two, checking whether 2^n + 1 survives the conversion:

#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#define min(a, b) ((a) < (b) ? (a) : (b))
#define bits(type) (sizeof(type) * CHAR_BIT)
/* Find the first power of two 2^pow for which 2^pow + 1 does not survive
   a round trip through test_t. */
#define testimax(test_t) do { \
  uintmax_t in = 1, out = 2; \
  size_t pow = 0, limit = min(bits(test_t), bits(uintmax_t)); \
  while (pow < limit && out == in + 1) { \
    in = in << 1; \
    out = (uintmax_t)(test_t)(in + 1); \
    ++pow; \
  } \
  if (pow == limit) \
    puts(#test_t " is as precise as longest integer type"); \
  else printf(#test_t " conversion imprecise for 2^%zu+1:\n" \
    "   in: %ju\n  out: %ju\n\n", pow, in + 1, out); \
} while (0)

int main(void)
{
    testimax(float);
    testimax(double);
    return 0;
}

The output of the above code:

float conversion imprecise for 2^24+1:
   in: 16777217
  out: 16777216

double conversion imprecise for 2^53+1:
   in: 9007199254740993
  out: 9007199254740992

Of course, a 64-bit double can represent numbers far larger than 2⁶⁴ as the exponent grows; it just cannot represent every integer along the way. The Wikipedia page on double-precision floating point describes the spacing:

Between 2⁵²=4,503,599,627,370,496 and 2⁵³=9,007,199,254,740,992 the representable numbers are exactly the integers. For the next range, from 2⁵³ to 2⁵⁴, everything is multiplied by 2, so the representable numbers are the even ones, etc. Conversely, for the previous range from 2⁵¹ to 2⁵², the spacing is 0.5, etc.

The absolute largest value a double can hold is listed further down that page: 0x7fefffffffffffff, which computes to (1 + (1 − 2⁻⁵²)) * 2¹⁰²³, or roughly 1.7976931348623157e308.

徒留西风 2024-11-15 14:50:44

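The IEEE floating-point wiki page says:

The original binary value will be preserved by converting to decimal and back again using:
5 decimal digits for binary16
9 decimal digits for binary32
17 decimal digits for binary64
36 decimal digits for binary128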

世界和平 2024-11-15 14:50:44

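If you're looking at a round trip from int to float and back to int, it first breaks down at 16,777,217 on my system. double showed no problems in the range I ran, which is expected: its first failure would be at 2⁵³+1, which is out of reach when long is 32 bits and impractically far for a brute-force loop when long is 64 bits: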

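#include <stdio.h>
#include <limits.h>

int main (void)
{
  long in, out;
  double d;
  float f;
  int dbad = 0, fbad = 0;  /* set once the first failure has been reported */

  /* Brute force: round-trip each value and report the first mismatch per
     type. A type is no longer tested after it fails, which also avoids
     converting an out-of-range float back to long near LONG_MAX. */
  for (in=0; in < (LONG_MAX) && !(dbad && fbad); in++) {
    if (!dbad) {
      d=in;
      out=d;
      if (in != out) {
        printf ("Double conversion imprecise for %ld\n", in);
        dbad = 1;
      }
    }
    if (!fbad) {
      f=in;
      out=f;
      if (in != out) {
        printf ("Float conversion imprecise for %ld\n", in);
        fbad = 1;
      }
    }
  }
  return 0;
}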
