Using floats or doubles instead of integers

Posted on 2024-11-08 14:50:44

I know that the default implementation of Lua uses floating point numbers only, thus circumventing the problem of dynamically determining the subtype of a number before choosing which variant of math function to use.

My question is -- if I try to emulate integers as doubles (or floats) in standard C99, is there a reliable (and simple) way to tell what is the maximum value representable precisely?

I mean, if I use 64-bit floats to represent integers, I certainly cannot represent all 64-bit integers (the pigeonhole principle applies here). How can I tell the maximum integer that is representable?

(Trying to list all values is not a solution -- if, for example, I'm using doubles in a 64-bit architecture, as I'd have to list 2^{64} numbers)

Thanks!

Comments (3)

请你别敷衍 2024-11-15 14:50:44

The largest integer N such that every integer from 0 to N is exactly representable is 2^53 (9007199254740992) for a 64-bit double and 2^24 (16777216) for a 32-bit float. See the significand digit counts on the Wikipedia page for IEEE floating point numbers.

Verifying this in Lua is pretty simple:

local maxdouble = 2^53

-- one less than the maximum can be represented precisely
print (string.format("%.0f",maxdouble-1)) --> 9007199254740991
-- the maximum itself can be represented precisely
print (string.format("%.0f",maxdouble))   --> 9007199254740992
-- one more than the maximum gets rounded down
print (string.format("%.0f",maxdouble+1)) --> 9007199254740992 again

If we don't have the IEEE-defined field sizes handy, we can still use what we know about the design of floating point numbers and find these values with a simple loop over the powers of two:

#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#define min(a, b) ((a) < (b) ? (a) : (b))
#define bits(type) (sizeof(type) * CHAR_BIT)
/* Doubles `in` until (test_t)in + 1 no longer converts back to in + 1. */
#define testimax(test_t) { \
  uintmax_t in = 1, out = 2; \
  volatile test_t sum; /* volatile: force rounding to test_t */ \
  size_t pow = 0, limit = min(bits(test_t), bits(uintmax_t)); \
  while (pow < limit && out == in + 1) { \
    in = in << 1; \
    sum = (test_t) in + 1; \
    out = (uintmax_t) sum; \
    ++pow; \
  } \
  if (pow == limit) \
    puts(#test_t " is as precise as longest integer type"); \
  else printf(#test_t " conversion imprecise for 2^%zu+1:\n" \
    "   in: %ju\n  out: %ju\n\n", pow, in + 1, out); \
}

int main(void)
{
    testimax(float);
    testimax(double);
    return 0;
}

The output of the above code:

float conversion imprecise for 2^24+1:
   in: 16777217
  out: 16777216

double conversion imprecise for 2^53+1:
   in: 9007199254740993
  out: 9007199254740992
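
Standard C99 also exposes the significand size directly through `<float.h>`, so the limit can be computed without probing at run time. The following is only a minimal sketch: the helper name `max_contiguous_integer` is hypothetical (it is not part of the answer above), and it assumes the result fits in a double.

```c
#include <float.h>

/* Largest N such that every integer in [0, N] is exactly representable
   in a floating type whose significand has `mant_dig` base-FLT_RADIX
   digits (pass FLT_MANT_DIG, DBL_MANT_DIG or LDBL_MANT_DIG).
   Hypothetical helper; uses repeated multiplication so that no
   <math.h> function is needed. */
static double max_contiguous_integer(int mant_dig)
{
    double n = 1.0;
    int i;
    for (i = 0; i < mant_dig; ++i)
        n *= FLT_RADIX;   /* exact: n stays a power of the radix */
    return n;
}
```

On an IEEE 754 system, `max_contiguous_integer(FLT_MANT_DIG)` should give 16777216 and `max_contiguous_integer(DBL_MANT_DIG)` should give 9007199254740992, matching the output of the loop above.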

Of course, due to the way floating-point precision works, a 64-bit double can represent numbers much larger than 2^64 as the floating exponent grows positive. The Wikipedia page on double-precision floating-point explains:

Between 2^52 = 4,503,599,627,370,496 and 2^53 = 9,007,199,254,740,992 the representable numbers are exactly the integers. For the next range, from 2^53 to 2^54, everything is multiplied by 2, so the representable numbers are the even ones, etc. Conversely, for the previous range from 2^51 to 2^52, the spacing is 0.5, etc.

The absolute largest value a double can hold is listed further down that page: 0x7fefffffffffffff, which computes to (1 + (1 − 2^−52)) × 2^1023, or roughly 1.7976931348623157e308.
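
The spacing described in the Wikipedia quote can be checked with a few lines of C. This is a sketch under stated assumptions: `spacing_at` is a hypothetical helper name, the format is assumed to be binary (`FLT_RADIX == 2`, as in IEEE 754 binary64), and only finite arguments of at least 1 are handled.

```c
#include <float.h>

/* Gap between x and the next larger double, for finite x >= 1.
   Hypothetical helper; assumes a binary floating-point format. */
static double spacing_at(double x)
{
    double p = 1.0;
    /* Find the largest power of two that does not exceed x. */
    while (p * 2.0 <= x)
        p *= 2.0;
    /* DBL_EPSILON is the gap just above 1.0; the gap scales with p. */
    return p * DBL_EPSILON;
}
```

With IEEE 754 doubles, `spacing_at(4503599627370496.0)` (2^52) should return 1, `spacing_at(9007199254740992.0)` (2^53) should return 2, and `spacing_at(2251799813685248.0)` (2^51) should return 0.5.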

徒留西风 2024-11-15 14:50:44

The Wikipedia page on IEEE floating point says:

The original binary value will be preserved by converting to decimal and back again using:

  • 5 decimal digits for binary16
  • 9 decimal digits for binary32
  • 17 decimal digits for binary64
  • 36 decimal digits for binary128
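
The digit counts above concern decimal round trips rather than integers, and they can be tried out directly. The sketch below is only an illustration: `round_trips` is a hypothetical helper name, and it assumes the C library's `snprintf` and `strtod` round correctly and that the default "C" locale is in effect.

```c
#include <stdio.h>
#include <stdlib.h>

/* Returns 1 if x survives being printed with `digits` significant
   decimal digits (digits >= 1) and parsed back, 0 otherwise.
   Hypothetical helper for illustration. */
static int round_trips(double x, int digits)
{
    char buf[64];
    /* "%.*e" prints one digit before the point plus (digits - 1) after. */
    snprintf(buf, sizeof buf, "%.*e", digits - 1, x);
    return strtod(buf, NULL) == x;
}
```

For a binary64 double such as `1.0 / 3.0`, `round_trips(1.0 / 3.0, 17)` is expected to return 1, while `round_trips(1.0 / 3.0, 15)` is expected to return 0 because 15 digits are too few to identify the value uniquely.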

世界和平 2024-11-15 14:50:44

If you're looking at converting an int to float and back to int, it breaks down at around 16,777,217 on my system (double didn't have any issues):

#include <stdio.h>
#include <limits.h>

int main (void)
{
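  long in, out;
  double d;
  float f;
  int double_failed = 0, float_failed = 0;

  /* Brute force: report only the first inexact value for each type.
     With a 64-bit long, the first inexact double (2^53 + 1) is too far
     away to reach in practice. */
  for (in=0; in < (LONG_MAX) && !(double_failed && float_failed); in++) {
    if (!double_failed) {
      d=in;
      out=d;
      if (in != out) {
        printf ("Double conversion imprecise for %ld\n", in);
        double_failed = 1;
      }
    }
    if (!float_failed) {
      f=in;
      out=f;
      if (in != out) {
        printf ("Float conversion imprecise for %ld\n", in);
        float_failed = 1;
      }
    }
  }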
  return 0;
}
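
When only a single value needs checking, the same round-trip idea works without a loop. A minimal sketch, assuming a hypothetical helper name `fits_exactly_in_float` and a double that can hold every `int32_t` exactly (true for IEEE 754 binary64):

```c
#include <stdint.h>

/* Returns 1 if n converts to float without rounding, 0 otherwise.
   The comparison is done in double, so nothing is converted back to
   an integer type and no out-of-range conversion can occur.
   Hypothetical helper; assumes double represents every int32_t exactly. */
static int fits_exactly_in_float(int32_t n)
{
    volatile float f = (float)n;   /* volatile forces rounding to float */
    return (double)f == (double)n;
}
```

With IEEE 754 binary32 floats, 16777216 and 16777218 are expected to pass this check, while 16777217 is expected to fail it.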