Largest integer that can be stored in a double

Published 2024-08-13 07:24:27

What is the biggest "no-floating" integer that can be stored in an IEEE 754 double type without losing precision?

In other words, what would the following code fragment return:

UInt64 i = 0;
Double d = 0;

while (i == d)
{
        i += 1; 
        d += 1;
}
Console.WriteLine("Largest Integer: {0}", i-1);

感情洁癖 2024-08-20 07:24:27

The biggest/largest integer that can be stored in a double without losing precision is the same as the largest possible value of a double. That is, DBL_MAX or approximately 1.8 × 10^308 (if your double is an IEEE 754 64-bit double). It's an integer, and it's represented exactly.

What you might want to know instead is what the largest integer is, such that it and all smaller integers can be stored in IEEE 64-bit doubles without losing precision. An IEEE 64-bit double has 52 bits of mantissa, so it's 2^53 (and -2^53 on the negative side):

  • 2^53 + 1 cannot be stored, because the 1 at the start and the 1 at the end have too many zeros in between.
  • Anything less than 2^53 can be stored, with 52 bits explicitly stored in the mantissa, and then the exponent in effect giving you another one.
  • 2^53 obviously can be stored, since it's a small power of 2.

Or another way of looking at it: once the bias has been taken off the exponent, and ignoring the sign bit as irrelevant to the question, the value stored by a double is a power of 2, plus a 52-bit integer multiplied by 2^(exponent − 52). So with exponent 52 you can store all values from 2^52 through to 2^53 − 1. Then with exponent 53, the next number you can store after 2^53 is 2^53 + 1 × 2^(53 − 52). So loss of precision first occurs with 2^53 + 1.

女中豪杰 2024-08-20 07:24:27

9007199254740992 (that's 9,007,199,254,740,992 or 2^53) with no guarantees :)

Program

#include <math.h>
#include <stdio.h>

int main(void) {
  double dbl = 0; /* I started with 9007199254000000, a little less than 2^53 */
  while (dbl + 1 != dbl) dbl++;
  printf("%.0f\n", dbl - 1);
  printf("%.0f\n", dbl);
  printf("%.0f\n", dbl + 1);
  return 0;
}

Result

9007199254740991
9007199254740992
9007199254740992

变身佩奇 2024-08-20 07:24:27

The largest integer that can be represented in IEEE 754 double (64-bit) is the same as the largest value that the type can represent, since that value is itself an integer.

This is represented as 0x7FEFFFFFFFFFFFFF, which is made up of:

  • The sign bit 0 (positive) rather than 1 (negative)
  • The maximum exponent 0x7FE (2046 which represents 1023 after the bias is subtracted) rather than 0x7FF (2047 which indicates a NaN or infinity).
  • The maximum mantissa 0xFFFFFFFFFFFFF which is 52 bits all 1.

In binary, the value is the implicit 1 followed by another 52 ones from the mantissa, then 971 zeros (1023 - 52 = 971) from the exponent.

The exact decimal value is:

179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368

This is approximately 1.8 × 10^308.

ㄖ落Θ余辉 2024-08-20 07:24:27

Wikipedia has this to say in the same context with a link to IEEE 754:

On a typical computer system, a 'double precision' (64-bit) binary floating-point number has a coefficient of 53 bits (one of which is implied), an exponent of 11 bits, and one sign bit.

2^53 is just over 9 * 10^15.

妖妓 2024-08-20 07:24:27

You need to look at the size of the mantissa. An IEEE 754 64 bit floating point number (which has 52 bits, plus 1 implied) can exactly represent integers with an absolute value of less than or equal to 2^53.

平生欢 2024-08-20 07:24:27

It is true that, for 64-bit IEEE754 double, all integers up to 9007199254740992 == 2^53 can be exactly represented.

However, it is also worth mentioning that all representable numbers beyond 4503599627370496 == 2^52 are integers.
Beyond 2^52 it becomes meaningless to test whether or not they are integers, because they are all implicitly rounded to a nearby representable value.

In the range 2^51 to 2^52, the only non-integer values are the midpoints ending with ".5", meaning any integer test after a calculation must be expected to yield at least 50% false answers.

Below 2^51 we also have ".25" and ".75", so comparing a number with its rounded counterpart in order to determine if it may be integer or not starts making some sense.

TLDR: If you want to test whether a calculated result may be integer, avoid numbers larger than 2251799813685248 == 2^51.

一场春暖 2024-08-20 07:24:27

As others have noted, I will assume that the OP asked for the largest floating-point value such that all whole numbers less than itself are precisely representable.

You can use FLT_MANT_DIG and DBL_MANT_DIG defined in float.h to not rely on the explicit values (e.g., 53):

#include <stdio.h>
#include <float.h>

int main(void)
{
    /* 1LL, not 1L: long may be only 32 bits, and shifting 1L by
       DBL_MANT_DIG (53) would then be undefined behavior. */
    printf("%d, %.1f\n", FLT_MANT_DIG, (float)(1LL << FLT_MANT_DIG));
    printf("%d, %.1f\n", DBL_MANT_DIG, (double)(1LL << DBL_MANT_DIG));
}

outputs:

24, 16777216.0
53, 9007199254740992.0
后eg是否自 2024-08-20 07:24:27

Doubles, the "Simple" Explanation

The largest "double" number (double precision floating point number) is typically a 64-bit or 8-byte number expressed as:

1.79E308
or
1.79 x 10 (to the power of) 308

As you can guess, 10 to the power of 308 is a GIGANTIC NUMBER, like 170000000000000000000000000000000000000000000 and even larger!

On the other end of the scale, double precision floating point 64-bit numbers support tiny tiny decimal numbers of fractions using the "dot" notation, the smallest being:

4.94E-324
or
4.94 x 10 (to the power of) -324

Anything multiplied times 10 to the power of a negative power is a tiny tiny decimal, like 0.0000000000000000000000000000000000494 and even smaller.

But what confuses people is they will hear computer nerds and math people say, "but that number only has about 15 accurate digits". It turns out that the values described above are the all-time MAXIMUM and MINIMUM values the computer can store and present from memory. But they lose accuracy and the ability to create numbers LONG BEFORE they get that big. So most programmers AVOID the maximum double number possible, and try and stick within a known, much smaller range.

But why? And what is the best maximum double number to use? I could not find the answer reading dozens of bad explanations on math sites online. So this SIMPLE explanation may help you below. It helped me!!

DOUBLE NUMBER FACTS and FLAWS

JavaScript (which also uses the 64-bit double precision storage system for numbers in computers) uses double precision floating point numbers for storing all known numerical values. It thus uses the same MAX and MIN ranges shown above. But most languages use a typed numerical system with ranges to avoid accuracy problems. The double and float number storage systems, however, seem to all share the same flaw of losing numerical precision as they get larger and smaller. I will explain why as it affects the idea of "maximum" values...

To address this, JavaScript has what is called a Number.MAX_SAFE_INTEGER value, which is 9007199254740991. This is the most accurate number it can represent for Integers, but is NOT the largest number that can be stored. It is accurate because it guarantees any number equal to or less than that value can be viewed, calculated, stored, etc. Beyond that range, there are "missing" numbers. The reason is that double precision numbers AFTER 9007199254740991 use an additional number to multiply them to larger and larger values, including the true max number of 1.79E308. That new number is called an exponent.

THE EVIL EXPONENT

It happens to be the fact that this max value of 9007199254740991 is also the max number you can store in the 53 bits of computer memory used in the 64-bit storage system. This 9007199254740991 number stored in the 53-bits in memory is the largest value possible that can be stored directly in the mantissa section of memory of a typical double precision floating point number used by JavaScript.

9007199254740991, by-the-way, is in a format we call Base10 or decimal, the number Humans use. But it is also stored in computer memory as this 53-bit value...

11111111111111111111111111111111111111111111111111111

This is the maximum number of bits in which computers can actually store the integer part of a double precision number using the 64-bit memory system.

To get to the even LARGER max number possible (1.79E308), JavaScript has to use an extra trick called the exponent to multiply it to larger and larger values. So there is an 11-bit exponent number next to the 53-bit mantissa value in computer memory above that allows the number to grow much larger and much smaller, creating the final range of numbers a double is expected to represent. (Also, there is a single bit for positive and negative numbers, as well.)

After the computer reaches this limit of max Integer value (around ~9 quadrillion) and fills up the mantissa section of memory with 53 bits, JavaScript uses a new 11-bit storage area for the exponent which allows much larger integers to grow (up to 10 to the power of 308!) and much smaller decimals to get smaller (10 to the power of -324!). Thus, this exponent number allows for a full range of large and small decimals to be created with the floating radix or decimal point moving up and down the number, creating the complex fractional or decimal values you expect to see. Again, this exponent is another large number stored in 11 bits, and itself has a max value of 2047.

You will notice 9007199254740991 is a max integer, but does not explain the larger MAX value possible in storage or the MINIMUM decimal number, or even how decimal fractions get created and stored. How does this computer bit value create all that?

The answer is again, through the exponent!

It turns out that the exponent 11-bit value is divided itself into a positive and negative value so that it can create large integers but also small decimal numbers.

To do so, it has its own positive and negative range, created by subtracting a bias of 1023 from the stored 11-bit value to get a range of exponents from +1023 down to -1022 (the all-zeros and all-ones patterns are reserved), creating the positive/negative exponent range. To then get the FINAL DOUBLE NUMBER, the mantissa (9007199254740991) is scaled by the exponent (plus the single sign bit added) to get the final value! This allows the exponent to multiply the mantissa value to even larger integer ranges beyond 9 quadrillion, but also go the opposite way with the decimal to very tiny fractions.

However, the -+1023 number stored in the exponent is not multiplied with the mantissa to get the double, but used to raise the number 2 to the power of the exponent. The exponent is a decimal number, but it is not applied as a decimal exponent like 10 to the power of 1023. It is applied to the Base2 system again and creates a value of 2 to the power of (the exponent number).

That value generated is then multiplied to the mantissa to get the MAX and MIN number allowed to be stored in JavaScript, as well as all the larger and smaller values within the range. It uses "2" rather than 10 for precision purposes, so with each increase in the exponent value, it only doubles the mantissa value. This reduces the loss of numbers. But this exponent multiplier also means it will lose an increasing range of numbers in doubles as it grows, to the point where as you reach the MAX stored exponent and mantissa possible, very large swaths of numbers disappear from the final calculated number, and so certain numbers are now not possible in math calculations!

That is why most use the SAFE max integer ranges (9007199254740991 or less), as most know very large and small numbers in JavaScript are highly inaccurate! Also note that 2 to the power of -1022 (and further down still, via subnormals, to 2 to the power of -1074) gets the MIN numbers or small decimal fractions you associate with a typical "float". The exponent is thus used to translate the mantissa integer to very large and small numbers up to the Maximum and Minimum ranges it can store.

Notice that the 2 to power of 1023 translates to a decimal exponent using 10 to the power of 308 for max values. That allows you to see the number in Human values, or Base10 numerical format of the binary calculation. Often math experts do not explain that all these values are the same number just in different bases or formats.

THE TRUE MAX FOR DOUBLES IS INFINITY

Finally, what happens when integers reach the MAX number possible, or the smallest decimal fraction possible?

It turns out, double precision floating point numbers have reserved a set of bit values for the 64-bit exponent and mantissa values to store four other possible numbers:

  1. +Infinity
  2. -Infinity
  3. +0
  4. -0

For example, +0 in double numbers stored in 64-bit memory is a large row of empty bits in computer memory. Below is what happens after you go beyond the smallest decimal possible (4.94E-324) in using a Double precision floating point number. It becomes +0 after it runs out of memory! The computer will return +0, but stores 0 bits in memory. Below is the FULL 64-bit storage design in bits for a double in computer memory. The first bit controls +(0) or -(1) for positive or negative numbers, the 11-bit exponent is next (all zeros here), and the large block of 52 bits for the mantissa or significand, which represents 0. So +0 is represented by all zeroes!

0 00000000000 0000000000000000000000000000000000000000000000000000

If the double reaches its positive max or min, or its negative max or min, many languages will always return one of those values in some form. However, some return NaN, or overflow, exceptions, etc. How that is handled is a different discussion. But often these four values are your TRUE min and max values for double. By returning these special values, you at least have a representation of the max and min in doubles that explains the last forms of the double type that cannot be stored or explained rationally.

SUMMARY

So the MAXIMUM and MINIMUM ranges for positive and negative Doubles are as follows:

MAXIMUM TO MINIMUM POSITIVE VALUE RANGE
1.79E308 to 4.94E-324 (+Infinity to +0 for out of range)

MAXIMUM TO MINIMUM NEGATIVE VALUE RANGE
-4.94E-324 to -1.79E308 (-0 to -Infinity for out of range)

But the SAFE and ACCURATE MAX and MIN range is really:
9007199254740991 (max) to -9007199254740991 (min)

So you can see with +-Infinity and +-0 added, Doubles have extra max and min ranges to help you when you exceed the max and mins.

As mentioned above, when you go from the largest positive value to the smallest decimal positive value or fraction, the bits zero out and you get 0. Past 4.94E-324 the double cannot store any smaller decimal fraction value, so it collapses to +0 in the bit registry. The same event happens for tiny negative decimals, which collapse past their value to -0. As you know -0 = +0, so though not the same values stored in memory, in applications they often are coerced to 0. But be aware many applications do deliver signed zeros!

The opposite happens to the large values...past 1.79E308 they turn into +Infinity and -Infinity for the negative version. This is what creates all the weird number ranges in languages like JavaScript. Double precision numbers have weird returns!

Note that the MINIMUM SAFE RANGE for decimals/fractions is not shown above as it varies based on the precision needed in the fraction. When you combine the integer with the fractional part, the decimal place accuracy drops away quickly as it goes smaller. There are many discussions and debates about this online. No one ever has an answer. The list below might help. You might need to change these ranges listed to much smaller values if you want guaranteed precision. As you can see, if you want to support up to 9-decimal place accuracy in floats, you will need to limit MAX values in the mantissa to these values. Precision means how many decimal places you need with accuracy. Unsafe means past these values, the number will lose precision and have missing numbers:

            Precision   Unsafe 
            1           562,949,953,421,312
            2           703,687,441,770,664
            3           87,960,930,220,208
            4           5,497,558,130,888
            5           68,719,476,736
            6           8,589,934,592
            7           536,870,912
            8           67,108,864
            9           8,388,608

It took me awhile to understand the TRUE limits of Double precision floating point numbers and computers. I created this simple explanation above after reading so much MASS CONFUSION from math experts online who are great at creating numbers but terrible at explaining anything! I hope I helped you on your coding journey - Peace :)

來不及說愛妳 2024-08-20 07:24:27

Consider your compiler, which may not follow the current IEEE 754 Double Type specification. Here is a revised snippet to try in VB6 or in Excel VBA. It exits the loop at 999,999,999,999,999 which is only 1/9 the expected value. This doesn't test all numbers, so there may be a lower number where an increment by 1 does not increment the sum. You can also try the following line in the
debug window: Print Format(1E15# + 1#,"#,###")

    Microsoft VB6, Microsoft Excel 2013 VBA (Both obsolete) 
    Sub TestDbl()
    Dim dSum    As Double      'Double Precision Sum
    Dim vSum    As Variant     'Decimal Precision Sum
    Dim vSumL   As Variant     'Last valid comparison
   
    Dim dStep   As Double
    Dim vStep   As Variant
   
    dStep = 2# ^ 49#           'Starting step
    vStep = CDec(dStep)
   
    dSum = dStep               'Starting Sums
    vSum = vStep
    vSumL = vSum
   
   
    Debug.Print Format(dSum, "###,###,###,###,###,###,###"); " "; _
                Format(vSum, "###,###,###,###,###,###,###"); " "; _
                vStep; " "; Now()
    Do
       dSum = dSum + dStep     'Increment Sums
       vSum = CDec(vSum + vStep)
                              
       If dSum <> vSum Then
                              'Print bad steps
          Debug.Print Format(dSum, "###,###,###,###,###,###,###"); " "; _
                      Format(vSum, "###,###,###,###,###,###,###"); " "; _ 
                      vStep; " "; Now()
                              'Go back 2 steps
          vSum = CDec(vSumL - vStep)
          dSum = CDbl(vSum)
                              'Exit if Step is 1
          If dStep < 2 Then Exit Do
                              'Adjust Step, if <1 make 1
          vStep = CDec(Int(vStep / 4))
          If vStep < 2 Then vStep = CDec(1)
          dStep = CDbl(vStep)
       End If                  'End check for matching sums
       vSumL = vSum            'Last Valid reading
       DoEvents
    Loop                       'Take another step
                               'Last Valid step
    Debug.Print Format(dSum, "###,###,###,###,###,###,###"); " "; _
                Format(vSum, "###,###,###,###,###,###,###"); " ";  _
                vStep; " "; Now()
   
    End Sub
我要还你自由 2024-08-20 07:24:27

UPDATE 1 :

just realized 5 ^ 1074 is NOT the true upper limit of what you can get for free out of IEEE 754 double-precision floating point, because I only counted denormalized exponents and forgot the fact the mantissa itself can fit another 22 powers of 5, so to the best of my understanding, the largest power of 5 one can get for free out of the double-precision format is ::

largest power of 5 :

  • 5 ^ 1096

largest odd number :

  • 5 ^ 1074 x 9007199254740991

  • 5 ^ 1074 x ( 2 ^ 53 - 1 )

mawk 'BEGIN { OFS = "\f\r\t";

 CONVFMT = "IEEE754 :: 4-byte word :: %.16lX"; 
   
 print "", 
 sprintf("%.*g", __=(_+=_+=_^=_<_)^++_+_*(_+_),
                ___=_=((_+_)/_)^-__),   (_ ""),
                        \
 sprintf("%.*g",__,_=_*((_+=(_^=!_)+(_+=_))*_\
                           )^(_+=_++)), (_ ""),
                           \
 sprintf("%.*g",__,_=___*=  \
        (_+=_+=_^=_<_)^--_^_/--_-+--_), (_ "") }'
  • 4.-324

     — IEEE754 :: 4-byte word :: 0000000000000001
    
    494065645841246544176568792......682506419718265533447265625 } 751 digits:
      5^1,074
    
  • 1.-308

     — IEEE754 :: 4-byte word :: 000878678326EAC9
    
    117794429264365802806989858......070818103849887847900390625 } 767 digits:
      5^1,096
    
  • 4.-308

     — IEEE754 :: 4-byte word :: 001FFFFFFFFFFFFF
    
    445014771701440227211481959......317493580281734466552734375 } 767 digits:
          5^1,074
          6361
          69431
          20394401

Here is a quick awk snippet that prints every positive power of 2 up to 1023 and every positive power of 5 up to 1096, plus their common power of zero, optimized for running both with and without a bigint library:

{m,g,n}awk' BEGIN {

 CONVFMT = "%." ((_+=_+=_^=_<_)*_+--_*_++)(!++_) "g"
    OFMT = "%." (_*_) "g"

 if (((_+=_+_)^_%(_+_))==(_)) {
    print __=_=\
            int((___=_+=_+=_*=++_)^!_)
     OFS = ORS
    while (--___) {
        print int(__+=__), int(_+=_+(_+=_))
    }
    __=((_+=_+=_^=!(__=_))^--_+_*_) substr("",_=__)
    do {
        print _+=_+(_+=_) } while (--__)
    exit
 } else { _=_<_ }

    __=((___=_+=_+=++_)^++_+_*(_+_--))
      _=_^(-(_^_--))*--_^(_++^_^--_-__)
  _____=-log(_<_)
    __^=_<_
   ___=-___+--___^___

 while (--___) {
     print ____(_*(__+=__+(__+=__))) }
 do {
     print ____(_) } while ((_+=_)<_____)
 }

 function ____(__,_) {
     return (_^=_<_)<=+__ \
     ?              sprintf( "%.f", __) \
     : substr("", _=sprintf("%.*g", (_+=++_)^_*(_+_),__),
         gsub("^[+-]*[0][.][0]*|[.]|[Ee][+-]?[[:digit:]]+$","",_))_
 }'

=============================

Depends how flexible your definitions of "represent" and "representable" are —

Despite what the typical literature says, the integer in IEEE 754 double precision that is actually "largest" — with no bigint library or external function calls whatsoever, with a completely full mantissa, and computable, storable and printable — is in fact:

9,007,199,254,740,991 * 5 ^ 1074 (~2546.750773909... bits)

  4450147717014402272114819593418263951869639092703291
  2960468522194496444440421538910330590478162701758282
  9831782607924221374017287738918929105531441481564124
  3486759976282126534658507104573762744298025962244902
  9037796981144446145705102663115100318287949527959668
  2360399864792509657803421416370138126133331198987655
  1545144031526125381326665295130600018491776632866075
  5595837392240989947807556594098101021612198814605258
  7425791790000716759993441450860872056815779154359230
  1891033496486942061405218289243144579760516365090360
  6514140377217442262561590244668525767372446430075513
  3324500796506867194913776884780053099639677097589658
  4413789443379662199396731693628045708486661320679701
  7728916080020698679408551343728867675409720757232455
  434770912461317493580281734466552734375

I used xxhash to compare this against gnu-bc and confirmed it is indeed identical, with no loss of precision. And despite how that exponent range is usually labeled, there is nothing "denormalized" about this number at all.

Try it on your own system if you don't believe me. (I printed this one out via off-the-shelf mawk) — and you can get it fairly easily too, with:

  1. one (1) exponentiation (^ aka **) op,
  2. one (1) multiplication (*) op,
  3. one (1) sprintf() call, and
  4. one (1) of either
    substr() or regex-gsub()
    to perform the necessary cleanup

Much like the frequently cited 1.79…E308 number,

  • both are mantissa-limited,
  • both are exponent-limited,
  • both have an absurdly large ULP (unit in the last place),
  • and both are just 1 step away from overwhelming the floating-point unit into overflow or underflow, past which it can no longer hand you back a usable answer.

Negate the binary exponent of your workflow and you can do your operations entirely in this space, then simply invert it once more at the end of the workflow to get back to the side we usually consider "larger".

But keep in mind that in the inverted exponent realm, there is no "gradual overflow".

— The 4Chan Teller

UPDATE 1 :

Just realized 5 ^ 1074 is NOT the true upper limit of what you can get for free out of IEEE 754 double-precision floating point: I only counted the denormalized exponents and forgot that the mantissa itself can fit another 22 powers of 5 (since 5^22 < 2^53). So, to the best of my understanding, the largest power of 5 one can get for free out of the double-precision format is ::

largest power of 5 :

  • 5 ^ 1096

largest odd number :

  • 5 ^ 1074 x 9007199254740991

  • 5 ^ 1074 x ( 2 ^ 53 - 1 )
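That bookkeeping can be cross-checked outside of awk; a minimal sketch in Python (chosen here for its exact big-integer and Decimal arithmetic; `math.ldexp` keeps every construction bit-exact — the mawk demo below prints the same values):

```python
import math
from decimal import Decimal

# the full odd mantissa 2**53 - 1 holds exactly 22 more powers of 5
assert 5**22 < 2**53 - 1 < 5**23

# smallest positive double: 2**-1074 == 5**1074 / 10**1074, so its
# exact decimal digits ARE the digits of 5**1074 (751 of them)
sign, digits, exp = Decimal(math.ldexp(1.0, -1074)).as_tuple()
assert exp == -1074
assert int(''.join(map(str, digits))) == 5 ** 1074

# packing 5**22 into the mantissa pushes that to 5**1096 (767 digits)
sign, digits, exp = Decimal(float(5**22) * math.ldexp(1.0, -1074)).as_tuple()
assert int(''.join(map(str, digits))) == 5 ** 1096
```

The trick is the same one the answer uses: the digit string lives in the double's exact decimal expansion, not in its magnitude.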

mawk 'BEGIN { OFS = "\f\r\t";

 CONVFMT = "IEEE754 :: 4-byte word :: %.16lX"; 
   
 print "", 
 sprintf("%.*g", __=(_+=_+=_^=_<_)^++_+_*(_+_),
                ___=_=((_+_)/_)^-__),   (_ ""),
                        \
 sprintf("%.*g",__,_=_*((_+=(_^=!_)+(_+=_))*_\
                           )^(_+=_++)), (_ ""),
                           \
 sprintf("%.*g",__,_=___*=  \
        (_+=_+=_^=_<_)^--_^_/--_-+--_), (_ "") }'
  • 4.940656458412465441765687928682213723650598026143247644255856825006755072702087518652998363616359923797965646954457177309266567103559397963987747960107818781263007131903114045278458171678489821036887186360569987307230500063874091535649843873124733972731696151400317153853980741262385655911710266585566867681870395603106249319452715914924553293054565444011274801297099995419319894090804165633245247571478690147267801593552386115501348035264934720193790268107107491703332226844753335720832431936092382893458368060106011506169809753078342277318329247904982524730776375927247874656084778203734469699533647017972677717585125660551199131504891101451037862738167250955837389733598993664809941164205702637090279242767544565229087538682506419718265533447265625e-324

      — IEEE754 :: 4-byte word :: 0000000000000001
    
    494065645841246544176568792......682506419718265533447265625 } 751 dgts :      
      5^1,074    
    
  • 1.1779442926436580280698985883431944188238616052015418158187524855152976686244219586021896275559329804892458073984282439492384355315111632261247033977765604928166883306272301781841416768261169960586755720044541328685833215865788678015827760393916926318959465387821953663477851727634395732669139543975751084522891987808004020022041120326339133484493650064495265010111570347355174765803347028811562651566216206901711944564705815590623254860079132843479610128658074120767908637153514231969910697784644086106916351461663273587631725676246505444808791274797874748064938487833137213363849587926231550453981511635715193075144590522172925785791614297511667878003519179715722536405560955202126362715257889359212587458533154881546706053453699158950485070818103849887847900390625e-308

      — IEEE754 :: 4-byte word :: 000878678326EAC9
    
    117794429264365802806989858......070818103849887847900390625 } 767 dgts :
      5^1,096
    
  • 4.4501477170144022721148195934182639518696390927032912960468522194496444440421538910330590478162701758282983178260792422137401728773891892910553144148156412434867599762821265346585071045737627442980259622449029037796981144446145705102663115100318287949527959668236039986479250965780342141637013812613333119898765515451440315261253813266652951306000184917766328660755595837392240989947807556594098101021612198814605258742579179000071675999344145086087205681577915435923018910334964869420614052182892431445797605163650903606514140377217442262561590244668525767372446430075513332450079650686719491377688478005309963967709758965844137894433796621993967316936280457084866613206797017728916080020698679408551343728867675409720757232455434770912461317493580281734466552734375e-308

      — IEEE754 :: 4-byte word :: 001FFFFFFFFFFFFF
    
    445014771701440227211481959......317493580281734466552734375 } 767 dgts :
          5^1,074 x 6361 x 69431 x 20394401
          ( 6361 x 69431 x 20394401 = 2^53 - 1 )
    

And here's a quick awk code snippet that prints every positive power of 2 up to 2^1023, every positive power of 5 up to 5^1096, and their common zeroth power, optimized both with and without a bigint library :

{m,g,n}awk' BEGIN {

 CONVFMT = "%." ((_+=_+=_^=_<_)*_+--_*_++)(!++_) "g"
    OFMT = "%." (_*_) "g"

 if (((_+=_+_)^_%(_+_))==(_)) {
    print __=_=\
            int((___=_+=_+=_*=++_)^!_)
     OFS = ORS
    while (--___) {
        print int(__+=__), int(_+=_+(_+=_))
    }
    __=((_+=_+=_^=!(__=_))^--_+_*_) substr("",_=__)
    do {
        print _+=_+(_+=_) } while (--__)
    exit
 } else { _=_<_ }

    __=((___=_+=_+=++_)^++_+_*(_+_--))
      _=_^(-(_^_--))*--_^(_++^_^--_-__)
  _____=-log(_<_)
    __^=_<_
   ___=-___+--___^___

 while (--___) {
     print ____(_*(__+=__+(__+=__))) }
 do {
     print ____(_) } while ((_+=_)<_____)
 }

 function ____(__,_) {
     return (_^=_<_)<=+__ \
     ?              sprintf( "%.f", __) \
     : substr("", _=sprintf("%.*g", (_+=++_)^_*(_+_),__),
         gsub("^[+-]*[0][.][0]*|[.]|[Ee][+-]?[[:digit:]]+$","",_))_
 }'

=============================

Depends on how flexible you are with the definitions of "represented" and "representable" -

Despite what the typical literature says, the largest integer actually computable, storable, and printable in IEEE 754 double precision, with a completely full mantissa and without any bigint library or external function calls, is :

9,007,199,254,740,991 * 5 ^ 1074 (~2546.750773909... bits)

  4450147717014402272114819593418263951869639092703291
  2960468522194496444440421538910330590478162701758282
  9831782607924221374017287738918929105531441481564124
  3486759976282126534658507104573762744298025962244902
  9037796981144446145705102663115100318287949527959668
  2360399864792509657803421416370138126133331198987655
  1545144031526125381326665295130600018491776632866075
  5595837392240989947807556594098101021612198814605258
  7425791790000716759993441450860872056815779154359230
  1891033496486942061405218289243144579760516365090360
  6514140377217442262561590244668525767372446430075513
  3324500796506867194913776884780053099639677097589658
  4413789443379662199396731693628045708486661320679701
  7728916080020698679408551343728867675409720757232455
  434770912461317493580281734466552734375
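To sanity-check that 767-digit value without reaching for gnu-bc or xxhash, here is a hedged Python equivalent (exact binary-to-decimal conversion via the `decimal` module; `math.ldexp` keeps the construction exact):

```python
import math
from decimal import Decimal

# 0x001FFFFFFFFFFFFF: the largest double below 2**-1021, ~4.45e-308
big = (2**53 - 1) * math.ldexp(1.0, -1074)

sign, digits, exp = Decimal(big).as_tuple()   # exact decimal expansion
assert exp == -1074

n = int(''.join(map(str, digits)))            # strip the decimal point
assert n == (2**53 - 1) * 5**1074             # the integer quoted above
assert len(str(n)) == 767                     # all 767 digits survive
assert n.bit_length() == 2547                 # ~2546.75 bits, as claimed
```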

I used xxhash to compare this against gnu-bc and confirmed it's indeed identical, with no precision lost. There's nothing "denormalized" about this number at all, despite the exponent range being labeled as such.

Try it on your own system if you don't believe me (I got this printout via off-the-shelf mawk) - and you can get to it fairly easily too :

  1. one(1) exponentiation/power (^ aka **) op,
  2. one(1) multiplication (*) op,
  3. one (1) sprintf() call, and
  4. either one(1) of
    substr() or regex-gsub()
    to perform the necessary cleanup

Just like the 1.79…E308 number frequently mentioned,

  • both are mantissa limited
  • both are exponent limited
  • both have ridiculously large ULPs (unit in last place)
  • and both are exactly 1 step from overwhelming the floating-point unit with an overflow or underflow instead of giving you back a usable answer
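Those bullet points can be checked directly, assuming CPython 3.9+ for `math.ulp` and `math.nextafter` (the names `top` and `tiny` are mine):

```python
import math, sys

top = sys.float_info.max                 # the 1.79...E308 ceiling
assert math.ulp(top) == math.ldexp(1.0, 971)       # ULP ~ 2e292: enormous
assert math.nextafter(top, math.inf) == math.inf   # 1 step to overflow

tiny = math.ldexp(1.0, -1074)            # the 5e-324 floor: ULP is itself
assert math.ulp(tiny) == tiny
assert math.nextafter(tiny, 0.0) == 0.0  # 1 step to underflow
```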

Negate the binary exponents of the workflow and you can have the ops done entirely in this space, then just invert once more at the tail end of the workflow to get back to the side we typically consider "larger" -

but keep in mind that in the inverted exponent realm, there's no "gradual overflow".
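That asymmetry is easy to demonstrate; a small Python sketch (`math.ldexp` scales by powers of 2, so it walks the exponent range exactly):

```python
import math

# gradual underflow exists: denormals shrink smoothly to zero
assert math.ldexp(1.0, -1074) > 0.0      # smallest denormal is fine
assert math.ldexp(1.0, -1075) == 0.0     # one step further: hard zero

# but mirror those exponents upward and there is no such cushion
assert math.ldexp(1.0, 1023) < math.inf  # largest power of 2 is fine
try:
    math.ldexp(1.0, 1074)                # mirror of the denormal floor
    cushioned = True
except OverflowError:
    cushioned = False                    # it just blows up
assert not cushioned
```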

— The 4Chan Teller
