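What is the minimum number of significant decimal digits in a floating point literal to represent the value as correct as possible?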

Posted 2025-01-23 20:00:16

For example, using IEEE-754 32-bit binary floating point, let's represent the value of 1/3. It cannot be done exactly, but 0x3eaaaaab is the closest representable value to 1/3. You might want to write the value in decimal and let the compiler convert the decimal literal to a binary floating-point number.

0.333333f    -> 0x3eaaaa9f (0.333332986)
0.3333333f   -> 0x3eaaaaaa (0.333333313)
0.33333333f  -> 0x3eaaaaab (0.333333343)
0.333333333f -> 0x3eaaaaab (0.333333343)

You can see that 8 significant decimal digits are enough to represent the value as correctly as possible (closest to the actual value).

I tested with π and e (the base of the natural logarithm), and both needed 8 decimal digits to reach the closest value.

3.14159f    -> 0x40490fd0 (3.14159012)
3.141593f   -> 0x40490fdc (3.14159298)
3.1415927f  -> 0x40490fdb (3.14159274)
3.14159265f -> 0x40490fdb (3.14159274)

2.71828f    -> 0x402df84d (2.71828008)
2.718282f   -> 0x402df855 (2.71828198)
2.7182818f  -> 0x402df854 (2.71828175)
2.71828183f -> 0x402df854 (2.71828175)

However, √2 appears to need 9 digits.

1.41421f     -> 0x3fb504d5 (1.41420996)
1.414214f    -> 0x3fb504f7 (1.41421402)
1.4142136f   -> 0x3fb504f4 (1.41421366)
1.41421356f  -> 0x3fb504f3 (1.41421354)
1.414213562f -> 0x3fb504f3 (1.41421354)

https://godbolt.org/z/W5vEcs695

Looking at these results, it's probably right that a decimal floating-point literal with 9 significant digits is sufficient to produce the most correct 32-bit binary floating-point value, and in practice something like 12 to 15 digits would certainly work if the space for the extra digits doesn't matter.

But I'm interested in the math behind it. How can one be sure that 9 digits is enough in this case? What about double or even arbitrary precision, is there a simple formula to derive the number of digits needed?

The current answers and the links in the comments confirm that 9 digits is enough for most cases, but I've found a counterexample where 9 digits is not enough. In fact, unlimited precision in the decimal text is required for a value to always be converted correctly (rounded to the closest) to a binary floating-point format (IEEE-754 binary32 floats for this discussion).

8388609.499 represented with 9 significant decimal digits is 8388609.50. This number converted to float has the value 8388610. On the other hand, the number represented with 10 or more digits always preserves the original value, and converted to float it has the value 8388609.

You can see that 8388609.499 needs more than 9 digits to be converted to float most accurately. There are infinitely many such numbers, located very close to the halfway point between two representable values in the binary float format.

相对绾红妆 2025-01-30 20:00:16

I think you are looking for the *_DECIMAL_DIG constants. The C standard provides a short explanation and a formula for how they are calculated (N2176 C17 draft):

5.2.4.2.2 Characteristics of floating types <float.h>
The values given in the following list shall be replaced by constant expressions with implementation-defined values that are greater or equal in magnitude (absolute value) to those shown, with the
same sign:
...
number of decimal digits, n, such that any floating-point number with p radix b digits can be rounded to a floating-point number with n decimal digits and back again without change to the value,
p log10 b        if b is a power of 10
⌈1 + p log10 b⌉  otherwise


FLT_DECIMAL_DIG  6
DBL_DECIMAL_DIG  10
LDBL_DECIMAL_DIG 10

With an IEEE-754 32-bit float, b = FLT_RADIX = 2 and p = FLT_MANT_DIG = 24, so the result is FLT_DECIMAL_DIG = ⌈1 + 24 log₁₀ 2⌉ = 9. (⌈x⌉ = ceil(x) is the ceiling function: round the result up.)

骄傲 2025-01-30 20:00:16

What about double or even arbitrary precision, is there a simple formula to derive the number of digits needed?

From C17 § 5.2.4.2.2, paragraph 11 (FLT_DECIMAL_DIG, DBL_DECIMAL_DIG, LDBL_DECIMAL_DIG):

number of decimal digits, n, such that any floating-point number with p radix b digits can be rounded to a floating-point number with n decimal digits and back again without change to the value,

p log₁₀ b: if b is a power of 10
⌈1 + p log₁₀ b⌉: otherwise

But I'm interested in the math behind it. How can one be sure that 9 digits is enough in this case?

Each binary range of a floating-point format, like [1.0 ... 2.0), [128.0 ... 256.0), [0.125 ... 0.25), contains 2^{p - 1} values uniformly distributed. For example, with float, p = 24.

Each decade of decimal text with n significant digits in exponential notation, like [1.0 ... 9.999...), [100.0 ... 999.999...), [0.001 ... 0.00999...), contains 9 × 10^{n - 1} values uniformly distributed.

Example, for the common float:
When p is 24 there are 2²⁴ = 16,777,216 combinations, so n must be at least 8 for decimal text to tell them all apart in a float to decimal text to float round trip. Because the end points of the decimal decades above generally fall well inside a binary range, the decimal values in the upper part of a decade can be spaced further apart than the float values there. This necessitates one more decimal digit.

Example:

Consider the 2 adjacent float values

10.000009_5367431640625
10.000010_49041748046875

Both convert to 8 significant digits decimal text "10.000010". 8 is not enough.

9 is always enough, as we do not need more than 167,772,160 to distinguish 16,777,216 float values.

OP also asks about 8388609.499. (Let us only consider float for simplicity.)

That value is nearly half-way between 2 float values.

8388609.0f  // Nearest lower float value
8388609.499 // OP's constant as code
8388610.0f  // Nearest upper float value

OP reports: "You can see 8388609.499 needs more than 9 digits to be most accurately converted to float."

And let us review the title "What is the minimum number of significant decimal digits in a floating point literal^*1 to represent the value as correct as possible?"

This new part of the question emphasizes that the value in question is the value of the source code 8388609.499 and not the floating point constant it becomes in emitted code: 8388609.0f.

If we consider the value to be the value of the floating point constant, no more than 9 significant decimal digits are needed to define the floating point constant 8388609.0f. 8388609.49 as source code is sufficient.

But getting the closest floating point constant for an arbitrary number written in code can indeed take many digits.

Consider the typical smallest float, FLT_TRUE_MIN, with the exact decimal value of:

0.00000000000000000000000000000000000000000000140129846432481707092372958328991613128026194187651577175706828388979108268586060148663818836212158203125

Halfway between that and 0.0 is 0.000..(~39 more zeroes)..0007006..(~100 more digits)..15625.

If that last digit were 6 or 4, the closest float would be FLT_TRUE_MIN or 0.0f respectively. So now we have a case where 105 significant digits are "needed" to select between 2 possible float values.

To keep us from going over the cliffs of insanity, IEEE-754 has already addressed this.

The number of significant decimal digits a translation (compiler) must examine to be compliant with that spec (not necessarily the C spec) is far more limited, even if the extra digits could translate to another FP value.

IIRC, it is in effect FLT_DECIMAL_DIG + 3. So for a common float, as few as 9 + 3 significant decimal digits may be examined.

[Edit]

Correct rounding is only guaranteed for the number of decimal digits required plus 3 for the largest supported binary format.

^*1 C does not define "floating point literal", but does define "floating point constant", so that term is used.

黑白记忆 2025-01-30 20:00:16

What is the minimum number of significant decimal digits in a floating point literal to represent the value as correct as possible?

There is no guarantee from the C standard that any number of decimal digits in a floating-point literal will produce the nearest value actually representable in the floating-point format. In discussing floating-point literals, C 2018 6.4.4.2 3 says:

… For decimal floating constants, … the result is either the nearest representable value, or the larger or smaller representable value immediately adjacent to the nearest representable value, chosen in an implementation-defined manner…

For quality, C implementations should correctly round floating-point literals to the nearest representable value, with ties going to the choice with the even low digit. In that case, the FLT_DECIMAL_DIG, DBL_DECIMAL_DIG, and LDBL_DECIMAL_DIG values defined in <float.h> provide numbers of digits that always suffice to uniquely identify a representable value.

How can one be sure that 9 digits is enough in this case?

You need statements to this effect in the compiler documentation, such as statements that it provides correct rounding for floating-point literals and that it uses IEEE-754 binary32 (a.k.a. “single precision”) for float (or some other format that would only require nine significant digits to uniquely identify all representable values).

What about double or even arbitrary precision, is there a simple formula to derive the number of digits needed?

The C standard indicates the constants above are calculated as p log₁₀ b if b is a power of ten and ceil(1 + p log₁₀ b) otherwise, where p is the number of digits in the floating-point format and b is the base used in the format. These always suffice, but the latter is not always necessary. The latter provides the number of digits needed if the exponent range were unbounded; its “1 +” covers all possible allowances for how the powers of b interact with the powers of 10, in a sense. But any floating-point format has a finite exponent range, and, for some choices of exponent range, ceil(p log₁₀ b) would suffice instead of ceil(1 + p log₁₀ b). There is no simple formula for this. It does not occur with the standard IEEE-754 formats and can be neglected in practice.