Which numbers can a floating-point variable store exactly?
As far as I know, there are numbers that floating-point variables can store exactly, such as 0.
So if I have the code:
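#include <stdio.h>

int main(void)
{
    float var = 0;
    printf("%f\n", var);  /* %f prints six digits after the decimal point by default */
}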
I would get as output:
0.000000
But I have heard that there are other numbers that can also be stored exactly in a floating-point variable.
How do I determine whether a given number can be stored exactly in a floating-point variable?
Comments (1)
Introduction
A finite number can be represented in the IEEE-754 “double precision” format if and only if it equals $M \cdot 2^e$ for some integers $M$ and $e$ such that $-2^{53} < M < +2^{53}$ and $-1074 \le e \le 971$.
For example, 3.125 can be represented because it equals $25 \cdot 2^{-3}$, and 25 is an integer in $(-2^{53}, +2^{53})$, and −3 is an integer in $[-1074, 971]$.
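To see the M and e from this rule for a concrete value, the following sketch reads them out of the bit pattern. It assumes that `double` is IEEE-754 binary64, which the C standard does not guarantee, and that the input is finite; `decompose` is a hypothetical helper name, not a library function.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: finds integers M and e with x == M * 2^e.
   Assumes double is IEEE-754 binary64 and x is finite. */
static void decompose(double x, int64_t *M, int *e)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);               /* reinterpret the 64 bits */

    int sign = (int)(bits >> 63);                 /* 1 sign bit */
    int biased = (int)((bits >> 52) & 0x7FF);     /* 11-bit exponent field */
    uint64_t frac = bits & ((UINT64_C(1) << 52) - 1); /* 52-bit fraction */

    uint64_t m;
    if (biased == 0) {          /* zero or subnormal: no implicit leading 1 */
        m = frac;
        *e = -1074;
    } else {                    /* normal: restore the implicit leading 1 */
        m = frac | (UINT64_C(1) << 52);
        *e = biased - 1075;     /* bias 1023 plus 52 fraction bits */
    }

    /* Shift out trailing zero bits so that M is odd (or zero). */
    while (m != 0 && (m & 1) == 0) {
        m >>= 1;
        ++*e;
    }
    if (m == 0)
        *e = 0;

    *M = sign ? -(int64_t)m : (int64_t)m;
}
```

Called with 3.125, this yields M = 25 and e = −3, matching the example. Because trailing zero bits are shifted out of M, e can come out above 971 for very large values; shifting zero bits back into M brings it into the stated range.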
Discussion
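Generally, a floating-point format represents finite numbers as $\pm d_0.d_{-1}d_{-2}d_{-3} \ldots d_{1-p} \cdot b^e$, where $b$ is the base (radix) of the format, $p$ is its precision (the number of digits in the significand), each digit $d_i$ is an integer with $0 \le d_i < b$, and the exponent $e$ is an integer with $e_{\min} \le e \le e_{\max}$.
Note the radix point “.” after $d_0$. This means the significand is in $[0, b)$ and has $p-1$ digits after the radix point. (“$[0, b)$” denotes a half-open interval that includes 0 but excludes $b$.) Sometimes floating-point formats are described with the radix point in a different position, either before the first digit, $.d_0d_{-1}d_{-2}d_{-3} \ldots d_{1-p}$, or after the last digit, $d_0d_{-1}d_{-2}d_{-3} \ldots d_{1-p}.$ These are equivalent mathematically:
$$\pm d_0.d_{-1}d_{-2} \ldots d_{1-p} \cdot b^{e} = \pm .d_0d_{-1}d_{-2} \ldots d_{1-p} \cdot b^{e+1} = \pm d_0d_{-1}d_{-2} \ldots d_{1-p}. \cdot b^{e+1-p}$$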
The exponent limits $e_{\min}$ and $e_{\max}$ would be adjusted to match the position used. With the radix point on the right, the significand is an integer, and this is convenient for using number theory in analysis and proofs about floating-point.
Some floating-point formats may require $d_0$ to be non-zero. This is rare now. Representations in which $d_0$ is non-zero are said to be in normal form, and representations in which $d_0$ is zero are said to be denormalized. A non-zero number that can be represented in the denormalized form of a format but is too small to be represented in its normal form is said to be subnormal.
For the IEEE-754 binary64 format, also called “double precision,” $b$ is 2, $p$ is 53, $e_{\min}$ is −1022, and $e_{\max}$ is 1023.
Using the integer-significand scaling, the exponent minimum and maximum are −1074 and 971. Then we can say a finite number can be represented in this format if and only if it equals $M \cdot 2^e$ for some integers $M$ and $e$ such that $-2^{53} < M < +2^{53}$ and $-1074 \le e \le 971$.
For single precision, the binary32 format, $-2^{24} < M < +2^{24}$ and $-149 \le e \le 104$.
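In practice, one way to test whether a value that is already held in a `double` also fits exactly in a `float` is a round trip through the narrower type. This is a minimal sketch, assuming IEEE-754 binary32 and binary64 for `float` and `double` and a finite input within the range of `float`; `fits_in_float` is a hypothetical name.

```c
#include <stdbool.h>

/* Hypothetical helper: true if x is unchanged by conversion to float and back.
   Assumes float is binary32, double is binary64, and x is finite and within
   the range of float. */
static bool fits_in_float(double x)
{
    volatile float f = (float)x;  /* volatile discourages keeping extra precision */
    return (double)f == x;
}
```

Under these assumptions, 16777216.0 (which is 2^24) passes, since it can be written with M = 1, while 16777217.0 (2^24 + 1) does not, since it would need an odd M of 2^24 + 1, outside the binary32 limit.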
These formats also have encodings that represent −∞, +∞, and special NaN (Not a Number) values.
Example of an Unrepresentable Number
3⅓ cannot be represented because $3\tfrac{1}{3} \cdot 2^p$ is not an integer for any integer $p$. If there were integers $M$ and $e$ such that $3\tfrac{1}{3} = M \cdot 2^e$, we could multiply each side by 3 to get $10 = 3 \cdot M \cdot 2^e$, and then $5 = 3 \cdot M \cdot 2^{e-1}$. If $e-1$ is negative, we multiply both sides by $2^{1-e}$ to have $5 \cdot 2^{1-e} = 3 \cdot M$. Then either $5 = 3 \cdot M \cdot 2^{e-1}$ or $5 \cdot 2^{1-e} = 3 \cdot M$ is an equation having only integers, but the right side has a factor of 3 and the left side does not, which contradicts the fundamental theorem of arithmetic.
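The same conclusion can be checked with exact integer arithmetic on the double nearest to 10/3. This is a sketch, assuming `double` is IEEE-754 binary64 with round-to-nearest; `ten_thirds_error` is a hypothetical name. The stored value lies in [2, 4), so it is an integer multiple of 2^-51, and scaling by 2^51 recovers an integer M without rounding.

```c
#include <stdint.h>

/* Hypothetical helper: returns 3*M - 10*2^51, where M * 2^-51 is the double
   nearest to 10/3. A result of zero would mean 10/3 is stored exactly.
   Assumes double is IEEE-754 binary64. */
static int64_t ten_thirds_error(void)
{
    double x = 10.0 / 3.0;               /* rounded to a double in [2, 4) */
    int64_t M = (int64_t)(x * 0x1p51);   /* exact: x is a multiple of 2^-51 */
    return 3 * M - 10 * (INT64_C(1) << 51);
}
```

The result is non-zero, as the proof requires: 3·M is a multiple of 3, while 10·2^51 is not.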