浮点变量存储的哪个数字可以准确?

发布于 2025-02-10 02:23:20 字数 241 浏览 1 评论 0原文

就我的知识而言,浮点变量可以完全存储(例如0)。

因此,如果我有代码:

{

float var = 0;

printf(“%f”,var);

}

我会作为输出来:

0.00000000000000000

但我听说还有其他数字也可以使用浮点变量精确存储。

如何使用浮点变量确定是否可以精确存储一个变量?

As far as my knowledge goes there are numbers which floating point variables can store exactly such as 0.

So if I have the code:

{

float var=0;

printf("%f", var);

}

I would get as output:

0.0000000000000000000

But I have heard that there are other numbers that also can be stored exactly using floating point variable.

How do I determine if a variable can be stored exactly using floating point variable?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

时常饿 2025-02-17 02:23:20

简介

在IEEE-754“双重精度”格式中可以表示有限数,并且仅当它等于 m •2 e 整数 m e ,因此-2 53 < m < +2 53 和-1074≤em>e ≤971。

例如,可以表示3.125,因为它等于25•2 -3 ,而25是(-2 53 ,+2 53 )中的整数,而-3是[-1074,971]中的整数。

讨论

通常,浮点格式表示有限数字为± d 0 。 d −2 d -3 d 1- p < /em> b e 其中:

  • ±表示 +或 - ,
  • b 是一个为格式固定的正整数基础,通常为2或10,
  • p ,称为精度,是 base- b b base的数量。 /em>,
  • d 0 d -1 d - 2 d -3 d 1- p (示例:1.2345)是浮点数表示的重要意义,每个 d i 是一个具有0≤ d i &lt; b
  • e 是一个整数,具有 e min e e e max ,其中 e min e max 是格式的固定限制。

注意radix点“。” d 0 之后。这意味着显着性在[0, b )中,并且在radix点后具有 p -1位。 (“ [0, b )”表示包含0但不包括 b 的半开间隔。 ,要么在第一个数字之前,。 sub> −2 d -3 d 1- p ,或在最后一个数字之后, d 0 d -1 d < sub> −2 d -3 d 1- p 。这些在数学上是等效的:

  • d 0 。 -2 d -3 d 1- p •• b e =。 d 0 d < sub> -1 d -2 d -3 d 1- p b e +1 = d < /em> 0 d -1 d -2 d < /em> −3 boundem> d< sub> 1-1p> p b e +1- p

指数限制 e min e max 将被调整以匹配所使用的位置。在右侧的radix点上,显着性是一个整数,这对于在分析和有关浮点数的证明中使用数字理论很方便。

某些浮点格式可能需要 d 0 为非零。现在这很少见。 d 0 的表示为正常形式,以及其中 d 的表示 0 零被认为是否定化。一个非零数,可以以一种格式的符合形式表示,但太小而无法以正常格式表示为 subsormoral formaloral

对于IEEE-754 Binary64格式,也称为“双精度”, b 是2, p IS 53, e e min < /sub>是-1022, e max 是1023。

使用整数尺度缩放,指数的最小值和最大值为-1074和971。然后,我们可以说一个当且仅当其等于某些整数M m 时,才能以这种格式表示有限的数字。和 e ,以便-2 53 &lt; m &lt; +2 53 和-1074≤em>e ≤971。

对于单精度,binary32格式,-2 24 &lt; m &lt; +2 24 和-149≤em>e ≤104。

这些格式还具有代表-∞, +∞和特殊NAN和特殊NAN(不是数字)值的编码。

不可能的数字

由于3⅓•2 p 不是任何整数 p ,因此无法表示 3⅓。如果有整数 m e ,以便3⅓= m •2 e ,我们可以将每一侧乘以3 = 3• m •2 e ,然后5 = 3• m •2 e -1 。如果 e -1为负,则我们将两面都乘以2 1- e 具有5•2 1- E = 3• m 。然后是5 = 3• m •2 e -1 或5•2 1- e = 3• m 是一个具有整数的方程,但是右侧的倍数为3,左侧不具有 arithmetic的基本定理

Introduction

A finite number can be represented in the IEEE-754 “double precision” format if and only if it equals M•2e for some integers M and e such that −253 < M < +253 and −1074 ≤ e ≤ 971.

For example, 3.125 can be represented because it equals 25•2−3, and 25 is an integer in (−253, +253), and −3 is an integer in [−1074, 971].

Discussion

Generally, a floating-point format represents finite numbers as ±d0.d−1d−2d−3d1−pbe where:

  • ± means either + or −,
  • b is a positive integer base that is fixed for the format, usually 2 or 10,
  • p, called the precision, is the number of base-b digits in the significand,
  • d0.d−1d−2d−3d1−p (example: 1.2345) is the significand of the floating-point representation, and each di is an integer with 0 ≤ di < b, and
  • e is an integer with emineemax, where emin and emax are fixed limits for the format.

Note the radix point “.” after d0. This means the significand is in [0, b) and has p−1 digits after the radix point. (“[0, b)” denotes a half-open interval that includes 0 but excludes b.) Sometimes floating-point formats are described with the radix point in different positions, either before the first digit, .d0d−1d−2d−3d1−p, or after the last digit, d0d−1d−2d−3d1−p. These are equivalent mathematically:

  • d0.d−1d−2d−3d1−pbe = .d0d−1d−2d−3d1−pbe+1 = d0d−1d−2d−3d1−p.•be+1−p.

The exponent limits emin and emax would be adjusted to match the position used. With the radix-point on the right, the significand is an integer, and this is convenient for using number theory in analysis and proofs about floating-point.

Some floating-point formats may require d0 to be non-zero. This is rare now. Representations in which d0 is non-zero are said to be in normal form, and representations in which d0 is zero are said be denormalized. A non-zero number that can be represented in the denormalized form of a format but is too small to be represented in its normal format is said to be subnormal.

For the IEEE-754 binary64 format, also called “double precision,” b is 2, p is 53, emin is −1022, and emax is 1023.

Using the integer-significand scaling, the exponent minimum and maximum are −1074 and 971. Then we can say a finite number can be represented in this format if and only if it equals M•2e for some integers M and e such that −253 < M < +253 and −1074 ≤ e ≤ 971.

For single precision, the binary32 format, −224 < M < +224 and −149 ≤ e ≤ 104.

These formats also have encodings that represent −∞, +∞, and special NaN (Not a Number) values.

Example of an Unrepresentable Number

3⅓ cannot be represented because 3⅓•2p is not an integer for any integer p. If there were integers M and e such that 3⅓ = M•2e, we could multiply each side by 3 to get 10 = 3•M•2e, and then 5 = 3•M•2e−1. If e−1 is negative, we multiply both sides by 21−e to have 5•21−e = 3•M. Then either 5 = 3•M•2e−1 or 5•21−e = 3•M is an equation having only integers, but the right side has a factor of 3 and the left side does not, which contradicts the fundamental theorem of arithmetic.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文