What is the difference between single-precision and double-precision floating-point operations?

Posted on 2024-07-19 08:01:43

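What is the difference between a single-precision floating-point operation and a double-precision floating-point operation?

I'm especially interested in practical terms in relation to video game consoles. For example, does the Nintendo 64 have a 64-bit processor, and if it does, would that mean it was capable of double-precision floating-point operations? Can the PS3 and Xbox 360 perform double-precision floating-point operations or only single precision, and in general use are the double-precision capabilities actually used (if they exist)?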

梦幻之岛 2024-07-26 08:01:43

Note: the Nintendo 64 does have a 64-bit processor, however:

Many games took advantage of the chip's 32-bit processing mode as the greater data precision available with 64-bit data types is not typically required by 3D games, as well as the fact that processing 64-bit data uses twice as much RAM, cache, and bandwidth, thereby reducing the overall system performance.
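As a rough illustration of the storage cost mentioned in the quote, the sketch below packs the same values as 32-bit and 64-bit floats with Python's `struct` module. The sample values are made up for illustration, and this says nothing about how N64 games actually laid out their data.

```python
import struct

# Hypothetical sample data: three coordinates of one vertex.
values = [1.5, -2.25, 1024.0]

# "<f" is a 4-byte IEEE single, "<d" an 8-byte IEEE double (standard sizes).
as_single = struct.pack(f"<{len(values)}f", *values)
as_double = struct.pack(f"<{len(values)}d", *values)

print(len(as_single), len(as_double))  # 12 24
```

The same data takes twice the bytes as doubles, which is where the extra RAM, cache, and bandwidth pressure comes from.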

From Webopedia:

The term double precision is something of a misnomer because the precision is not really double.
The word double derives from the fact that a double-precision number uses twice as many bits as a regular floating-point number.
For example, if a single-precision number requires 32 bits, its double-precision counterpart will be 64 bits long.

The extra bits increase not only the precision but also the range of magnitudes that can be represented.
The exact amount by which the precision and range of magnitudes are increased depends on what format the program is using to represent floating-point values.
Most computers use a standard format known as the IEEE floating-point format.

The IEEE double-precision format actually has more than twice as many bits of precision as the single-precision format, as well as a much greater range.

From the IEEE standard for floating point arithmetic

Single Precision

The IEEE single precision floating point standard representation requires a 32 bit word, which may be represented as numbered from 0 to 31, left to right.

The first bit is the sign bit, S,
the next eight bits are the exponent bits, 'E', and
the final 23 bits are the fraction 'F':

S EEEEEEEE FFFFFFFFFFFFFFFFFFFFFFF
0 1      8 9                    31

The value V represented by the word may be determined as follows:

If E=255 and F is nonzero, then V=NaN ("Not a number")
If E=255 and F is zero and S is 1, then V=-Infinity
If E=255 and F is zero and S is 0, then V=Infinity
If 0<E<255 then V=(-1)**S * 2 ** (E-127) * (1.F) where "1.F" is
intended to represent the binary number created by prefixing F with an
implicit leading 1 and a binary point.
If E=0 and F is nonzero, then V=(-1)**S * 2 ** (-126) * (0.F). These
are "denormalized" (subnormal) values.
If E=0 and F is zero and S is 1, then V=-0
If E=0 and F is zero and S is 0, then V=0

In particular,

0 00000000 00000000000000000000000 = 0
1 00000000 00000000000000000000000 = -0

0 11111111 00000000000000000000000 = Infinity
1 11111111 00000000000000000000000 = -Infinity

0 11111111 00000100000000000000000 = NaN
1 11111111 00100010001001010101010 = NaN

0 10000000 00000000000000000000000 = +1 * 2**(128-127) * 1.0 = 2
0 10000001 10100000000000000000000 = +1 * 2**(129-127) * 1.101 = 6.5
1 10000001 10100000000000000000000 = -1 * 2**(129-127) * 1.101 = -6.5

0 00000001 00000000000000000000000 = +1 * 2**(1-127) * 1.0 = 2**(-126)
0 00000000 10000000000000000000000 = +1 * 2**(-126) * 0.1 = 2**(-127) 
0 00000000 00000000000000000000001 = +1 * 2**(-126) * 
                                     0.00000000000000000000001 = 
                                     2**(-149)  (Smallest positive value)
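The single-precision rules above can be turned into a small decoder. This is a minimal sketch in Python, not part of the quoted standard; the names `decode_single` and `bits` are invented here for illustration.

```python
def decode_single(word):
    """Decode a 32-bit IEEE single-precision bit pattern given as an int."""
    s = (word >> 31) & 0x1     # sign bit S
    e = (word >> 23) & 0xFF    # 8-bit biased exponent E
    f = word & 0x7FFFFF        # 23-bit fraction F
    sign = -1.0 if s else 1.0
    if e == 255:
        # F nonzero -> NaN, F zero -> signed infinity
        return float("nan") if f else sign * float("inf")
    if e == 0:
        # zero or denormalized: no implicit leading 1, exponent fixed at -126
        return sign * (f / 2**23) * 2.0 ** -126
    # normalized: implicit leading 1 in front of the fraction
    return sign * (1 + f / 2**23) * 2.0 ** (e - 127)


def bits(text):
    """Parse a spaced bit string such as '0 10000001 1010...' into an int."""
    return int(text.replace(" ", ""), 2)


print(decode_single(bits("0 10000001 101" + "0" * 20)))   # 6.5
print(decode_single(bits("1 00000000 " + "0" * 23)))      # -0.0
print(decode_single(bits("0 00000000 " + "0" * 22 + "1")) == 2.0 ** -149)  # True
```

The results are exact because Python floats are (on virtually all platforms) IEEE doubles, which can hold every single-precision value.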

Double Precision

The IEEE double precision floating point standard representation requires a 64 bit word, which may be represented as numbered from 0 to 63, left to right.

The first bit is the sign bit, S,
the next eleven bits are the exponent bits, 'E', and
the final 52 bits are the fraction 'F':

S EEEEEEEEEEE FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
0 1        11 12                                                63

The value V represented by the word may be determined as follows:

If E=2047 and F is nonzero, then V=NaN ("Not a number")
If E=2047 and F is zero and S is 1, then V=-Infinity
If E=2047 and F is zero and S is 0, then V=Infinity
If 0<E<2047 then V=(-1)**S * 2 ** (E-1023) * (1.F) where "1.F" is
intended to represent the binary number created by prefixing F with an
implicit leading 1 and a binary point.
If E=0 and F is nonzero, then V=(-1)**S * 2 ** (-1022) * (0.F). These
are "denormalized" (subnormal) values.
If E=0 and F is zero and S is 1, then V=-0
If E=0 and F is zero and S is 0, then V=0
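To see the double-precision fields in practice, the sketch below splits a Python float into S, E, and F and rebuilds it with the normalized-case formula. It assumes Python's `float` is an IEEE double, which holds on essentially every current platform; `double_fields` and `rebuild_normalized` are illustrative names, not library functions.

```python
import struct


def double_fields(x):
    """Split a float (assumed IEEE double) into its (S, E, F) bit fields."""
    (word,) = struct.unpack(">Q", struct.pack(">d", x))
    s = word >> 63                 # 1 sign bit
    e = (word >> 52) & 0x7FF       # 11 exponent bits, biased by 1023
    f = word & ((1 << 52) - 1)     # 52 fraction bits
    return s, e, f


def rebuild_normalized(s, e, f):
    """Apply V = (-1)**S * 2**(E-1023) * (1.F); valid only for 0 < E < 2047."""
    return (-1.0) ** s * 2.0 ** (e - 1023) * (1 + f / 2**52)


s, e, f = double_fields(6.5)
print(s, e, hex(f))                      # 0 1025 0xa000000000000
print(rebuild_normalized(s, e, f))       # 6.5
```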

Reference:
ANSI/IEEE Standard 754-1985,
Standard for Binary Floating Point Arithmetic.

From the cs.uaf.edu notes on the IEEE Floating Point Standard, where the "fraction" is generally referred to as the mantissa:

The single precision IEEE FPS format is composed of 32 bits, divided into a 23 bit mantissa, M, an 8 bit exponent, E, and a sign bit, S:

The normalized mantissa, m, is stored in bits 0-22 with the hidden
bit, b₀, omitted.
Thus M = m-1.

The exponent, e, is represented as a bias-127 integer in bits 23-30.
Thus, E = e+127.

The sign bit, S, indicates the sign of the mantissa, with S=0 for positive values and S=1 for negative values.

Zero is represented by E = M = 0.
Since S may be 0 or 1, there are different representations for +0 and -0.
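A short sketch of the S / E / M layout described in these notes, using Python's `struct` to obtain the single-precision bit pattern. The helper name `single_bits` is made up for this example.

```python
import struct


def single_bits(x):
    """Return the IEEE single bit pattern of x as 'S EEEEEEEE MMM...M'."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    b = format(word, "032b")
    return f"{b[0]} {b[1:9]} {b[9:]}"


print(single_bits(1.0))    # e = 0, so E = 0 + 127 = 01111111, and M = m - 1 = 0
print(single_bits(0.0))    # E = M = 0, S = 0
print(single_bits(-0.0))   # E = M = 0, S = 1
print(0.0 == -0.0)         # True: two encodings, but they compare equal
```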

堇年纸鸢 2024-07-26 08:01:43

I read a lot of answers but none seems to correctly explain where the word double comes from. I remember a very good explanation given by a University professor I had some years ago.

Recalling the style of VonC's answer, a single-precision floating-point representation uses a 32-bit word.

1 bit for the sign, S
8 bits for the exponent, 'E'
24 bits for the fraction, also called mantissa, or coefficient (even though just 23 are represented). Let's call it 'M' (for mantissa, I prefer this name as "fraction" can be misunderstood).

Representation:

          S  EEEEEEEE   MMMMMMMMMMMMMMMMMMMMMMM
bits:    31 30      23 22                     0

(Just to point out: with this numbering the sign bit is the last bit, bit 31, not the first.)

A double-precision floating-point representation uses a 64-bit word.

1 bit for the sign, S
11 bits for the exponent, 'E'
53 bits for the fraction / mantissa / coefficient (even though only 52 are represented), 'M'

Representation:

           S  EEEEEEEEEEE   MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
bits:     63 62         52 51                                                  0

As you may notice, I wrote that the mantissa has, in both types, one more bit of information than is actually stored. In fact, the mantissa is a number written without its non-significant zeros. For example,

0.000124 becomes 0.124 × 10⁻³
237.141 becomes 0.237141 × 10³

This means that the mantissa will always be in the form

0.α₁α₂...α_t × β^p

where β is the base of the representation. Since the representation is binary (β = 2), α₁ is always 1, so the number can be rewritten as 1.α₂α₃...α_{t+1} × 2^(p-1), and the leading 1 can be assumed implicitly, making room for an extra bit (α_{t+1}).
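Python's `math.frexp` happens to return exactly this kind of normalized binary form, so it can illustrate the idea. The sketch below is only an illustration; the function name `normalized_forms` is invented here.

```python
import math


def normalized_forms(x):
    """Return x as (m, p) with 0.5 <= m < 1, and as (2m, p-1) with 1 <= 2m < 2."""
    m, p = math.frexp(x)          # x == m * 2**p; in binary m looks like 0.1xxx...
    return (m, p), (2 * m, p - 1)


zero_point, one_point = normalized_forms(237.141)
print(zero_point)   # mantissa in [0.5, 1): first binary digit after the point is 1
print(one_point)    # same value with the leading 1 moved in front of the point
```

Because that leading digit is always 1 for a normalized binary number, it does not need to be stored, which is why 23 stored bits give 24 bits of precision and 52 stored bits give 53.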

Now, it's obviously true that twice 32 is 64, but that's not where the word comes from.

The precision indicates the number of decimal digits that are correct, i.e. without any kind of representation error or approximation. In other words, it indicates how many decimal digits one can safely use.

With that said, it's easy to estimate the number of decimal digits which can be safely used:

single precision: log₁₀(2²⁴) ≈ 7.22, which is about 7 decimal digits
double precision: log₁₀(2⁵³) ≈ 15.95, which is about 15~16 decimal digits
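These estimates are easy to check. The sketch below computes the two logarithms and shows the first integer that each format cannot hold exactly. Rounding to single precision is emulated with `struct`, and `to_single` is a helper name made up for this example.

```python
import math
import struct


def to_single(x):
    """Round a Python float to the nearest IEEE single and convert it back."""
    return struct.unpack("<f", struct.pack("<f", float(x)))[0]


print(round(math.log10(2**24), 2))   # 7.22
print(round(math.log10(2**53), 2))   # 15.95

# 2**24 + 1 needs 25 significant bits, so a single rounds it to 2**24.
print(to_single(2**24 + 1) == 2**24)     # True
# 2**53 + 1 needs 54 significant bits, so a double rounds it to 2**53.
print(float(2**53 + 1) == float(2**53))  # True
```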

昵称有卵用 2024-07-26 08:01:43

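Okay, the basic difference at the machine level is that double precision uses twice as many bits as single precision. In the usual implementation, that's 32 bits for single and 64 bits for double.

But what does that mean? If we assume the IEEE standard, a single-precision number has 23 bits of mantissa and a maximum decimal exponent of about 38; a double-precision number has 52 bits of mantissa and a maximum decimal exponent of about 308.

The details are at Wikipedia, as usual.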
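The exponent limits can be checked from Python, assuming its `float` is an IEEE double (true on essentially every current platform). `sys.float_info` reports the double-precision limits directly. The single-precision maximum is computed here from the format itself, since Python has no built-in single type.

```python
import sys

# Largest finite single: all fraction bits set (1.111...1 in binary) times 2**127.
single_max = (2 - 2.0 ** -23) * 2.0 ** 127
double_max = sys.float_info.max

print(single_max)                  # on the order of 1e38
print(double_max)                  # on the order of 1e308
print(sys.float_info.mant_dig)     # 53 bits of precision (52 stored + 1 implicit)
print(sys.float_info.max_10_exp)   # 308
```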

忘你却要生生世世 2024-07-26 08:01:43

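To add to all the wonderful answers here:

First of all, float and double are both used to represent fractional numbers. The difference between the two comes down to how much precision they can store.

For example, suppose I have to store 123.456789. One type may be able to store only 123.4567, while the other may be able to store the full 123.456789.

So basically we want to know how accurately a number can be stored, and that is what we call precision.

Quoting @Alessandro here:

The precision indicates the number of decimal digits that are correct,
i.e. without any kind of representation error or approximation. In
other words, it indicates how many decimal digits one can safely use.

A float can accurately store about 7-8 significant digits, while
a double can accurately store about 15-16 significant digits.

So a double can store roughly double the number of digits that a float can. That is why double is called double: double the float.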
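The 123.456789 example can be tried directly. This sketch emulates a 32-bit float by round-tripping through `struct`, with `to_single` as a made-up helper name. The exact digits that survive depend on the value, so treat the "7-8 digits" figure as a rule of thumb.

```python
import struct


def to_single(x):
    """Round a Python float to the nearest IEEE single and convert it back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


value = 123.456789
print(f"{value:.6f}")             # the double keeps all nine digits
print(f"{to_single(value):.6f}")  # the single drifts after about seven digits
```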

娜些时光，永不杰束 2024-07-26 08:01:43

Everything has been explained in great detail and there is nothing I could add. Still, I would like to explain it in layman's terms, or plain English:

1.9 is less precise than 1.99
1.99 is less precise than 1.999
1.999 is less precise than 1.9999

.....

A variable able to store or represent "1.9" provides less precision than one able to hold or represent 1.9999. These fractions can add up to a huge difference in large calculations.
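To get a feel for how the lost digits add up, the sketch below sums 0.1 one hundred thousand times, once keeping only single precision after every addition and once in double precision. Single precision is emulated with `struct`, and the helper names `to_single` and `accumulate` are invented for this example.

```python
import struct


def to_single(x):
    """Round a Python float to the nearest IEEE single and convert it back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def accumulate(step, count, single):
    """Add `step` to a running total `count` times."""
    total = 0.0
    for _ in range(count):
        total += step
        if single:
            total = to_single(total)  # keep only single precision after each add
    return total


single_sum = accumulate(to_single(0.1), 100_000, single=True)
double_sum = accumulate(0.1, 100_000, single=False)

# Distance from the mathematically exact result, 10000.
print(abs(single_sum - 10_000.0), abs(double_sum - 10_000.0))
```

Both results miss 10000 because 0.1 has no exact binary representation, but the single-precision total drifts by many orders of magnitude more.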
