What is the difference between a single precision floating point operation and a double precision floating point operation?
I'm especially interested in practical terms in relation to video game consoles. For example, does the Nintendo 64 have a 64-bit processor, and if it does, would that mean it was capable of double precision floating point operations? Can the PS3 and Xbox 360 pull off double precision floating point operations or only single precision, and in general use are the double precision capabilities made use of (if they exist)?
Note: the Nintendo 64 does have a 64-bit processor, however:
From Webopedia:
The IEEE double-precision format actually has more than twice as many bits of precision as the single-precision format, as well as a much greater range.
From the IEEE standard for floating point arithmetic
Single Precision
The IEEE single precision floating point standard representation requires a 32 bit word, which may be represented as numbered from 0 to 31, left to right.
The first bit is the sign bit, S,
the next eight bits are the exponent bits, 'E', and
the final 23 bits are the fraction 'F':
S EEEEEEEE FFFFFFFFFFFFFFFFFFFFFFF
0 1      8 9                    31
The value V represented by the word may be determined as follows:
If 0<E<255, then V=(-1)**S * 2 ** (E-127) * (1.F), where "1.F" is intended to represent the binary number created by prefixing F with an implicit leading 1 and a binary point.
If E=0 and F is nonzero, then V=(-1)**S * 2 ** (-126) * (0.F). These are "unnormalized" values.
In particular,
Double Precision
The IEEE double precision floating point standard representation requires a 64 bit word, which may be represented as numbered from 0 to 63, left to right.
The first bit is the sign bit, S,
the next eleven bits are the exponent bits, 'E', and
the final 52 bits are the fraction 'F':
S EEEEEEEEEEE FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
0 1        11 12                                                63
The value V represented by the word may be determined as follows:
If 0<E<2047, then V=(-1)**S * 2 ** (E-1023) * (1.F), where "1.F" is intended to represent the binary number created by prefixing F with an implicit leading 1 and a binary point.
If E=0 and F is nonzero, then V=(-1)**S * 2 ** (-1022) * (0.F). These are "unnormalized" values.
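The rules quoted above can be checked with a short Python sketch. The function names `decode`, `bits_of_single` and `bits_of_double` are made up for this illustration. The handling of E at its maximum (infinities and NaN) and of zero is not in the quoted text; it is added here on the assumption of standard IEEE 754 behaviour.

```python
import struct

def decode(bits: int, exp_bits: int, frac_bits: int) -> float:
    """Decode a raw bit pattern using the S / E / F rules quoted above."""
    s = bits >> (exp_bits + frac_bits)                  # sign bit
    e = (bits >> frac_bits) & ((1 << exp_bits) - 1)     # biased exponent
    f = bits & ((1 << frac_bits) - 1)                   # stored fraction
    bias = 2 ** (exp_bits - 1) - 1                      # 127 or 1023
    sign = -1.0 if s else 1.0
    if e == (1 << exp_bits) - 1:
        # Assumption: standard IEEE 754 special values (not in the quoted text).
        return float("nan") if f else sign * float("inf")
    if e == 0:
        # "Unnormalized" values (and zero): 0.F with a fixed exponent.
        return sign * 2.0 ** (1 - bias) * (f / 2 ** frac_bits)
    # Normal values: implicit leading 1, i.e. 1.F
    return sign * 2.0 ** (e - bias) * (1 + f / 2 ** frac_bits)

def bits_of_single(x: float) -> int:
    return struct.unpack(">I", struct.pack(">f", x))[0]

def bits_of_double(x: float) -> int:
    return struct.unpack(">Q", struct.pack(">d", x))[0]

if __name__ == "__main__":
    print(decode(bits_of_single(0.15625), 8, 23))
    print(decode(bits_of_double(-1234.5678), 11, 52))
```

Passing `(8, 23)` selects the single precision layout and `(11, 52)` the double precision one; the bias works out to 127 and 1023 respectively, matching the formulas.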
Reference:
ANSI/IEEE Standard 754-1985,
Standard for Binary Floating Point Arithmetic.
From the cs.uaf.edu notes on the IEEE Floating Point Standard: the "fraction" is generally referred to as the mantissa.
I read a lot of answers but none seems to correctly explain where the word double comes from. I remember a very good explanation given by a University professor I had some years ago.
Recalling the style of VonC's answer, a single precision floating point representation uses a 32-bit word.
Representation:
(Just to point out, the sign bit is the last, not the first.)
A double precision floating point representation uses a 64-bit word.
Representation:
As you may notice, I wrote that the mantissa has, in both types, one more bit of information than its representation stores. In fact, the mantissa is a number represented without its non-significant zeros. This means that the mantissa will always be in the form
$0.\alpha_1\alpha_2\ldots\alpha_t \times \beta^p$
where $\beta$ is the base of representation. But since the fraction is a binary number, $\alpha_1$ will always be equal to 1, so the fraction can be rewritten as $1.\alpha_2\alpha_3\ldots\alpha_{t+1} \times 2^p$ and the initial 1 can be implicitly assumed, making room for an extra bit ($\alpha_{t+1}$).
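As a rough illustration of the normalized form described above, Python's `math.frexp` splits a float into a mantissa m with 0.5 <= |m| < 1 and a power of two, which corresponds to the 0.α1α2... form with base 2. The helper names `normalized_form` and `first_binary_digit` are made up for this sketch.

```python
import math

def normalized_form(x: float) -> tuple[float, int]:
    """Return (m, p) with x == m * 2**p and 0.5 <= |m| < 1 for nonzero x."""
    return math.frexp(x)

def first_binary_digit(m: float) -> int:
    """First digit after the binary point of |m|; always 1 for a normalized m."""
    return int(abs(m) * 2)

if __name__ == "__main__":
    for x in (6.5, 0.000124, 237.141, -3.0):
        m, p = normalized_form(x)
        print(f"{x} = {m} * 2**{p}, first digit = {first_binary_digit(m)}")
```

Because that first digit is always 1 for a nonzero normalized binary number, it does not need to be stored, which is where the extra bit of precision comes from.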
Now, it is obviously true that 64 is double 32, but that's not where the word comes from.
The precision indicates the number of decimal digits that are correct, i.e. without any kind of representation error or approximation. In other words, it indicates how many decimal digits one can safely use.
With that said, it's easy to estimate the number of decimal digits which can be safely used:
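A minimal sketch of that estimate, assuming 24 significand bits for single precision and 53 for double precision (the 23 and 52 stored bits plus the implicit leading 1 discussed above). The function name `decimal_digits` is made up for this illustration.

```python
import math

def decimal_digits(significand_bits: int) -> float:
    """Approximate number of decimal digits carried by a binary significand."""
    # log10(2**n) == n * log10(2)
    return significand_bits * math.log10(2)

if __name__ == "__main__":
    print(f"single: {decimal_digits(24):.2f} decimal digits")
    print(f"double: {decimal_digits(53):.2f} decimal digits")
```

The two results fall between 7 and 8 digits for single precision and between 15 and 16 for double precision, so the second is roughly double the first.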
Okay, the basic difference at the machine level is that double precision uses twice as many bits as single. In the usual implementation, that's 32 bits for single and 64 bits for double.
But what does that mean? If we assume the IEEE standard, then a single precision number has 23 bits of mantissa and a maximum decimal exponent of about 38 (values up to roughly 10^38); a double precision number has 52 bits of mantissa and a maximum decimal exponent of about 308.
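Those exponent figures can be reproduced from the field widths. The sketch below assumes the IEEE layouts with 8 exponent bits and 23 fraction bits for single precision, and 11 and 52 for double precision; the function name `max_finite` is made up for this illustration.

```python
import math
import sys

def max_finite(exp_bits: int, frac_bits: int) -> float:
    """Largest finite value: all fraction bits set, largest non-special exponent."""
    bias = 2 ** (exp_bits - 1) - 1
    return (2 - 2.0 ** -frac_bits) * 2.0 ** bias

if __name__ == "__main__":
    single_max = max_finite(8, 23)
    double_max = max_finite(11, 52)
    print(single_max, math.floor(math.log10(single_max)))
    print(double_max, math.floor(math.log10(double_max)))
    # Python's float is a double on common platforms, so this should agree.
    print(double_max == sys.float_info.max)
```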
The details are at Wikipedia, as usual.
To add to all the wonderful answers here:
First of all, float and double are both used to represent fractional numbers. So the difference between the two stems from how much precision they can store those numbers with.
So, basically, we want to know how accurately a number can be stored, and that is what we call precision.
Quoting @Alessandro here:
A float can accurately store about 7-8 significant decimal digits, while
a double can accurately store about 15-16 significant decimal digits.
So a double can store roughly double the number of digits that a float can. That is why double is called double the float.
Everyone has explained this in great detail and there is nothing further I could add. Still, I would like to explain it in layman's terms, in plain English.
.....
A variable able to store or represent "1.9" provides less precision than one able to hold or represent 1.9999. These fractions can amount to a huge difference in large calculations.
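To make the point about large calculations concrete, here is a rough sketch that adds 0.1 a million times, once rounding to single precision after every step and once in double precision. Python has no built-in single precision type, so the helper `to_float32` (a made-up name) emulates it by round-tripping through `struct`; that emulation is an assumption of this example, not something from the answers above.

```python
import struct

def to_float32(x: float) -> float:
    """Round a Python float (a double) to the nearest single precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def repeated_sum(step: float, count: int, single: bool) -> float:
    """Add `step` to a running total `count` times, optionally in single precision."""
    if single:
        step = to_float32(step)
    total = 0.0
    for _ in range(count):
        total += step
        if single:
            total = to_float32(total)   # keep the running total in 32 bits
    return total

if __name__ == "__main__":
    print("single:", repeated_sum(0.1, 1_000_000, single=True))
    print("double:", repeated_sum(0.1, 1_000_000, single=False))
```

The exact total is 100000; the single precision run drifts visibly away from it, while the double precision run stays very close.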
Basically, single precision floating point arithmetic deals with 32-bit floating point numbers, whereas double precision deals with 64-bit ones.
The extra bits in double precision increase the maximum value that can be stored and also increase the precision (i.e. the number of significant digits).
As to the question "Can the PS3 and Xbox 360 pull off double precision floating point operations or only single precision, and in general use are the double precision capabilities made use of (if they exist)?"
I believe that both platforms are incapable of double precision floating point. The original Cell processor only had 32-bit floats, same as the ATI hardware the Xbox 360 is based on (R600). The Cell got double precision floating point support later on, but I'm pretty sure the PS3 doesn't use that version of the chip.
Double precision means the numbers take twice the word length to store. On a 32-bit processor, the words are all 32 bits, so doubles are 64 bits. What this means in terms of performance is that operations on double precision numbers take a little longer to execute. So you get a better range, but there is a small hit on performance. This hit is mitigated a little by hardware floating point units, but it's still there.
The N64 used a MIPS R4300i-based NEC VR4300, which is a 64-bit processor, but the processor communicates with the rest of the system over a 32-bit wide bus. So most developers used 32-bit numbers because they are faster, and most games at the time did not need the additional precision (so they used floats, not doubles).
All three systems can do single and double precision floating point operations, but they might not because of performance. (Although pretty much everything after the N64 used a 32-bit bus, so...)
First of all, float and double are both used to represent fractional numbers. So the difference between the two stems from how much precision they can store those numbers with.
For example: say I have to store 123.456789. One type may be able to store only 123.4567, while the other may be able to store the exact 123.456789.
So, basically, we want to know how accurately a number can be stored, and that is what we call precision.
Quoting @Alessandro here:
The precision indicates the number of decimal digits that are correct, i.e. without any kind of representation error or approximation. In other words, it indicates how many decimal digits one can safely use.
A float can accurately store about 7-8 significant decimal digits, while a double can accurately store about 15-16 significant decimal digits.
So a double can store roughly double the number of digits that a float can. That is why double is called double the float.
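The 123.456789 example can be tried directly. This is a small sketch; since Python floats are doubles, the single precision value is emulated by round-tripping through `struct`, and the helper name `to_float32` is made up for this illustration.

```python
import struct

def to_float32(x: float) -> float:
    """Round a Python float (a double) to the nearest single precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

if __name__ == "__main__":
    value = 123.456789
    as_single = to_float32(value)
    print(f"double: {value:.10f}")
    print(f"single: {as_single:.10f}")
    print(f"difference: {abs(value - as_single):.3e}")
```

The single precision copy agrees with the original only for the first several significant digits, which is the 7-8 digit limit in practice.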
According to IEEE 754:
• Standard for floating point storage
• 32- and 64-bit formats (single precision and double precision)
• 8- and 11-bit exponents, respectively
• Extended formats (both mantissa and exponent) for intermediate results
A single precision number uses 32 bits, with the MSB being the sign bit, whereas a double precision number uses 64 bits, again with the MSB being the sign bit.
Single precision:
SEEEEEEEEFFFFFFFFFFFFFFFFFFFFFFF (SIGN + EXPONENT + SIGNIFICAND)
Double precision:
SEEEEEEEEEEEFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF (SIGN + EXPONENT + SIGNIFICAND)
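To see these layouts for an actual number, the bit pattern can be split into the three fields. This is an illustrative sketch; the function names `layout_single` and `layout_double` are made up here, and big-endian packing is used only so that the sign bit comes out first in the string.

```python
import struct

def layout_single(x: float) -> str:
    """Return 'S EEEEEEEE F...F' (1 + 8 + 23 bits) for x as single precision."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    b = f"{bits:032b}"
    return f"{b[0]} {b[1:9]} {b[9:]}"

def layout_double(x: float) -> str:
    """Return 'S EEEEEEEEEEE F...F' (1 + 11 + 52 bits) for x as double precision."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    b = f"{bits:064b}"
    return f"{b[0]} {b[1:12]} {b[12:]}"

if __name__ == "__main__":
    print(layout_single(0.15625))
    print(layout_single(-0.15625))
    print(layout_double(0.15625))
```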