What is the difference between float and double?
I've read about the difference between double precision and single precision. However, in most cases, float and double seem to be interchangeable, i.e. using one or the other does not seem to affect the results. Is this really the case? When are floats and doubles interchangeable? What are the differences between them?
I just ran into an error that took me forever to figure out, and it may give you a good example of float precision.
In the output, the precision drops significantly after 0.83.
However, if I declare t as double, the issue doesn't happen. It took me five hours to track down this minor error, which ruined my program.
There are three floating point types: float, double, and long double.
As a Venn diagram, the sets of values of the types are nested: every float value is also a double value, and every double value is also a long double value.
The size of the numbers involved in floating-point calculations is not the most relevant thing. What matters is the calculation being performed.
In essence, if you're performing a calculation and the result is an irrational number or a recurring decimal, there will be rounding errors when that number is squashed into the finite-size data structure you're using. Since double is twice the size of float, the rounding error will be a lot smaller.
The tests may deliberately use numbers that trigger this kind of error, and so check that you used the appropriate type in your code.
Type float, 32 bits long, has a precision of about 7 digits. While it can store values across a very wide range (magnitudes up to about 3.4 * 10^38 and down to about 10^-38), it has only 7 significant digits.
Type double, 64 bits long, has a bigger range (about 10^+/-308) and about 15 digits of precision.
Type long double is nominally 80 bits, though a given compiler/OS pairing may store it as 12-16 bytes for alignment purposes. The long double has an exponent that is just ridiculously huge and should have about 19 digits of precision. Microsoft, in their infinite wisdom, limits long double to 8 bytes, the same as plain double.
Generally speaking, just use type double when you need a floating point value/variable. Literal floating point values used in expressions are treated as doubles by default, and most of the math functions that return floating point values return doubles. You'll save yourself many headaches and typecasts if you just use double.
Floats have less precision than doubles. Although you already know that, read What WE Should Know About Floating-Point Arithmetic for a better understanding.
When using floating point numbers you cannot trust that your local tests will behave exactly like the tests run on the server side. The environment and the compiler are probably different on your local system and on the machine where the final tests are run. I have seen this problem many times in TopCoder competitions, especially when comparing two floating point numbers.
The built-in comparison operations also behave differently: when you compare two floating-point numbers, the difference in data type (float or double) may lead to different outcomes.
Quantitatively, as other answers have pointed out, the difference is that type double has about twice the precision, and three times the range, of type float (depending on how you count).

But perhaps even more important is the qualitative difference. Type float has good precision, which will often be good enough for whatever you're doing. Type double, on the other hand, has excellent precision, which will almost always be good enough for whatever you're doing.

The upshot, which is not nearly as well known as it should be, is that you should almost always use type double. Unless you have some particularly special need, you should almost never use type float.

As everyone knows, "roundoff error" is often a problem when you're doing floating-point work. Roundoff error can be subtle, difficult to track down, and difficult to fix. Most programmers don't have the time or expertise to track down and fix numerical errors in floating-point algorithms, because unfortunately the details end up being different for every algorithm. But type double has enough precision that, much of the time, you don't have to worry. You'll get good results anyway. With type float, on the other hand, alarming-looking issues with roundoff crop up all the time.

And the thing that's not necessarily different between type float and double is execution speed. On most of today's general-purpose processors, arithmetic operations on type float and double take more or less exactly the same amount of time. Everything's done in parallel, so you don't pay a speed penalty for the greater range and precision of type double. That's why it's safe to recommend that you should almost never use type float: using double shouldn't cost you anything in speed, it shouldn't cost you much in space, and it will almost definitely pay off handsomely in freedom from precision and roundoff error woes.

(With that said, though, one of the "special needs" where you may need type float is when you're doing embedded work on a microcontroller, or writing code that's optimized for a GPU. On those processors, type double can be significantly slower, or practically nonexistent, so in those cases programmers do typically choose type float for speed, and maybe pay for it in precision.)
If one works with embedded processing, the underlying hardware (e.g. an FPGA or a specific processor / microcontroller model) may implement float optimally in hardware, whereas double falls back to software routines. So if the precision of a float is enough for the task, the program will run several times faster with float than with double. As noted in other answers, beware of accumulated errors.
Unlike an int (whole number), a float has a decimal point, and so does a double. The difference between the two is that a double is about twice as detailed as a float, meaning that it can hold roughly double the number of significant digits.
Huge difference.
As the name implies, a double has 2x the precision of float [1]. In general, a double has 15 decimal digits of precision, while float has 7.
Here's how the number of digits is calculated:
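Assuming the IEEE 754 binary64 and binary32 formats described in footnote [1], where the significand has one implicit leading bit in addition to the stored bits, the digit counts follow from the significand width:

```latex
\begin{aligned}
\text{double: } & 52 + 1 = 53 \text{ bits} &&\Rightarrow\; \log_{10}\!\left(2^{53}\right) = 53 \log_{10} 2 \approx 15.95 \text{ digits} \\
\text{float: }  & 23 + 1 = 24 \text{ bits} &&\Rightarrow\; \log_{10}\!\left(2^{24}\right) = 24 \log_{10} 2 \approx 7.22 \text{ digits}
\end{aligned}
```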
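This precision loss can lead to larger truncation errors accumulating when a calculation is repeated many times: a running sum kept in float drifts visibly, while the same sum kept in double stays far closer to the exact result.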
Also, the maximum value of float is about 3e38, but for double it is about 1.7e308, so using float can hit "infinity" (i.e. a special floating-point number) much more easily than double for something simple, e.g. computing the factorial of 60.
During testing, maybe a few test cases contain these huge numbers, which may cause your programs to fail if you use floats.
Of course, sometimes even double isn't accurate enough, hence we sometimes have long double [1] (the above example gives 9.000000000000000066 on Mac), but all floating point types suffer from round-off errors, so if precision is very important (e.g. money processing) you should use int or a fraction class.

Furthermore, don't use += to sum lots of floating point numbers, as the errors accumulate quickly. If you're using Python, use fsum. Otherwise, try to implement the Kahan summation algorithm.

[1]: The C and C++ standards do not specify the representation of float, double and long double. It is possible that all three are implemented as IEEE double-precision. Nevertheless, for most architectures (gcc, MSVC; x86, x64, ARM) float is indeed an IEEE single-precision floating point number (binary32), and double is an IEEE double-precision floating point number (binary64).
Here is what the C99 (ISO-IEC 9899 6.2.5 §10) and C++2003 (ISO-IEC 14882-2003 3.9.1 §8) standards say: there are three floating point types, float, double, and long double. Type double provides at least as much precision as float, and long double at least as much as double. The set of values of type float is a subset of the set of values of type double, which in turn is a subset of the set of values of type long double.
The C++ standard adds that the value representation of floating-point types is implementation-defined.
I would suggest having a look at the excellent What Every Computer Scientist Should Know About Floating-Point Arithmetic, which covers the IEEE floating-point standard in depth. You'll learn about the representation details and you'll realize there is a tradeoff between magnitude and precision. The absolute precision of the floating point representation increases as the magnitude decreases (the gap between adjacent representable values shrinks), hence floating point numbers between -1 and 1 are those with the most precision.
Given a quadratic equation: x^2 − 4.0000000 x + 3.9999999 = 0, the exact roots to 10 significant digits are r1 = 2.000316228 and r2 = 1.999683772.

Using float and double, we can write a test program that solves the equation in each type and compare the printed roots with the values above.

Note that the numbers aren't large, but you still get cancellation effects using float.

(In fact, the above is not the best way of solving quadratic equations using either single- or double-precision floating-point numbers, but the answer remains unchanged even if one uses a more stable method.)
A float is 32 bits.