
What is the difference between float and double?

Posted on 2024-08-24 04:18:35

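I've read about the difference between double precision and single precision. However, in most cases float and double seem to be interchangeable, i.e. using one or the other does not seem to affect the results. Is this really the case? When are floats and doubles interchangeable? What are the differences between them?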


反差帅 2024-08-31 04:18:36

I just ran into an error that took me forever to figure out, and it may serve as a good example of float precision.

#include <iostream>
#include <iomanip>

int main(){
  for(float t=0;t<1;t+=0.01){
     std::cout << std::fixed << std::setprecision(6) << t << std::endl;
  }
}

In the output, the precision degrades noticeably after 0.83.

However, if I declare t as double, the problem does not occur.

It took me five hours to realize this minor error, which ruined my program.
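Besides switching t to double, one way to avoid this kind of drift is to drive the loop with an integer counter and derive each value from it, so that rounding error cannot build up from one step to the next. The sketch below assumes the loop only needs evenly spaced values; the helper name fill_steps is made up for illustration.

```c
/* Hypothetical helper: writes the n evenly spaced values i * step into
   out[] and returns how many were written. Each value is computed from
   the integer counter, so errors do not accumulate across iterations. */
int fill_steps(double *out, int n, double step)
{
    int i;
    for (i = 0; i < n; ++i)
        out[i] = i * step;
    return i;
}
```

Calling fill_steps(values, 100, 0.01) yields exactly 100 values, whereas the iteration count of a loop that tests an accumulated float against 1 depends on how the rounding happens to fall.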


隔纱相望 2024-08-31 04:18:36

There are three floating-point types:

float
double
long double

A simple Venn diagram explains it:
[Figure: the sets of values of the types]


别理我 2024-08-31 04:18:36

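The size of the numbers involved in a floating-point calculation is not the most relevant thing. What matters is the calculation being performed.

In essence, if you perform a calculation and the result is an irrational number or a recurring decimal, there will be rounding errors when that number is squeezed into the finite-size data structure you are using. Since double is twice the size of float, the rounding error will be a lot smaller.

The tests may specifically use numbers that cause this kind of error, and so check that you used the appropriate type in your code.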


不念旧人 2024-08-31 04:18:36

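Type float is 32 bits long and has a precision of 7 digits. While it can store values with a very large or very small range (+/- 3.4 * 10^38 or * 10^-38), it has only 7 significant digits.

Type double is 64 bits long and has a bigger range (* 10^+/-308) and 15 digits of precision.

Type long double is nominally 80 bits, though a given compiler/OS pairing may store it as 12-16 bytes for alignment purposes. The long double has an exponent that is just ridiculously huge and should have 19 digits of precision. Microsoft, in their infinite wisdom, limits long double to 8 bytes, the same as plain double.

Generally speaking, just use type double when you need a floating-point value or variable. Literal floating-point values used in expressions are treated as doubles by default, and most of the math functions that return floating-point values return doubles. You'll save yourself many headaches and typecasts if you just use double.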
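The figures above can be checked on a given platform with the macros in <float.h>. This is only a sketch, and the function names are made up for illustration. Note that FLT_DIG and DBL_DIG report the number of decimal digits guaranteed to survive a round trip, which is 6 and 15 on typical IEEE 754 systems, slightly more conservative than the "7 digits" rule of thumb.

```c
#include <float.h>
#include <stdio.h>

/* Prints the size, guaranteed decimal digits and largest finite value of
   each type. All of these are implementation-defined. */
void print_fp_limits(void)
{
    printf("float:       %zu bytes, %d digits, max %g\n",
           sizeof(float), FLT_DIG, (double)FLT_MAX);
    printf("double:      %zu bytes, %d digits, max %g\n",
           sizeof(double), DBL_DIG, DBL_MAX);
    printf("long double: %zu bytes, %d digits, max %Lg\n",
           sizeof(long double), LDBL_DIG, LDBL_MAX);
}

/* An unsuffixed literal has type double; the f suffix makes it a float. */
int literal_is_double(void) { return sizeof(1.0) == sizeof(double); }
int suffixed_is_float(void) { return sizeof(1.0f) == sizeof(float); }
```

Running print_fp_limits() on the compiler you actually target is the quickest way to see, for example, whether long double is any wider than double there.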


女皇必胜 2024-08-31 04:18:36

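A float has lower precision than a double. Although you already know that, read What We Should Know About Floating-Point Arithmetic for a better understanding.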


暖心男生 2024-08-31 04:18:36

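When using floating-point numbers, you cannot trust that your local tests will behave exactly like the tests run on the server side. The environment and the compiler on your local system are probably different from those where the final tests are run. I have seen this problem many times in TopCoder competitions, especially when comparing two floating-point numbers.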


风为裳 2024-08-31 04:18:36

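The built-in comparison operations behave differently: when you compare two floating-point numbers, a difference in data type (float versus double) can produce a different result.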
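A small sketch of the effect, assuming IEEE 754 arithmetic: 0.1 cannot be represented exactly in binary, so the float and the double nearest to it are different values, and == between them is false. A common workaround is to compare with a tolerance. The helper nearly_equal and its parameters are hypothetical, and the right tolerance depends on the computation.

```c
#include <math.h>

/* The float nearest to 0.1 and the double nearest to 0.1 differ, so this
   returns 0 on IEEE 754 systems. f is converted to double for the test. */
int float_literal_equals_double(void)
{
    float f = 0.1f;
    double d = 0.1;
    return f == d;
}

/* Hypothetical helper: true when a and b differ by at most rel_tol
   relative to the larger magnitude, or by at most abs_tol near zero. */
int nearly_equal(double a, double b, double rel_tol, double abs_tol)
{
    double diff = fabs(a - b);
    double scale = fabs(a) > fabs(b) ? fabs(a) : fabs(b);
    double bound = rel_tol * scale;
    if (bound < abs_tol)
        bound = abs_tol;
    return diff <= bound;
}
```

With a relative tolerance of 1e-6, nearly_equal(0.1f, 0.1, 1e-6, 0.0) treats the two values as equal, while the exact comparison does not.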


九公里浅绿 2024-08-31 04:18:36

Quantitatively, as other answers have pointed out, the difference is that type double has about twice the precision, and three times the range, as type float (depending on how you count).

But perhaps even more important is the qualitative difference. Type float has good precision, which will often be good enough for whatever you're doing. Type double, on the other hand, has excellent precision, which will almost always be good enough for whatever you're doing.

The upshot, which is not nearly as well known as it should be, is that you should almost always use type double. Unless you have some particularly special need, you should almost never use type float.

As everyone knows, "roundoff error" is often a problem when you're doing floating-point work. Roundoff error can be subtle, difficult to track down, and difficult to fix. Most programmers don't have the time or expertise to track down and fix numerical errors in floating-point algorithms, because unfortunately the details end up being different for every algorithm. But type double has enough precision that, much of the time, you don't have to worry. You'll get good results anyway. With type float, on the other hand, alarming-looking issues with roundoff crop up all the time.

And the thing that's not necessarily different between type float and double is execution speed. On most of today's general-purpose processors, arithmetic operations on type float and double take more or less exactly the same amount of time. Everything's done in parallel, so you don't pay a speed penalty for the greater range and precision of type double. That's why it's safe to make the recommendation that you should almost never use type float: Using double shouldn't cost you anything in speed, and it shouldn't cost you much in space, and it will almost definitely pay off handsomely in freedom from precision and roundoff error woes.

(With that said, though, one of the "special needs" where you may need type float is when you're doing embedded work on a microcontroller, or writing code that's optimized for a GPU. On those processors, type double can be significantly slower, or practically nonexistent, so in those cases programmers do typically choose type float for speed, and maybe pay for it in precision.)


痴意少年 2024-08-31 04:18:36

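If you work with embedded processing, the underlying hardware (e.g. an FPGA or some specific processor or microcontroller model) will eventually have float implemented optimally in hardware, whereas double will use software routines. So if the precision of a float is enough for your needs, the program will run several times faster with float than with double. As noted in other answers, beware of accumulated errors.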


梨涡少年 2024-08-31 04:18:36

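Unlike an int (a whole number), a float has a decimal point, and so does a double.
The difference between the two is that a double is about twice as detailed as a float, meaning it can hold roughly twice as many significant digits.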


维持三分热 2024-08-31 04:18:35

Huge difference.

As the name implies, a double has 2x the precision of float [1]. In general a double has 15 decimal digits of precision, while float has 7.

Here's how the number of digits is calculated:

double has 52 mantissa bits + 1 hidden bit: log(2⁵³)÷log(10) = 15.95 digits
float has 23 mantissa bits + 1 hidden bit: log(2²⁴)÷log(10) = 7.22 digits
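The same arithmetic can be reproduced in code. This is just a sketch: decimal_digits is a made-up helper, and the bit counts come from FLT_MANT_DIG and DBL_MANT_DIG in <float.h>, which are 24 and 53 on IEEE 754 systems (hidden bit included).

```c
#include <float.h>
#include <math.h>

/* Decimal digits carried by a binary significand with the given number
   of bits (hidden bit included): bits * log10(2). */
double decimal_digits(int significand_bits)
{
    return significand_bits * log10(2.0);
}
```

On such a system, decimal_digits(FLT_MANT_DIG) comes out at about 7.22 and decimal_digits(DBL_MANT_DIG) at about 15.95, matching the figures above.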

This precision loss could lead to greater truncation errors being accumulated when repeated calculations are done, e.g.

float a = 1.f / 81;
float b = 0;
for (int i = 0; i < 729; ++ i)
    b += a;
printf("%.7g\n", b); // prints 9.000023

while

double a = 1.0 / 81;
double b = 0;
for (int i = 0; i < 729; ++ i)
    b += a;
printf("%.15g\n", b); // prints 8.99999999999996

Also, the maximum value of float is about 3e38, but double is about 1.7e308, so using float can hit "infinity" (i.e. a special floating-point number) much more easily than double for something simple, e.g. computing the factorial of 60.

During testing, maybe a few test cases contain these huge numbers, which may cause your programs to fail if you use floats.
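The factorial case can be sketched as follows. The function names are made up, and the comments assume IEEE 754 single and double precision: 34! is about 2.95e38 and still fits in a float, while 35! is about 1.03e40 and does not.

```c
#include <math.h>

/* n! accumulated in float; becomes +infinity once the product exceeds
   FLT_MAX (about 3.4e38 for IEEE single precision). */
float factorial_f(int n)
{
    float r = 1.0f;
    for (int i = 2; i <= n; ++i)
        r *= (float)i;
    return r;
}

/* The same computation in double, whose range reaches about 1.7e308. */
double factorial_d(int n)
{
    double r = 1.0;
    for (int i = 2; i <= n; ++i)
        r *= (double)i;
    return r;
}

/* True when n! overflows in float but is still finite in double. */
int overflows_only_in_float(int n)
{
    return isinf(factorial_f(n)) && isfinite(factorial_d(n));
}
```

For n = 60 the float version returns infinity while the double version returns roughly 8.32e81.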

Of course, sometimes even double isn't accurate enough, which is why we sometimes have long double [1] (the above example gives 9.000000000000000066 on Mac). All floating-point types suffer from round-off errors, though, so if precision is very important (e.g. money processing) you should use int or a fraction class.

Furthermore, don't use += to sum lots of floating point numbers, as the errors accumulate quickly. If you're using Python, use fsum. Otherwise, try to implement the Kahan summation algorithm.
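A minimal sketch of Kahan summation in C, for readers who have not seen it. It assumes the compiler is not allowed to reassociate floating-point operations (options such as -ffast-math can optimize the compensation term away), and the function names are illustrative.

```c
/* Kahan (compensated) summation of n floats. The variable c carries the
   low-order bits that a plain += would discard at each step. */
float kahan_sum(const float *x, int n)
{
    float sum = 0.0f;
    float c = 0.0f;               /* running compensation */
    for (int i = 0; i < n; ++i) {
        float y = x[i] - c;       /* apply the correction from last step */
        float t = sum + y;        /* low-order bits of y may be lost here */
        c = (t - sum) - y;        /* recover what was lost */
        sum = t;
    }
    return sum;
}

/* Plain accumulation, for comparison. */
float naive_sum(const float *x, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; ++i)
        sum += x[i];
    return sum;
}
```

Applied to 729 copies of 1.0f / 81, the compensated sum should land much closer to 9 than the plain loop, while still using only float arithmetic.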

[1]: The C and C++ standards do not specify the representation of float, double and long double. It is possible that all three are implemented as IEEE double-precision. Nevertheless, for most architectures (gcc, MSVC; x86, x64, ARM) float is indeed an IEEE single-precision floating-point number (binary32), and double is an IEEE double-precision floating-point number (binary64).


浅笑轻吟梦一曲 2024-08-31 04:18:35

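Here is what the C99 (ISO-IEC 9899 6.2.5 §10) and C++2003 (ISO-IEC 14882-2003 3.9.1 §8) standards say:

There are three floating point types: float, double, and long double. The type double provides at least as much precision as float, and the type long double provides at least as much precision as double. The set of values of the type float is a subset of the set of values of the type double; the set of values of the type double is a subset of the set of values of the type long double.

The C++ standard adds:

The value representation of floating-point types is implementation-defined.

I would suggest having a look at the excellent What Every Computer Scientist Should Know About Floating-Point Arithmetic, which covers the IEEE floating-point standard in depth. You'll learn about the representation details, and you'll realize there is a tradeoff between magnitude and precision. The absolute precision of a floating-point representation increases as the magnitude decreases, so floating-point numbers between -1 and 1 are the ones with the most precision.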


把人绕傻吧 2024-08-31 04:18:35

Given a quadratic equation: x² − 4.0000000 x + 3.9999999 = 0, the exact roots to 10 significant digits are r₁ = 2.000316228 and r₂ = 1.999683772.

Using float and double, we can write a test program:

#include <stdio.h>
#include <math.h>

void dbl_solve(double a, double b, double c)
{
    double d = b*b - 4.0*a*c;
    double sd = sqrt(d);
    double r1 = (-b + sd) / (2.0*a);
    double r2 = (-b - sd) / (2.0*a);
    printf("%.5f\t%.5f\n", r1, r2);
}

void flt_solve(float a, float b, float c)
{
    float d = b*b - 4.0f*a*c;
    float sd = sqrtf(d);
    float r1 = (-b + sd) / (2.0f*a);
    float r2 = (-b - sd) / (2.0f*a);
    printf("%.5f\t%.5f\n", r1, r2);
}

int main(void)
{
    float fa = 1.0f;
    float fb = -4.0000000f;
    float fc = 3.9999999f;
    double da = 1.0;
    double db = -4.0000000;
    double dc = 3.9999999;
    flt_solve(fa, fb, fc);
    dbl_solve(da, db, dc);
    return 0;
}

Running the program gives me:

2.00000 2.00000
2.00032 1.99968

Note that the numbers aren't large, but still you get cancellation effects using float.

(In fact, the above is not the best way of solving quadratic equations using either single- or double-precision floating-point numbers, but the answer remains unchanged even if one uses a more stable method.)
