放飞的风筝 2025-01-17 12:57:06

Binary floating point math works like this. In most programming languages, it is based on the IEEE 754 standard. The crux of the problem is that numbers are represented in this format as a whole number times a power of two; rational numbers (such as 0.1, which is 1/10) whose denominator is not a power of two cannot be exactly represented.

For 0.1 in the standard binary64 format, the representation can be written exactly as

0.1000000000000000055511151231257827021181583404541015625 in decimal, or
0x1.999999999999ap-4 in C99 hexfloat notation.

In contrast, the rational number 0.1, which is 1/10, can be written exactly as

0.1 in decimal, or
0x1.99999999999999...p-4 in an analog of C99 hexfloat notation, where the ... represents an unending sequence of 9's.

The constants 0.2 and 0.3 in your program will also be approximations to their true values. It happens that the closest double to 0.2 is larger than the rational number 0.2 but that the closest double to 0.3 is smaller than the rational number 0.3. The sum of 0.1 and 0.2 winds up being larger than the rational number 0.3 and hence disagreeing with the constant in your code.
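The disagreement described above is easy to observe directly. A minimal sketch in JavaScript, whose numbers are IEEE 754 doubles; the variable name `sum` is only illustrative:

```javascript
// JavaScript numbers are IEEE 754 binary64 doubles.
const sum = 0.1 + 0.2;

console.log(sum === 0.3); // false: the two doubles differ
console.log(sum > 0.3);   // true: the sum lands above the double nearest 0.3

// toFixed with many decimals shows digits that the default display hides.
console.log(sum.toFixed(20));
console.log((0.3).toFixed(20));
```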

A fairly comprehensive treatment of floating-point arithmetic issues is What Every Computer Scientist Should Know About Floating-Point Arithmetic. For an easier-to-digest explanation, see floating-point-gui.de.

Side Note: All positional (base-N) number systems share this problem with precision

Plain old decimal (base 10) numbers have the same issues, which is why numbers like 1/3 end up as 0.333333333...

You've just stumbled on a number (3/10) that happens to be easy to represent with the decimal system but doesn't fit the binary system. It goes both ways (to some small degree) as well: 1/16 is an ugly number in decimal (0.0625), but in binary it looks as neat as a 10,000th does in decimal (0.0001) - if we were in the habit of using a base-2 number system in our daily lives, you'd even look at that number and instinctively understand you could arrive there by halving something, halving it again, and again and again.

Of course, that's not exactly how floating-point numbers are stored in memory (they use a form of scientific notation). However, it does illustrate the point that binary floating-point precision errors tend to crop up because the "real world" numbers we are usually interested in working with are so often powers of ten - but only because we use a decimal number system day-to-day. This is also why we'll say things like 71% instead of "5 out of every 7" (71% is an approximation since 5/7 can't be represented exactly with any decimal number).

So, no: binary floating point numbers are not broken, they just happen to be as imperfect as every other base-N number system :)
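The side note can be made concrete: a fraction has a finite expansion in base N exactly when, once reduced, its denominator has no prime factor that N lacks. The sketch below checks that condition; `gcd` and `terminatesInBase` are hypothetical helper names, and the inputs are assumed to be positive integers.

```javascript
// Greatest common divisor of two positive integers (Euclid's algorithm).
function gcd(a, b) {
  while (b !== 0) {
    [a, b] = [b, a % b];
  }
  return a;
}

// True if numerator/denominator has a finite expansion in the given base.
function terminatesInBase(numerator, denominator, base) {
  // Reduce the fraction first.
  let d = denominator / gcd(numerator, denominator);
  // Strip every factor that the denominator shares with the base.
  let g = gcd(d, base);
  while (g > 1) {
    while (d % g === 0) d /= g;
    g = gcd(d, base);
  }
  // Anything left over is a prime factor the base cannot express.
  return d === 1;
}

console.log(terminatesInBase(3, 10, 10)); // true:  0.3
console.log(terminatesInBase(3, 10, 2));  // false: repeats forever in binary
console.log(terminatesInBase(1, 16, 2));  // true:  0.0001 in binary
console.log(terminatesInBase(1, 3, 10));  // false: 0.333...
```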

Side Note: Working with Floats in Programming

In practice, this problem of precision means you need to use rounding functions to round your floating point numbers off to however many decimal places you're interested in before you display them.
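As a small illustration in JavaScript, `toFixed` is one such rounding function; it returns a string, so it suits display rather than further arithmetic. The variable names here are only illustrative.

```javascript
const total = 0.1 + 0.2;

// toFixed returns a string rounded to the requested number of decimals.
const shown = total.toFixed(2);
console.log(shown); // "0.30"

// Rounding changes only what is displayed; the stored value is untouched.
console.log(total === 0.3); // false
```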

You also need to replace equality tests with comparisons that allow some amount of tolerance, which means:

Do not do if (x == y) { ... }

Instead do if (abs(x - y) < myToleranceValue) { ... }.

where abs is the absolute value. myToleranceValue needs to be chosen for your particular application - and it will have a lot to do with how much "wiggle room" you are prepared to allow, and what the largest number you are going to be comparing may be (due to loss of precision issues). Beware of "epsilon" style constants in your language of choice. These can be used as tolerance values but their effectiveness depends on the magnitude (size) of the numbers you're working with, since calculations with large numbers may exceed the epsilon threshold.
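One common way to handle the magnitude problem is to combine an absolute tolerance (useful near zero) with a relative one (which scales with the operands). The following is a sketch under that assumption, not a universal answer: `nearlyEqual` is a hypothetical helper and the default tolerances are placeholders to be tuned per application.

```javascript
// Hypothetical helper: true when a and b agree within an absolute
// tolerance or within a tolerance relative to the larger magnitude.
function nearlyEqual(a, b, absTol = 1e-12, relTol = 1e-9) {
  if (a === b) return true; // also covers equal infinities
  const diff = Math.abs(a - b);
  if (diff <= absTol) return true;
  return diff <= relTol * Math.max(Math.abs(a), Math.abs(b));
}

console.log(nearlyEqual(0.1 + 0.2, 0.3)); // true
console.log(nearlyEqual(1, 1.1));         // false

// A fixed epsilon is too strict for large values: neighbouring doubles
// near 1e16 are 2 apart, far more than Number.EPSILON.
console.log(Math.abs((1e16 + 2) - 1e16) < Number.EPSILON); // false
console.log(nearlyEqual(1e16 + 2, 1e16));                  // true
```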


久夏青 2025-01-17 12:57:06

A Hardware Designer's Perspective

I believe I should add a hardware designer’s perspective to this since I design and build floating point hardware. Knowing the origin of the error may help in understanding what is happening in the software, and ultimately, I hope this helps explain the reasons for why floating point errors happen and seem to accumulate over time.

1. Overview

From an engineering perspective, most floating point operations will have some element of error since the hardware that does the floating point computations is only required to have an error of less than one half of one unit in the last place. Therefore, much hardware will stop at a precision that's only necessary to yield an error of less than one half of one unit in the last place for a single operation which is especially problematic in floating point division. What constitutes a single operation depends upon how many operands the unit takes. For most, it is two, but some units take 3 or more operands. Because of this, there is no guarantee that repeated operations will result in a desirable error since the errors add up over time.

2. Standards

Most processors follow the IEEE-754 standard, but some use denormalized or different standards. For example, there is a denormalized mode in IEEE-754 which allows representation of very small floating point numbers at the expense of precision. The following, however, will cover the normalized mode of IEEE-754, which is the typical mode of operation.

In the IEEE-754 standard, hardware designers are allowed any value of error/epsilon as long as it's less than one half of one unit in the last place, and the result only has to be less than one half of one unit in the last place for one operation. This explains why when there are repeated operations, the errors add up. For IEEE-754 double precision, this is the 54th bit, since 53 bits are used to represent the numeric part (normalized), also called the mantissa, of the floating point number (e.g. the 5.3 in 5.3e5). The next sections go into more detail on the causes of hardware error on various floating point operations.
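The 53-bit significand can be observed from software. A short JavaScript sketch, relying only on built-in constants; the checks show where integers stop being exactly representable and what happens to a change of half a unit in the last place:

```javascript
// 2^53 is the point where consecutive integers stop being representable.
console.log(2 ** 53 + 1 === 2 ** 53);                 // true: rounds back to 2^53
console.log(Number.MAX_SAFE_INTEGER === 2 ** 53 - 1); // true

// Number.EPSILON is the gap between 1 and the next double, i.e. 2^-52.
console.log(Number.EPSILON === 2 ** -52);             // true

// Half of that gap is a tie and rounds back to 1 (ties go to the even significand).
console.log(1 + Number.EPSILON / 2 === 1);            // true
```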

3. Cause of Rounding Error in Division

The main cause of the error in floating point division is the division algorithm used to calculate the quotient. Most computer systems calculate division using multiplication by an inverse, i.e. Z = X/Y is computed as Z = X * (1/Y). A division is computed iteratively, i.e. each cycle computes some bits of the quotient until the desired precision is reached, which for IEEE-754 is anything with an error of less than one unit in the last place. The table of reciprocals of Y (1/Y) is known as the quotient selection table (QST) in slow division, and the size in bits of the quotient selection table is usually the width of the radix, or the number of bits of the quotient computed in each iteration, plus a few guard bits. For the IEEE-754 standard, double precision (64-bit), it would be the size of the radix of the divider, plus a few guard bits k, where k>=2. So for example, a typical quotient selection table for a divider that computes 2 bits of the quotient at a time (radix 4) would be 2+2 = 4 bits (plus a few optional bits).

3.1 Division Rounding Error: Approximation of Reciprocal

What reciprocals are in the quotient selection table depend on the division method: slow division such as SRT division, or fast division such as Goldschmidt division; each entry is modified according to the division algorithm in an attempt to yield the lowest possible error. In any case, though, all reciprocals are approximations of the actual reciprocal and introduce some element of error. Both slow division and fast division methods calculate the quotient iteratively, i.e. some number of bits of the quotient are calculated each step, then the result is subtracted from the dividend, and the divider repeats the steps until the error is less than one half of one unit in the last place. Slow division methods calculate a fixed number of digits of the quotient in each step and are usually less expensive to build, and fast division methods calculate a variable number of digits per step and are usually more expensive to build. The most important part of the division methods is that most of them rely upon repeated multiplication by an approximation of a reciprocal, so they are prone to error.

4. Rounding Errors in Other Operations: Truncation

Another cause of rounding error in all operations is the mode used to truncate the final answer. IEEE-754 allows truncation (round-towards-zero), round-to-nearest (the default), round-down, and round-up. All methods introduce an element of error of less than one unit in the last place for a single operation. Over time and repeated operations, truncation also adds cumulatively to the resultant error. This truncation error is especially problematic in exponentiation, which involves some form of repeated multiplication.

5. Repeated Operations

Since the hardware that does the floating point calculations only needs to yield a result with an error of less than one half of one unit in the last place for a single operation, the error will grow over repeated operations if not watched. This is why, in computations that require a bounded error, mathematicians use methods such as round-to-nearest-even in the last place (because, over time, the errors are more likely to cancel each other out) and Interval Arithmetic combined with variations of the IEEE 754 rounding modes to predict rounding errors and correct them. Because of its low relative error compared to other rounding modes, round-to-nearest-even (in the last place) is the default rounding mode of IEEE-754.

Note that the default rounding mode, round-to-nearest even digit in the last place, guarantees an error of less than one half of one unit in the last place for one operation. Using the truncation, round-up, and round down alone may result in an error that is greater than one half of one unit in the last place, but less than one unit in the last place, so these modes are not recommended unless they are used in Interval Arithmetic.

6. Summary

In short, the fundamental reason for the errors in floating point operations is a combination of the truncation in hardware, and the truncation of a reciprocal in the case of division. Since the IEEE-754 standard only requires an error of less than one half of one unit in the last place for a single operation, the floating point errors over repeated operations will add up unless corrected.
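The accumulation over repeated operations shows up in a loop as short as ten additions. The sketch below also includes Kahan (compensated) summation as one possible software correction; that technique is not named in the answer above and is shown only as an illustration, with `kahanSum` as a hypothetical helper name.

```javascript
// Add 0.1 ten times; every addition rounds its result once.
let naive = 0;
for (let i = 0; i < 10; i++) {
  naive += 0.1;
}
console.log(naive);       // 0.9999999999999999
console.log(naive === 1); // false

// Kahan summation: carry the rounding error of each step into the next one.
function kahanSum(values) {
  let sum = 0;
  let compensation = 0; // running estimate of the error lost so far
  for (const v of values) {
    const y = v - compensation;
    const t = sum + y;
    compensation = (t - sum) - y; // what the addition just rounded away
    sum = t;
  }
  return sum;
}

const compensated = kahanSum(new Array(10).fill(0.1));
console.log(compensated === 1); // true
```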


老娘不死你永远是小三 2025-01-17 12:57:06

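Floating-point representation is broken in exactly the same way that the decimal (base-10) notation you learned in grade school and use every day is broken, just in base 2.

To understand this, consider representing 2/3 as a decimal value. It is impossible to do exactly! The world will end before you finish writing the 6s after the decimal point, so instead we write out to some number of places, round the last digit to a 7, and consider it accurate enough.

In the same way, 1/10 (decimal 0.1) cannot be represented exactly as a "decimal" value in base 2 (binary); a repeating pattern after the binary point goes on forever. The value is not exact, so you cannot do exact math with it using normal floating-point methods. Just as with base 10, other values exhibit this problem as well.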


陌若浮生 2025-01-17 12:57:06

Most answers here address this question in very dry, technical terms. I'd like to address this in terms that normal human beings can understand.

Imagine that you are trying to slice up pizzas. You have a robotic pizza cutter that can cut pizza slices exactly in half. It can halve a whole pizza, or it can halve an existing slice, but in any case, the halving is always exact.

That pizza cutter has very fine movements, and if you start with a whole pizza, then halve that, and continue halving the smallest slice each time, you can do the halving 53 times before the slice is too small for even its high-precision abilities. At that point, you can no longer halve that very thin slice, but must either include or exclude it as is.

Now, how would you piece all the slices in such a way that would add up to one-tenth (0.1) or one-fifth (0.2) of a pizza? Really think about it, and try working it out. You can even try to use a real pizza, if you have a mythical precision pizza cutter at hand. :-)

Most experienced programmers, of course, know the real answer, which is that there is no way to piece together an exact tenth or fifth of the pizza using those slices, no matter how finely you slice them. You can do a pretty good approximation, and if you add up the approximation of 0.1 with the approximation of 0.2, you get a pretty good approximation of 0.3, but it's still just that, an approximation.

For double-precision numbers (which is the precision that allows you to halve your pizza 53 times), the numbers immediately less and greater than 0.1 are 0.09999999999999999167332731531132594682276248931884765625 and 0.1000000000000000055511151231257827021181583404541015625. The latter is quite a bit closer to 0.1 than the former, so a numeric parser will, given an input of 0.1, favour the latter.

(The difference between those two numbers is the "smallest slice" that we must decide to either include, which introduces an upward bias, or exclude, which introduces a downward bias. The technical term for that smallest slice is an ulp.)

In the case of 0.2, the numbers are all the same, just scaled up by a factor of 2. Again, we favour the value that's slightly higher than 0.2.

Notice that in both cases, the approximations for 0.1 and 0.2 have a slight upward bias. If we add enough of these biases in, they will push the number further and further away from what we want, and in fact, in the case of 0.1 + 0.2, the bias is high enough that the resulting number is no longer the closest number to 0.3.

In particular, 0.1 + 0.2 is really 0.1000000000000000055511151231257827021181583404541015625 + 0.200000000000000011102230246251565404236316680908203125 = 0.3000000000000000444089209850062616169452667236328125, whereas the number closest to 0.3 is actually 0.299999999999999988897769753748434595763683319091796875.
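The neighbouring doubles mentioned above can be found by stepping the underlying 64-bit pattern up or down by one. This is a sketch that assumes finite, positive inputs; `nextUp` and `nextDown` are hypothetical helpers, not built-in JavaScript functions.

```javascript
const view = new DataView(new ArrayBuffer(8));

// Next double above x. Sketch only: assumes x is finite and positive.
function nextUp(x) {
  view.setFloat64(0, x);
  view.setBigUint64(0, view.getBigUint64(0) + 1n);
  return view.getFloat64(0);
}

// Next double below x, under the same assumption.
function nextDown(x) {
  view.setFloat64(0, x);
  view.setBigUint64(0, view.getBigUint64(0) - 1n);
  return view.getFloat64(0);
}

// The "smallest slice" (one ulp) just below 0.1 is 2^-56.
console.log(0.1 - nextDown(0.1) === 2 ** -56); // true

// 0.1 + 0.2 is not the double nearest 0.3, but the very next one above it.
console.log(nextUp(0.3) === 0.1 + 0.2);        // true
```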

P.S. Some programming languages also provide pizza cutters that can split slices into exact tenths. Although such pizza cutters are uncommon, if you do have access to one, you should use it when it's important to be able to get exactly one-tenth or one-fifth of a slice.

(Originally posted on Quora.)


渡你暖光 2025-01-17 12:57:06

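Floating-point rounding errors. 0.1 cannot be represented as accurately in base 2 as in base 10 because base 2 is missing the prime factor 5. Just as 1/3 takes an infinite number of digits to represent in decimal but is "0.1" in base 3, 0.1 takes an infinite number of digits in base 2 where it does not in base 10. And computers don't have an infinite amount of memory.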


若能看破又如何 2025-01-17 12:57:06

My answer is quite long, so I've split it into three sections. Since the question is about floating point mathematics, I've put the emphasis on what the machine actually does. I've also made it specific to double (64 bit) precision, but the argument applies equally to any floating point arithmetic.

Preamble

An IEEE 754 double-precision binary floating-point format (binary64) number represents a number of the form

value = (-1)^s * (1.m₅₁m₅₀...m₂m₁m₀)₂ * 2^(e-1023)

in 64 bits:

The first bit is the sign bit: 1 if the number is negative, 0 otherwise¹.
The next 11 bits are the exponent, which is offset by 1023. In other words, after reading the exponent bits from a double-precision number, 1023 must be subtracted to obtain the power of two.
The remaining 52 bits are the significand (or mantissa). In the mantissa, an 'implied' 1. is always² omitted since the most significant bit of any binary value is 1.

¹ - IEEE 754 allows for the concept of a signed zero - +0 and -0 are treated differently: 1 / (+0) is positive infinity; 1 / (-0) is negative infinity. For zero values, the mantissa and exponent bits are all zero. Note: zero values (+0 and -0) are explicitly not classed as denormal².

² - This is not the case for denormal numbers, which have an offset exponent of zero (and an implied 0.). The range of denormal double precision numbers is d_min ≤ |x| ≤ d_max, where d_min (the smallest representable nonzero number) is 2^{-1023 - 51} (≈ 4.94 * 10^-324) and d_max (the largest denormal number, for which the mantissa consists entirely of 1s) is 2^{-1023 + 1} - 2^{-1023 - 51} (≈ 2.225 * 10^-308).
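The layout described above can be checked by pulling the three fields out of a double and rebuilding the value from the formula. This is a sketch for normal (not denormal, infinite or NaN) numbers; `decodeDouble` and `encodeNormal` are hypothetical helper names.

```javascript
// Split a double into its three IEEE 754 binary64 fields.
function decodeDouble(x) {
  const view = new DataView(new ArrayBuffer(8));
  view.setFloat64(0, x);
  const bits = view.getBigUint64(0);
  return {
    sign: Number(bits >> 63n),                      // 1 bit
    biasedExponent: Number((bits >> 52n) & 0x7ffn), // 11 bits, offset by 1023
    fraction: bits & ((1n << 52n) - 1n),            // 52 bits, kept as a BigInt
  };
}

// Rebuild a normal number: (-1)^s * (1.fraction) * 2^(e - 1023).
function encodeNormal({ sign, biasedExponent, fraction }) {
  const significand = 1 + Number(fraction) / 2 ** 52; // restore the implied 1
  return (-1) ** sign * significand * 2 ** (biasedExponent - 1023);
}

const fields = decodeDouble(0.1);
console.log(fields.sign, fields.biasedExponent - 1023); // 0 -4
console.log(encodeNormal(fields) === 0.1);              // true
```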

Turning a double precision number to binary

Many online converters exist to convert a double precision floating point number to binary (e.g. at binaryconvert.com), but here is some sample C# code to obtain the IEEE 754 representation for a double precision number (I separate the three parts with colons (:)):

// Requires: using System;
// Returns the IEEE 754 binary64 bit pattern of value as "sign:exponent:mantissa".
public static string BinaryRepresentation(double value)
{
    // Reinterpret the 64 bits of the double as a signed 64-bit integer.
    long valueInLongType = BitConverter.DoubleToInt64Bits(value);
    // Convert.ToString drops leading zeros, so pad back out to 64 bits.
    string binaryRepresentation = Convert.ToString(valueInLongType, 2).PadLeft(64, '0');

    string sign = binaryRepresentation[0].ToString();        // 1 bit
    string exponent = binaryRepresentation.Substring(1, 11); // 11 bits, offset by 1023
    string mantissa = binaryRepresentation.Substring(12);    // 52 bits

    return string.Format("{0}:{1}:{2}", sign, exponent, mantissa);
}
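For readers working in JavaScript, a rough counterpart of the C# method might look like the following. It is a sketch with a hypothetical function name, `binaryRepresentation`, built on the standard `DataView` API.

```javascript
// Returns the IEEE 754 binary64 bit pattern as "sign:exponent:mantissa".
function binaryRepresentation(value) {
  const view = new DataView(new ArrayBuffer(8));
  view.setFloat64(0, value);
  // Read the same 8 bytes back as an unsigned 64-bit integer, then pad to 64 bits.
  const bits = view.getBigUint64(0).toString(2).padStart(64, "0");
  return `${bits.slice(0, 1)}:${bits.slice(1, 12)}:${bits.slice(12)}`;
}

console.log(binaryRepresentation(0.1));
console.log(binaryRepresentation(0.2));
console.log(binaryRepresentation(0.1 + 0.2));
```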

Getting to the point: the original question

(Skip to the bottom for the TL;DR version)

Cato Johnston (the question asker) asked why 0.1 + 0.2 != 0.3.

Written in binary (with colons separating the three parts), the IEEE 754 representations of the values are:

0.1 => 0:01111111011:1001100110011001100110011001100110011001100110011010
0.2 => 0:01111111100:1001100110011001100110011001100110011001100110011010

Note that the mantissa is composed of recurring digits of 0011. This is key to why there is any error to the calculations - 0.1, 0.2 and 0.3 cannot be represented in binary precisely in a finite number of binary bits any more than 1/9, 1/3 or 1/7 can be represented precisely in decimal digits.

Also note that we can decrease the power in the exponent by 52 and shift the point in the binary representation to the right by 52 places (much like 10^-3 * 1.23 == 10^-5 * 123). This then enables us to represent the binary representation as the exact value that it represents in the form a * 2^p, where 'a' is an integer.

Converting the exponents to decimal, removing the offset, and re-adding the implied 1 (in square brackets), 0.1 and 0.2 are:

0.1 => 2^-4 * [1].1001100110011001100110011001100110011001100110011010
0.2 => 2^-3 * [1].1001100110011001100110011001100110011001100110011010
or
0.1 => 2^-56 * 7205759403792794 = 0.1000000000000000055511151231257827021181583404541015625
0.2 => 2^-55 * 7205759403792794 = 0.200000000000000011102230246251565404236316680908203125
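The a * 2^p form can be computed rather than worked out by hand. The sketch below assumes a positive, normal double; `toIntegerForm` is a hypothetical helper name, and BigInt is used so the 53-bit integer is shown exactly.

```javascript
// Write a positive normal double as a * 2^p with integer a and integer p.
function toIntegerForm(x) {
  const view = new DataView(new ArrayBuffer(8));
  view.setFloat64(0, x);
  const bits = view.getBigUint64(0);
  const biasedExponent = Number((bits >> 52n) & 0x7ffn);
  const fraction = bits & ((1n << 52n) - 1n);
  return {
    a: (1n << 52n) | fraction,     // restore the implied leading 1
    p: biasedExponent - 1023 - 52, // move the binary point 52 places right
  };
}

console.log(toIntegerForm(0.1)); // { a: 7205759403792794n, p: -56 }
console.log(toIntegerForm(0.2)); // { a: 7205759403792794n, p: -55 }
```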

To add two numbers, the exponent needs to be the same, i.e.:

0.1 => 2^-3 *  0.1100110011001100110011001100110011001100110011001101(0)
0.2 => 2^-3 *  1.1001100110011001100110011001100110011001100110011010
sum =  2^-3 * 10.0110011001100110011001100110011001100110011001100111
or
0.1 => 2^-55 * 3602879701896397  = 0.1000000000000000055511151231257827021181583404541015625
0.2 => 2^-55 * 7205759403792794  = 0.200000000000000011102230246251565404236316680908203125
sum =  2^-55 * 10808639105689191 = 0.3000000000000000166533453693773481063544750213623046875

Since the sum is not of the form 2ⁿ * 1.{bbb} we increase the exponent by one and shift the decimal (binary) point to get:

sum = 2^-2  * 1.0011001100110011001100110011001100110011001100110011(1)
    = 2^-54 * 5404319552844595.5 = 0.3000000000000000166533453693773481063544750213623046875

There are now 53 bits in the mantissa (the 53rd is in parentheses in the line above). The default rounding mode for IEEE 754 is 'Round to Nearest, ties to even' - i.e. if a number x falls exactly halfway between two values a and b, the value where the least significant bit is zero is chosen.

a = 2^-54 * 5404319552844595 = 0.299999999999999988897769753748434595763683319091796875
  = 2^-2  * 1.0011001100110011001100110011001100110011001100110011

x = 2^-2  * 1.0011001100110011001100110011001100110011001100110011(1)

b = 2^-2  * 1.0011001100110011001100110011001100110011001100110100
  = 2^-54 * 5404319552844596 = 0.3000000000000000444089209850062616169452667236328125

Note that a and b differ only in the last bit; ...0011 + 1 = ...0100. In this case, the value with the least significant bit of zero is b, so the sum is:

sum = 2^-2  * 1.0011001100110011001100110011001100110011001100110100
    = 2^-54 * 5404319552844596 = 0.3000000000000000444089209850062616169452667236328125

whereas the binary representation of 0.3 is:

0.3 => 2^-2  * 1.0011001100110011001100110011001100110011001100110011
    =  2^-54 * 5404319552844595 = 0.299999999999999988897769753748434595763683319091796875

which only differs from the binary representation of the sum of 0.1 and 0.2 by 2^-54.
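The addition and rounding steps above can be replayed with exact integer arithmetic. The sketch below uses BigInt for the significands quoted in the text and a hypothetical helper, `shiftRightTiesToEven`, to model round-to-nearest with ties to even; it models this one case and is not a general floating-point adder.

```javascript
// Significands from the text, both expressed in units of 2^-55.
const a01 = 3602879701896397n; // 0.1
const a02 = 7205759403792794n; // 0.2

// Shift right by `shift` bits, rounding to nearest with ties to even.
function shiftRightTiesToEven(n, shift) {
  if (shift === 0n) return n;
  const quotient = n >> shift;
  const remainder = n - (quotient << shift);
  const half = 1n << (shift - 1n);
  const roundUp =
    remainder > half || (remainder === half && (quotient & 1n) === 1n);
  return roundUp ? quotient + 1n : quotient;
}

// Exact sum: 54 bits, one more than a double's significand can hold.
const exactSum = a01 + a02;
const extraBits = BigInt(exactSum.toString(2).length - 53);

// Round back to 53 bits; the exponent rises from -55 to -54.
const rounded = shiftRightTiesToEven(exactSum, extraBits);

console.log(exactSum); // 10808639105689191n
console.log(rounded);  // 5404319552844596n, one more than 0.3's 5404319552844595n
console.log(Number(rounded) * 2 ** -54 === 0.1 + 0.2); // true
```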

The binary representations of 0.1 and 0.2 are the most accurate representations of the numbers allowable by IEEE 754. The addition of these representations, due to the default rounding mode, results in a value which differs from the representation of 0.3 by only one unit in the last place.

TL;DR

Writing 0.1 + 0.2 in an IEEE 754 binary representation (with colons separating the three parts) and comparing it to 0.3, this is (I've put the distinct bits in square brackets):

0.1 + 0.2 => 0:01111111101:0011001100110011001100110011001100110011001100110[100]
0.3       => 0:01111111101:0011001100110011001100110011001100110011001100110[011]

Converted back to decimal, these values are:

0.1 + 0.2 => 0.300000000000000044408920985006...
0.3       => 0.299999999999999988897769753748...

The difference is exactly 2^-54, which is ~5.5511151231258 × 10^-17 - insignificant (for many applications) when compared to the original values.

Comparing the last few bits of a floating point number is inherently dangerous, as anyone who reads the famous "What Every Computer Scientist Should Know About Floating-Point Arithmetic" (which covers all the major parts of this answer) will know.

Most calculators use additional guard digits to get around this problem, which is how 0.1 + 0.2 would give 0.3: the final few bits are rounded.
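Both points can be checked in JavaScript. The rounding to 15 significant digits below is only a loose stand-in for what a calculator might do with its guard digits; real calculators differ in how they round, so treat it as an assumption.

```javascript
// The two doubles are exactly one unit in the last place apart.
const diff = (0.1 + 0.2) - 0.3;
console.log(diff === 2 ** -54); // true

// Dropping the last few digits before display hides the discrepancy.
const displayed = Number((0.1 + 0.2).toPrecision(15));
console.log(displayed === 0.3); // true
```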


安静 2025-01-17 12:57:06

In addition to the other correct answers, you may want to consider scaling your values to avoid problems with floating-point arithmetic.

For example:

var result = 1.0 + 2.0;     // result === 3.0 returns true

... instead of:

var result = 0.1 + 0.2;     // result === 0.3 returns false

The expression 0.1 + 0.2 === 0.3 returns false in JavaScript, but fortunately integer arithmetic in floating-point is exact (as long as the values stay within ±2^53), so decimal representation errors can be avoided by scaling.

As a practical example, to avoid floating-point problems where accuracy is paramount, it is recommended¹ to handle money as an integer representing the number of cents: 2550 cents instead of 25.50 dollars.

¹ Douglas Crockford: JavaScript: The Good Parts: Appendix A - Awful Parts (page 105).
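A sketch of the cents approach follows. `toCents` and `formatCents` are hypothetical helpers; they assume non-negative amounts written with at most two decimal places, and they parse from strings so that no fractional float is involved.

```javascript
// Hypothetical helper: parse a "dollars.cents" string into integer cents.
function toCents(amount) {
  const [dollars, cents = ""] = amount.split(".");
  // Pad "5" to "50" and ignore anything beyond two decimal places.
  return Number(dollars) * 100 + Number(cents.padEnd(2, "0").slice(0, 2));
}

// Hypothetical helper: format integer cents back into a string.
function formatCents(cents) {
  const dollars = Math.floor(cents / 100);
  return `${dollars}.${String(cents % 100).padStart(2, "0")}`;
}

const totalCents = toCents("25.50") + toCents("0.10") + toCents("0.20");
console.log(totalCents);              // 2580
console.log(formatCents(totalCents)); // "25.80"
```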


我一向站在原地 2025-01-17 12:57:06

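A floating-point number stored in the computer consists of two parts: an integer and an exponent. The base is raised to the exponent and multiplied by the integer part.

If the computer worked in base 10, 0.1 would be 1 x 10⁻¹, 0.2 would be 2 x 10⁻¹, and 0.3 would be 3 x 10⁻¹. Integer math is easy and exact, so adding 0.1 + 0.2 would obviously result in 0.3.

Computers don't usually work in base 10, they work in base 2. You can still get exact results for some values, for example 0.5 is 1 x 2⁻¹ and 0.25 is 1 x 2⁻², and adding them results in 3 x 2⁻², or 0.75. Exactly.

The problem comes with numbers that can be represented exactly in base 10, but not in base 2. Those numbers need to be rounded to their closest equivalent. Assuming the very common IEEE 64-bit floating point format, the closest number to 0.1 is 3602879701896397 x 2⁻⁵⁵, and the closest number to 0.2 is 7205759403792794 x 2⁻⁵⁵. Their exact sum is 10808639105689191 x 2⁻⁵⁵, which needs 54 bits and is therefore itself rounded to 10808639105689192 x 2⁻⁵⁵, an exact decimal value of 0.3000000000000000444089209850062616169452667236328125. Floating point numbers are generally rounded for display.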
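The integer-times-power-of-two values quoted above can be checked with exact rational arithmetic. This is a small sketch in Python (an assumption, since the answer names no language); `Fraction(float)` converts a double to its exact value without rounding.

```python
from fractions import Fraction

a = Fraction(0.1)                 # exact value of the double nearest 0.1
b = Fraction(0.2)                 # exact value of the double nearest 0.2
exact_sum = a + b                 # exact rational sum, no rounding involved
stored_sum = Fraction(0.1 + 0.2)  # what the hardware addition actually returns

print(a == Fraction(3602879701896397, 2**55))             # True
print(b == Fraction(7205759403792794, 2**55))             # True
print(exact_sum == Fraction(10808639105689191, 2**55))    # True
print(exact_sum.numerator.bit_length())                   # 54, one bit too many for a double
print(stored_sum == Fraction(10808639105689192, 2**55))   # True
```

The exact sum needs a 54-bit integer, so the addition has to round it to a neighbouring 53-bit value, and that rounded value is the one that prints as 0.30000000000000004.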

花期渐远 2025-01-17 12:57:06

In short it's because:

Floating point numbers cannot represent all decimals precisely in binary

So just like 10/3 which does not exist in base 10 precisely (it will be 3.33... recurring), in the same way 1/10 doesn't exist in binary.

So what? How to deal with it? Is there any workaround?

To offer what I consider the best solution, here is the method I found:

parseFloat((0.1 + 0.2).toFixed(10)) => Will return 0.3

Let me explain why it's the best solution.
As others mentioned in the answers above, it's a good idea to use the built-in JavaScript toFixed() function to solve the problem. But you will most likely run into some problems.

Imagine you are going to add up two floats like 0.2 and 0.7. Here it is: 0.2 + 0.7 = 0.8999999999999999.

Your expected result was 0.9, which means you need a result with 1 digit of precision in this case.
So you should have used (0.2 + 0.7).toFixed(1),
but you can't just pass a fixed parameter to toFixed(), since it depends on the given numbers. For instance:

0.22 + 0.7 = 0.9199999999999999

In this example you need 2 digits of precision, so it should be toFixed(2). So what should the parameter be to fit every given float?

You might say let it be 10 in every situation then:

(0.2 + 0.7).toFixed(10) => Result will be "0.9000000000"

Damn! What are you going to do with those unwanted zeros after 9?
It's the time to convert it to float to make it as you desire:

parseFloat((0.2 + 0.7).toFixed(10)) => Result will be 0.9

Now that you found the solution, it's better to offer it as a function like this:

function floatify(number){
           return parseFloat((number).toFixed(10));
        }

Try it yourself:

function floatify(number){
       return parseFloat((number).toFixed(10));
    }
 
function addUp(){
  var number1 = +$("#number1").val();
  var number2 = +$("#number2").val();
  var unexpectedResult = number1 + number2;
  var expectedResult = floatify(number1 + number2);
  $("#unexpectedResult").text(unexpectedResult);
  $("#expectedResult").text(expectedResult);
}
addUp();

input{
  width: 50px;
}
#expectedResult{
color: green;
}
#unexpectedResult{
color: red;
}

<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<input id="number1" value="0.2" onclick="addUp()" onkeyup="addUp()"/> +
<input id="number2" value="0.7" onclick="addUp()" onkeyup="addUp()"/> =
<p>Expected Result: <span id="expectedResult"></span></p>
<p>Unexpected Result: <span id="unexpectedResult"></span></p>

You can use it this way:

var x = 0.2 + 0.7;
floatify(x);  => Result: 0.9

As W3SCHOOLS suggests there is another solution too, you can multiply and divide to solve the problem above:

var x = (0.2 * 10 + 0.1 * 10) / 10;       // x will be 0.3

Keep in mind that (0.2 + 0.1) * 10 / 10 won't work at all although it seems the same!
I prefer the first solution since I can apply it as a function which converts the input float to accurate output float.

FYI, the same problem exists for:

Multiplication: for instance 0.09 * 10 returns 0.8999999999999999. Apply the floatify function as a workaround: floatify(0.09 * 10) returns 0.9

Division: 0.3 / 0.1 = 2.9999999999999996 but floatify(0.3 / 0.1) returns 3

Subtraction: 1 - 0.8 = 0.19999999999999996 but floatify(1 - 0.8) returns 0.2
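For comparison, the same workaround can be sketched in Python, where the built-in `round(x, ndigits)` plays the role of `parseFloat(x.toFixed(n))`. The function name `floatify` is borrowed from the answer; the `digits` parameter is an assumption added for illustration.

```python
def floatify(number: float, digits: int = 10) -> float:
    """Round to a fixed number of decimal places and return a float again."""
    return round(number, digits)


print(floatify(0.2 + 0.7))      # 0.9
print(floatify(0.3 / 0.1))      # 3.0
print(floatify(1 - 0.8))        # 0.2
print(floatify(1e-11 + 2e-11))  # 0.0
```

The last line shows the cost of a fixed digit count: anything smaller than the chosen precision is rounded away entirely, so the trick only suits values whose scale is known in advance.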

薯片软お妹 2025-01-17 12:57:06

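Floating point rounding error. From What Every Computer Scientist Should Know About Floating-Point Arithmetic:

Squeezing infinitely many real numbers into a finite number of bits requires an approximate representation. Although there are infinitely many integers, in most programs the result of integer computations can be stored in 32 bits. In contrast, given any fixed number of bits, most calculations with real numbers will produce quantities that cannot be exactly represented using that many bits. Therefore the result of a floating-point calculation must often be rounded in order to fit back into its finite representation. This rounding error is the characteristic feature of floating-point computation.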

老子叫无熙 2025-01-17 12:57:06

My workaround:

function add(a, b, precision) {
    var x = Math.pow(10, precision || 2);
    return (Math.round(a * x) + Math.round(b * x)) / x;
}

precision refers to the number of digits you want to preserve after the decimal point during addition.
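A Python rendering of the same workaround, as a sketch only. Two differences from the JavaScript are assumptions worth flagging: `precision` defaults to 2 through a default argument (so an explicit 0 is honoured), and Python's `round` sends exact ties to the nearest even integer, whereas `Math.round` sends them toward positive infinity.

```python
def add(a: float, b: float, precision: int = 2) -> float:
    """Add two floats after scaling both to integers at the given precision."""
    x = 10 ** precision
    # Scaling and rounding turns each operand into an exact integer,
    # so the addition itself introduces no error.
    return (round(a * x) + round(b * x)) / x


print(add(0.1, 0.2))             # 0.3
print(add(0.1, 0.7))             # 0.8
print(add(0.125, 0.25, 3))       # 0.375
```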

南风起 2025-01-17 12:57:06

No, not broken, but most decimal fractions must be approximated

Summary

Floating point arithmetic is exact. Unfortunately, it doesn't match up well with our usual base-10 number representation, so it turns out we are often giving it input that is slightly off from what we wrote.

Even simple numbers like 0.01, 0.02, 0.03, 0.04 ... 0.24 are not representable exactly as binary fractions. If you count up 0.01, .02, .03 ..., not until you get to 0.25 will you get the first fraction representable in base₂. If you tried that using FP, your 0.01 would have been slightly off, so the only way to add 25 of them up to a nice exact 0.25 would have required a long chain of causality involving guard bits and rounding. It's hard to predict so we throw up our hands and say "FP is inexact", but that's not really true.

We constantly give the FP hardware something that seems simple in base 10 but is a repeating fraction in base 2.

How did this happen?

When we write in decimal, every fraction (specifically, every terminating decimal) is a rational number of the form

a / (2ⁿ × 5ᵐ)

In binary, we only get the 2ⁿ term, that is:

a / 2ⁿ

So in decimal, we can't represent ¹/₃. Because base 10 includes 2 as a prime factor, every number we can write as a binary fraction can also be written as a base 10 fraction. However, hardly anything we write as a base₁₀ fraction is representable in binary. In the range 0.01, 0.02, 0.03 ... 0.99, only three numbers can be represented in our FP format: 0.25, 0.50, and 0.75, because they are 1/4, 1/2, and 3/4, the only ones whose reduced denominators have no prime factor other than 2.

In base₁₀ we can't represent ¹/₃. But in binary, we can't do ¹/₁₀ or ¹/₃.

So while every binary fraction can be written in decimal, the reverse is not true. And in fact most decimal fractions repeat in binary.
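The claim about 0.01 through 0.99 is easy to check mechanically. A short sketch in Python (an assumption; the helper name `exactly_representable` is hypothetical): a decimal fraction is exactly representable when the double nearest to it has the very same rational value.

```python
from fractions import Fraction


def exactly_representable(numerator: int, denominator: int) -> bool:
    """True if numerator/denominator survives conversion to a double unchanged."""
    as_double = numerator / denominator  # correctly rounded to the nearest double
    return Fraction(as_double) == Fraction(numerator, denominator)


exact = [k for k in range(1, 100) if exactly_representable(k, 100)]
print(exact)  # [25, 50, 75]
```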

Dealing with it

Developers are usually instructed to do < epsilon comparisons. Better advice might be to round to integral values (in the C library: round() and roundf(), i.e., stay in the FP format) and then compare. Rounding to a specific decimal fraction length solves most problems with output.
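Both approaches mentioned here, tolerance comparison and round-then-compare, can be sketched in a few lines. Python is used as an assumption; `math.isclose` is the standard-library tolerance comparison, with a relative tolerance of 1e-09 by default.

```python
import math

a = 0.1 + 0.2

print(a == 0.3)                                 # False: exact comparison fails
print(math.isclose(a, 0.3))                     # True: equal within relative tolerance
print(round(a, 2) == round(0.3, 2))             # True: round first, then compare

# A purely relative tolerance never matches zero, so an absolute one is needed there.
print(math.isclose(1e-12, 0.0))                 # False
print(math.isclose(1e-12, 0.0, abs_tol=1e-9))   # True
```

Which tolerance is appropriate depends on the magnitude of the values being compared, which is why a single fixed epsilon is rarely the whole answer.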

Also, on real number-crunching problems (the problems that FP was invented for on early, frightfully expensive computers) the physical constants of the universe and all other measurements are only known to a relatively small number of significant figures, so the entire problem space was "inexact" anyway. FP "accuracy" isn't a problem in this kind of application.

The whole issue really arises when people try to use FP for bean counting. It does work for that, but only if you stick to integral values, which kind of defeats the point of using it. This is why we have all those decimal fraction software libraries.

I love the Pizza answer by Chris, because it describes the actual problem, not just the usual handwaving about "inaccuracy". If FP were simply "inaccurate", we could fix that and would have done it decades ago. The reason we haven't is because the FP format is compact and fast and it's the best way to crunch a lot of numbers. Also, it's a legacy from the space age and arms race and early attempts to solve big problems with very slow computers using small memory systems. (Sometimes, individual magnetic cores for 1-bit storage, but that's another story.)

Conclusion

If you are just counting beans at a bank, software solutions that use decimal string representations in the first place work perfectly well. But you can't do quantum chromodynamics or aerodynamics that way.

高跟鞋的旋律 2025-01-17 12:57:06

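Not all numbers can be represented via floats/doubles.
For example, the number "0.2" cannot be represented exactly in the IEEE 754 floating-point standard.

The model used to store real numbers under the hood represents a floating-point number as a sign, a significand, and an exponent applied to a radix (value = ± significand × radix^exponent).

Even though you can type 0.2 easily, FLT_RADIX and DBL_RADIX are 2, not 10, for a computer with an FPU that uses the "IEEE Standard for Binary Floating-Point Arithmetic (ISO/IEEE Std 754-1985)".

So it is a bit hard to represent such numbers exactly, even if you specify the variable explicitly without any intermediate calculation.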

桃扇骨 2025-01-17 12:57:06

Some statistics related to this famous double precision question.

When adding all values (a + b) using a step of 0.1 (from 0.1 to 100) we have ~15% chance of precision error. Note that the error could result in slightly bigger or smaller values.
Here are some examples:

0.1 + 0.2 = 0.30000000000000004 (BIGGER)
0.1 + 0.7 = 0.7999999999999999 (SMALLER)
...
1.7 + 1.9 = 3.5999999999999996 (SMALLER)
1.7 + 2.2 = 3.9000000000000004 (BIGGER)
...
3.2 + 3.6 = 6.800000000000001 (BIGGER)
3.2 + 4.4 = 7.6000000000000005 (BIGGER)

When subtracting all values (a - b where a > b) using a step of 0.1 (from 100 to 0.1) we have ~34% chance of precision error.
Here are some examples:

0.6 - 0.2 = 0.39999999999999997 (SMALLER)
0.5 - 0.4 = 0.09999999999999998 (SMALLER)
...
2.1 - 0.2 = 1.9000000000000001 (BIGGER)
2.0 - 1.9 = 0.10000000000000009 (BIGGER)
...
100 - 99.9 = 0.09999999999999432 (SMALLER)
100 - 99.8 = 0.20000000000000284 (BIGGER)

15% and 34% are indeed huge, so always use BigDecimal when precision is important. With 2 decimal digits (step 0.01) the situation worsens a bit more (18% and 36%).
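The answer does not say how the percentages were measured, so the following is only a sketch of one plausible method, written in Python with `Decimal` standing in for Java's BigDecimal. The names `classify` and `error_rate`, and the choice to count unordered pairs, are assumptions; the rate this prints is not claimed to reproduce the figures above.

```python
from decimal import Decimal


def classify(a: str, b: str) -> str:
    """Compare the double sum a + b with the double nearest the true decimal sum."""
    got = float(a) + float(b)
    want = float(Decimal(a) + Decimal(b))
    if got == want:
        return "EXACT"
    return "BIGGER" if got > want else "SMALLER"


def error_rate(max_tenths: int) -> float:
    """Fraction of pairs (i/10, j/10), 1 <= i <= j <= max_tenths, whose sum is off."""
    wrong = total = 0
    for i in range(1, max_tenths + 1):
        for j in range(i, max_tenths + 1):
            a = f"{i // 10}.{i % 10}"
            b = f"{j // 10}.{j % 10}"
            total += 1
            wrong += classify(a, b) != "EXACT"
    return wrong / total


print(classify("0.1", "0.2"))   # BIGGER
print(classify("0.1", "0.7"))   # SMALLER
print(classify("0.5", "0.25"))  # EXACT
print(error_rate(100))          # share of inexact sums for 0.1 .. 10.0
```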

无声无音无过去 2025-01-17 12:57:06

Some high level languages such as Python and Java come with tools to overcome binary floating point limitations. For example:

Python's decimal module and Java's BigDecimal class, that represent numbers internally with decimal notation (as opposed to binary notation). Both have limited precision, so they are still error prone, however they solve most common problems with binary floating point arithmetic.

Decimals are very nice when dealing with money: ten cents plus twenty cents are always exactly thirty cents:
```
  >>> 0.1 + 0.2 == 0.3
  False
  >>> Decimal('0.1') + Decimal('0.2') == Decimal('0.3')
  True
```
Python's decimal module is based on IEEE standard 854-1987.
Python's fractions module and Apache Commons' BigFraction class. Both represent rational numbers as (numerator, denominator) pairs and they may give more accurate results than decimal floating point arithmetic.

Neither of these solutions is perfect (especially if we look at performances, or if we require a very high precision), but still they solve a great number of problems with binary floating point arithmetic.
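To make the difference between the two families concrete, here is a short Python sketch. It pins the decimal precision to 28 digits (the module's usual default) inside a local context so the result does not depend on global settings; that choice is an assumption for the example.

```python
from decimal import Decimal, localcontext
from fractions import Fraction

# Rationals are exact: tenths and thirds both behave.
print(Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10))  # True
print(Fraction(1, 3) * 3 == 1)                               # True

# Decimals are exact for tenths but still have finite precision.
with localcontext() as ctx:
    ctx.prec = 28
    third = Decimal(1) / Decimal(3)
    print(third * 3 == Decimal(1))                           # False
    print(third * 3)                                         # 0.9999999999999999999999999999
```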

|煩躁 2025-01-17 12:57:06

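People always assume this is a computer problem, but if you count by hand (base 10), you can't get (1/3+1/3=2/3)=true unless you have infinite capacity to add 0.333... to 0.333... So, just as with the (1/10+2/10)!==3/10 problem in base 2, you truncate it to 0.333 + 0.333 = 0.666 and probably round it to 0.667, which is also technically inaccurate.

Count in ternary, though, and thirds are not a problem. Maybe some race with 15 fingers on each hand would ask why your decimal math is broken...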

风情万种。 2025-01-17 12:57:06

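Those weird numbers appear because computers use the binary (base 2) number system for calculations, while we use decimal (base 10).

Most fractional numbers cannot be represented precisely in binary, in decimal, or in both. The result is a rounded (but precisely defined) number.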

绝對不後悔。 2025-01-17 12:57:06

Did you try the duct tape solution?

Try to determine when errors occur and fix them with short if statements. It's not pretty, but for some problems it is the only solution and this is one of them.

 if( (n * 0.1) < 100.0 ) { return n * 0.1 - 0.000000000000001 ;}
                    else { return n * 0.1 + 0.000000000000001 ;}

I had the same problem in a scientific simulation project in C#, and I can tell you that if you ignore the butterfly effect, it's going to turn to a big fat dragon and bite you in the a**.

亂 2025-01-17 12:57:06

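Many duplicates of this question ask about the effects of floating-point rounding on specific numbers. In practice, it is easier to get a feeling for how it works by looking at the exact results of the calculations of interest than by just reading about it. Some languages provide ways of doing that, such as converting a float or double to BigDecimal in Java.

Since this is a language-agnostic question, it needs language-agnostic tools, such as a decimal to floating-point converter.

Applying it to the numbers in the question, treated as doubles:

0.1 converts to 0.1000000000000000055511151231257827021181583404541015625,

0.2 converts to 0.200000000000000011102230246251565404236316680908203125,

0.3 converts to 0.299999999999999988897769753748434595763683319091796875, and

0.30000000000000004 converts to 0.3000000000000000444089209850062616169452667236328125.

Adding the first two numbers manually, or in a decimal calculator such as a full-precision calculator, shows that the exact sum of the actual inputs is 0.3000000000000000166533453693773481063544750213623046875.

If it were rounded down to the equivalent of 0.3, the rounding error would be 0.0000000000000000277555756156289135105907917022705078125. Rounding up to the equivalent of 0.30000000000000004 also gives a rounding error of 0.0000000000000000277555756156289135105907917022705078125. The round-to-even tie breaker applies.

Returning to the floating-point converter, the raw hexadecimal for 0.30000000000000004 is 3fd3333333333334, which ends in an even digit and is therefore the correct result.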
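For readers without a converter at hand, the tie and the hexadecimal patterns can be reproduced with a few lines of Python (chosen here as an assumption; the answer itself is language-agnostic). `Fraction` gives exact values and `struct` exposes the raw IEEE 754 bytes.

```python
import struct
from fractions import Fraction

exact_sum = Fraction(0.1) + Fraction(0.2)  # exact sum of the two stored inputs
below = Fraction(0.3)                      # nearest double below the exact sum
above = Fraction(0.30000000000000004)      # nearest double above the exact sum

# The exact sum sits precisely halfway between the two candidates.
print(exact_sum - below == above - exact_sum)  # True
print(exact_sum - below == Fraction(1, 2**55)) # True

# Raw bit patterns: the tie goes to the candidate whose last bit is even.
print(struct.pack(">d", 0.3).hex())            # 3fd3333333333333
print(struct.pack(">d", 0.1 + 0.2).hex())      # 3fd3333333333334
```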

埖埖迣鎅 2025-01-17 12:57:06

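The kind of floating-point math that can be implemented in a digital computer necessarily uses an approximation of the real numbers and of the operations on them. (The standard version runs to over fifty pages of documentation and has a committee to deal with its errata and further refinement.)

This approximation is a mixture of approximations of different kinds, each of which can either be ignored or carefully accounted for, depending on its specific manner of deviating from exactness. It also involves a number of explicit exceptional cases at both the hardware and software levels that most people walk right past while pretending not to notice.

If you need infinite precision (using the number π, for example, instead of one of its many shorter stand-ins), you should write or use a symbolic math program instead.

But if you're okay with the idea that floating-point math is sometimes fuzzy in value and logic, that errors can accumulate quickly, and you can write your requirements and tests to allow for that, then your code can frequently get by with what's in your FPU.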

吻安 2025-01-17 12:57:06

Just for fun, I played with the representation of floats, following the definitions from the Standard C99 and I wrote the code below.

The code prints the binary representation of floats in 3 separated groups

SIGN EXPONENT FRACTION

and after that it prints a sum which, when evaluated with enough precision, shows the value that really exists in hardware.

So when you write float x = 999..., the compiler transforms that number into the bit representation printed by the function xx, chosen so that the sum printed by the function yy is as close as possible to the given number.

In reality, this sum is only an approximation. For the number 999,999,999, the compiler will insert the number 1,000,000,000 into the bit representation of the float.

After the code I attach a console session, in which I compute the sum of terms that really exists in hardware for the constants (-3.14 and 999999999), inserted there by the compiler.

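#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <limits.h>

/* Copy the float's bytes into a 32-bit integer. This avoids reading past
   the 4-byte object and the aliasing problems of casting the pointer.
   Assumes float is a 32-bit IEEE 754 single. */
static uint32_t
bits_of(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Print the 32 bits of *x, most significant first, with group labels. */
void
xx(float *x)
{
    uint32_t u = bits_of(*x);
    unsigned char i = sizeof(*x)*CHAR_BIT-1;
    do {
        switch (i) {
        case 31:
             printf("sign: ");
             break;
        case 30:
             printf("exponent: ");
             break;
        case 23:
             /* Bit 23 is the lowest exponent bit, so this label is
                printed one bit before the fraction actually starts. */
             printf("fraction: ");
             break;

        }
        int b = (u & ((uint32_t)1 << i)) != 0;
        printf("%d ", b);
    } while (i--);
    printf("\n");
}

/* Print the value as (1 + sum of set fraction bits) * 2^exponent.
   Only meaningful for normal numbers with a nonzero fraction. */
void
yy(float a)
{
    uint32_t u = bits_of(a);
    int sign = !(u & ((uint32_t)1 << 31));   /* nonzero means positive */
    uint32_t fraction = u & (((uint32_t)1 << 23) - 1);
    int exponent = (int)((u >> 23) & 255) - 127;

    printf(sign ? "positive" " ( 1+" : "negative" " ( 1+");
    uint32_t i = (uint32_t)1 << 22;
    unsigned int j = 1;
    do {
        int b = (fraction & i) != 0;
        /* Close the parenthesis after the last set bit. */
        if (b)
            printf("1/(%d) %c", 1 << j, (fraction & (i - 1)) ? '+' : ')');
    } while (j++, i >>= 1);

    printf("*2^%d", exponent);
    printf("\n");
}

int
main(void)
{
    float x = -3.14f;
    float y = 999999999;
    printf("%zu\n", sizeof(x));
    xx(&x);
    xx(&y);
    yy(x);
    yy(y);
    return 0;
}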

Here is a console session in which I compute the real value of the float that exists in hardware. I used bc to print the sum of terms outputted by the main program. One can insert that sum in python repl or something similar also.

-- .../terra1/stub
@ qemacs f.c
-- .../terra1/stub
@ gcc f.c
-- .../terra1/stub
@ ./a.out
sign: 1 exponent: 1 0 0 0 0 0 0 fraction: 0 1 0 0 1 0 0 0 1 1 1 1 0 1 0 1 1 1 0 0 0 0 1 1
sign: 0 exponent: 1 0 0 1 1 1 0 fraction: 0 1 1 0 1 1 1 0 0 1 1 0 1 0 1 1 0 0 1 0 1 0 0 0
negative ( 1+1/(2) +1/(16) +1/(256) +1/(512) +1/(1024) +1/(2048) +1/(8192) +1/(32768) +1/(65536) +1/(131072) +1/(4194304) +1/(8388608) )*2^1
positive ( 1+1/(2) +1/(4) +1/(16) +1/(32) +1/(64) +1/(512) +1/(1024) +1/(4096) +1/(16384) +1/(32768) +1/(262144) +1/(1048576) )*2^29
-- .../terra1/stub
@ bc
scale=15
( 1+1/(2) +1/(4) +1/(16) +1/(32) +1/(64) +1/(512) +1/(1024) +1/(4096) +1/(16384) +1/(32768) +1/(262144) +1/(1048576) )*2^29
999999999.999999446351872

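That's it. The sum printed by bc is

999999999.999999446351872

but the shortfall is an artifact of bc, not of the float: with scale=15 every division in the sum is truncated to 15 decimal places before the terms are added and multiplied by 2^29. Do not forget to set a scale factor in bc, and keep in mind that the value you obtain depends on the scale you set.

The displayed sum is what is inside the hardware. Evaluated mathematically, with unlimited precision, it is exactly 1,000,000,000, which is the value actually stored for the constant 999999999. You can check with bc in the same way that -3.14 is also perturbed.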
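As the text suggests, the sum can also be evaluated in a Python REPL. The sketch below goes one step further and decodes the bits directly, using exact rationals so that no scale setting is involved. The function name `decode_float32` is hypothetical, and the decoding assumes a normal (not zero, subnormal, infinite, or NaN) single-precision value.

```python
import struct
from fractions import Fraction


def decode_float32(value: float):
    """Return (raw bits, exact value) of `value` rounded to IEEE 754 single precision."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = -1 if bits >> 31 else 1
    exponent = ((bits >> 23) & 0xFF) - 127   # remove the exponent bias
    fraction = bits & ((1 << 23) - 1)        # low 23 bits
    exact = sign * (1 + Fraction(fraction, 1 << 23)) * Fraction(2) ** exponent
    return bits, exact


bits, exact = decode_float32(999999999.0)
print(hex(bits))  # 0x4e6e6b28
print(exact)      # 1000000000

bits, exact = decode_float32(-3.14)
print(exact == Fraction(-314, 100))  # False: -3.14 is stored only approximately
```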

岁月苍老的讽刺 2025-01-17 12:57:06

Floating point numbers are represented, at the hardware level, as fractions of binary numbers (base 2). For example, the decimal fraction:

0.125

has the value 1/10 + 2/100 + 5/1000 and, in the same way, the binary fraction:

0.001

has the value 0/2 + 0/4 + 1/8. These two fractions have the same value, the only difference is that the first is a decimal fraction, the second is a binary fraction.

Unfortunately, most decimal fractions cannot be represented exactly as binary fractions. Therefore, in general, the decimal floating point numbers you enter are only approximated by the binary fractions actually stored in the machine.

The problem is easier to approach in base 10. Take for example, the fraction 1/3. You can approximate it to a decimal fraction:

0.3

or better,

0.33

or better,

0.333

etc. No matter how many decimal places you write, the result is never exactly 1/3, but it is an estimate that always comes closer.

Likewise, no matter how many base 2 digits you use, the decimal value 0.1 cannot be represented exactly as a binary fraction. In base 2, 1/10 is the following repeating fraction:

0.0001100110011001100110011001100110011001100110011 ...

Stop at any finite amount of bits, and you'll get an approximation.

For Python, on a typical machine, 53 bits are used for the precision of a float, so the value stored when you enter the decimal 0.1 is the binary fraction

0.00011001100110011001100110011001100110011001100110011010

which is close, but not exactly equal, to 1/10.
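The repeating expansion can be generated by ordinary long division in base 2. A minimal sketch (the helper name `binary_digits` is hypothetical): double the remainder, emit the integer part as the next bit, and keep the rest.

```python
def binary_digits(numerator: int, denominator: int, count: int) -> str:
    """First `count` binary digits after the point of numerator/denominator (value in [0, 1))."""
    digits = []
    remainder = numerator
    for _ in range(count):
        remainder *= 2
        digits.append(str(remainder // denominator))  # next bit is 0 or 1
        remainder %= denominator
    return "".join(digits)


print("0." + binary_digits(1, 10, 24))  # 0.000110011001100110011001
print("0." + binary_digits(1, 8, 6))    # 0.001000
```

For 1/10 the remainder cycles through 2, 4, 8, 6 forever and never reaches zero, which is why the expansion cannot terminate; for 1/8 it reaches zero after three bits.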

It's easy to forget that the stored value is an approximation of the original decimal fraction, due to the way floats are displayed in the interpreter. Python only displays a decimal approximation of the value stored in binary. If Python were to output the true decimal value of the binary approximation stored for 0.1, it would output:

>>> 0.1
0.1000000000000000055511151231257827021181583404541015625

This is a lot more decimal places than most people would expect, so Python displays a rounded value to improve readability:

>>> 0.1
0.1

It is important to understand that this is an illusion: the stored value is not exactly 1/10; it is only the displayed value that has been rounded. This becomes evident as soon as you perform arithmetic operations with these values:

>>> 0.1 + 0.2
0.30000000000000004

This behavior is inherent to the very nature of the machine's floating-point representation: it is not a bug in Python, nor is it a bug in your code. You can observe the same type of behavior in all other languages that use hardware support for calculating floating point numbers (although some languages do not make the difference visible by default, or not in all display modes).

Another surprise follows from this one. For example, if you try to round the value 2.675 to two decimal places, you will get

>>> round (2.675, 2)
2.67

The documentation for the round() primitive (as of Python 2) indicates that ties are rounded away from zero. Since the decimal fraction 2.675 is exactly halfway between 2.67 and 2.68, you might expect to get (a binary approximation of) 2.68. This is not the case, however, because when the decimal fraction 2.675 is converted to a float, it is stored as an approximation whose exact value is:

2.67499999999999982236431605997495353221893310546875

Since the approximation is slightly closer to 2.67 than 2.68, the rounding is down.

If you are in a situation where the way halfway cases are rounded matters, you should use the decimal module. By the way, the decimal module also provides a convenient way to "see" the exact value stored for any float.

>>> from decimal import Decimal
>>> Decimal(2.675)
Decimal('2.67499999999999982236431605997495353221893310546875')
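A brief sketch of what using the decimal module for rounding might look like. The key assumption is that the value enters as a string, so it never passes through a binary float; `quantize` and `ROUND_HALF_UP` are part of the standard decimal module.

```python
from decimal import Decimal, ROUND_HALF_UP

# Binary float: 2.675 is stored slightly below the halfway point.
print(round(2.675, 2))  # 2.67

# Decimal built from a string: 2.675 is exact, so the tie rule decides.
exact = Decimal("2.675")
print(exact.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))  # 2.68

# Building the Decimal from the float instead preserves the binary error.
print(Decimal(2.675).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))  # 2.67
```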

Another consequence of the fact that 0.1 is not stored as exactly 1/10 is that adding 0.1 ten times does not give 1.0 either:

>>> sum = 0.0
>>> for i in range(10):
...     sum += 0.1
...
>>> sum
0.9999999999999999
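When the accumulated error of a long sum matters, Python's standard library offers `math.fsum`, which tracks the lost low-order bits and rounds only once at the end. A small sketch, using an explicit loop for the naive version so the result does not depend on how the built-in `sum` is implemented:

```python
import math

total = 0.0
for _ in range(10):
    total += 0.1  # each addition rounds, and the errors accumulate

print(total == 1.0)                  # False
print(math.fsum([0.1] * 10) == 1.0)  # True
```

Note that `fsum` does not make 0.1 exact; it returns the double closest to the true sum of the ten stored values, which happens to be 1.0.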

The arithmetic of binary floating point numbers holds many such surprises. The problem with "0.1" is explained in detail below, in the section "Representation errors". See The Perils of Floating Point for a more complete list of such surprises.

It is true that there is no simple answer, but do not be overly suspicious of floating-point numbers! Errors in Python floating-point operations are inherited from the underlying hardware, and on most machines are no more than 1 part in 2 ** 53 per operation. This is more than adequate for most tasks, but you should keep in mind that these are not decimal operations, and every operation on floating point numbers may suffer from a new rounding error.

Although pathological cases exist, for most common use cases you will get the expected result at the end by simply rounding to the number of decimal places you want on the display. For fine control over how floats are displayed, see String Formatting Syntax for the formatting specifications of the str.format() method.

Representation errors

This part of the answer explains in detail the example of "0.1" and shows how you can perform an exact analysis of this type of case on your own. We assume that you are familiar with the binary representation of floating point numbers. The term representation error means that most decimal fractions cannot be represented exactly in binary. This is the main reason why Python (or Perl, C, C++, Java, Fortran, and many others) usually doesn't display the exact result in decimal:

>>> 0.1 + 0.2
0.30000000000000004

Why? 1/10 and 2/10 are not exactly representable as binary fractions. Almost all machines today (July 2010) follow the IEEE-754 standard for floating-point arithmetic, and most platforms use "IEEE-754 double precision" to represent Python floats. IEEE-754 double precision carries 53 bits of precision, so on input the computer tries to convert 0.1 to the nearest fraction of the form J / 2 ** N, where J is an integer of exactly 53 bits. Rewrite:

1/10 ~= J / (2 ** N)

as:

J ~= 2 ** N / 10

Recalling that J has exactly 53 bits (so it is >= 2 ** 52 but < 2 ** 53), the best possible value for N is 56:

>>> 2 ** 52
4503599627370496
>>> 2 ** 53
9007199254740992
>>> 2 ** 56 // 10
7205759403792793

So 56 is the only possible value for N which leaves exactly 53 bits for J. The best possible value for J is therefore this quotient, rounded:

>>> q, r = divmod(2 ** 56, 10)
>>> r
6

Since the remainder is more than half of 10, the best approximation is obtained by rounding up:

>>> q + 1
7205759403792794

Therefore the best possible approximation to 1/10 in "IEEE-754 double precision" is that value over 2 ** 56, that is:

7205759403792794/72057594037927936

Note that since the rounding was done upward, the result is actually slightly greater than 1/10; if we hadn't rounded up, the quotient would have been slightly less than 1/10. But in no case is it exactly 1/10!
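
The derivation above can be sketched as a small function. This is an illustration under stated assumptions, not how Python itself converts literals: nearest_53_bit_fraction is a hypothetical helper, it only handles values strictly between 0 and 1, and it ignores subnormal numbers.

```python
from fractions import Fraction

def nearest_53_bit_fraction(num: int, den: int) -> tuple:
    """Return (J, N) so that J / 2**N is the 53-bit fraction nearest num/den.

    Assumes 0 < num/den < 1 and ignores subnormals (illustrative only).
    """
    n = 0
    # Grow N until the quotient has 53 bits, i.e. is at least 2**52.
    while (num << n) // den < 2 ** 52:
        n += 1
    q, r = divmod(num << n, den)
    # Round to nearest, ties to even, the IEEE 754 default rounding mode.
    if 2 * r > den or (2 * r == den and q % 2 == 1):
        q += 1
    return q, n

j, n = nearest_53_bit_fraction(1, 10)
print(j, n)
print(Fraction(j, 2 ** n) == Fraction(0.1))
```

Comparing against Fraction(0.1), which holds the exact value of the stored double, confirms that the hand computation and the machine agree.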

So the computer never "sees" 1/10: what it sees is the exact fraction given above, the best approximation that an "IEEE-754" double-precision floating-point number can provide:

>>> .1 * 2 ** 56
7205759403792794.0

If we multiply this fraction by 10 ** 30, we can see the values of its 30 most significant decimal digits:

>>> 7205759403792794 * 10 ** 30 // 2 ** 56
100000000000000005551115123125

meaning that the exact value stored in the computer is approximately equal to the decimal value 0.100000000000000005551115123125. In versions prior to Python 2.7 and Python 3.1, Python rounded this value to 17 significant digits, displaying "0.10000000000000001". In current versions of Python, the displayed value is the shortest decimal fraction that converts back to exactly the same binary representation, so it simply displays "0.1".
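
The same digit extraction works for any float by starting from float.as_integer_ratio(). The helper name leading_decimal_digits is an assumption made for this sketch, which is meant for values between 0 and 1:

```python
def leading_decimal_digits(x: float, places: int) -> int:
    """First `places` decimal digits after the point of the exact value of x."""
    num, den = x.as_integer_ratio()   # exact fraction stored in the double
    return num * 10 ** places // den  # integer arithmetic, so no new rounding

digits_of_tenth = leading_decimal_digits(0.1, 30)
old_style = format(0.1, ".17g")       # the 17-significant-digit display
new_style = repr(0.1)                 # the shortest round-tripping display

print(digits_of_tenth)
print(old_style, new_style)
```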

韬韬不绝 2025-01-17 12:57:06

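The trap with floating-point numbers is that they look like decimal, but they work in binary.

The only prime factor of 2 is 2, while 10 has the prime factors 2 and 5. As a result, every number that can be written exactly as a binary fraction can also be written exactly as a decimal fraction, but only a subset of the numbers that can be written as decimal fractions can also be written as binary fractions.

A floating-point number is essentially a binary fraction with a limited number of significant digits. If you go past those significant digits, the result is rounded.

When you type a literal in your code, or call a function to parse a string into a floating-point number, it expects a decimal number and stores a binary approximation of that decimal number in the variable.

When you print a floating-point number, or call a function to convert one to a string, it prints a decimal approximation of the floating-point number. It is possible to convert a binary number to decimal exactly, but no language I know of does that by default when converting to a string*. Some languages use a fixed number of significant digits; others use the shortest string that will "round-trip" back to the same floating-point value.

* Python does convert exactly when converting a floating-point number to a "decimal.Decimal". This is the easiest way I know of to obtain the exact decimal equivalent of a floating-point number.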
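
A brief sketch of the three steps just described (parse, print, exact conversion), with variable names chosen only for this illustration:

```python
from decimal import Decimal

x = float("0.1")           # parsing stores the nearest binary fraction
shortest = repr(x)         # shortest decimal string that round-trips
round_trips = float(shortest) == x
exact = Decimal(x)         # exact decimal expansion of the stored value

print(shortest)
print(round_trips)
print(exact)
```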


_畞蕅 2025-01-17 12:57:06

Since Python 3.5, you have been able to use the math.isclose() function for testing approximate equality:

>>> import math
>>> math.isclose(0.1 + 0.2, 0.3)
True
>>> 0.1 + 0.2 == 0.3
False
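
One caveat worth a quick sketch: by default math.isclose() uses only a relative tolerance (rel_tol=1e-09, abs_tol=0.0), so comparisons against zero need an explicit abs_tol. The specific tolerance values below are arbitrary choices for illustration.

```python
import math

default_ok = math.isclose(0.1 + 0.2, 0.3)               # relative tolerance is enough here
near_zero = math.isclose(1e-10, 0.0)                    # relative tolerance shrinks to nothing at 0
near_zero_abs = math.isclose(1e-10, 0.0, abs_tol=1e-9)  # explicit absolute tolerance
loose = math.isclose(100.0, 100.5, rel_tol=0.01)        # within 1% of the larger value

print(default_ok, near_zero, near_zero_abs, loose)
```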


伊面 2025-01-17 12:57:06

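Another way to look at this: 64 bits are used to represent numbers. As a consequence, no more than 2**64 = 18,446,744,073,709,551,616 different numbers can be represented exactly.

However, math says there are already infinitely many decimals between 0 and 1. IEEE 754 defines an encoding that uses these 64 bits efficiently for a much larger number space, plus NaN and +/- Infinity, so there are gaps between the exactly represented numbers, and the numbers that fall in those gaps are only approximated.

Unfortunately, 0.3 sits in a gap.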
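
The gap around 0.3 can be inspected directly. This sketch assumes Python 3.9 or later, where math.nextafter() and math.ulp() are available, and IEEE-754 doubles:

```python
import math

stored = 0.3                          # the double nearest to 3/10
below = math.nextafter(stored, 0.0)   # closest double below it
above = math.nextafter(stored, 1.0)   # closest double above it
gap = math.ulp(stored)                # spacing of doubles at this magnitude

print(below, stored, above)
print(gap)
print(above == 0.1 + 0.2)
```

The last line shows that 0.1 + 0.2 lands on the neighbouring double just above the one chosen for the literal 0.3.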


谁的新欢旧爱 2025-01-17 12:57:06

Imagine working in base ten with, say, 8 digits of accuracy. You check whether

1/3 + 2/3 == 1

and learn that this returns false. Why? Well, as real numbers we have

1/3 = 0.333.... and 2/3 = 0.666....

Truncating at eight decimal places, we get

0.33333333 + 0.66666666 = 0.99999999

which is, of course, different from 1.00000000 by exactly 0.00000001.

The situation for binary numbers with a fixed number of bits is exactly analogous. As real numbers, we have

1/10 = 0.0001100110011001100... (base 2)

and

1/5 = 0.0011001100110011001... (base 2)

If we truncated these to, say, seven bits, then we'd get

0.0001100 + 0.0011001 = 0.0100101

while on the other hand,

3/10 = 0.01001100110011... (base 2)

which, truncated to seven bits, is 0.0100110, and these differ by exactly 0.0000001.
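
Both truncation experiments can be reproduced with exact rational arithmetic. The truncate helper below is hypothetical and exists only to mirror the hand calculation; real hardware rounds rather than truncates.

```python
from fractions import Fraction
from math import floor

def truncate(x: Fraction, base: int, digits: int) -> Fraction:
    """Keep only the first `digits` fractional digits of x in the given base."""
    scale = base ** digits
    return Fraction(floor(x * scale), scale)

# Base ten, eight digits: 1/3 + 2/3 falls short of 1.
decimal_sum = truncate(Fraction(1, 3), 10, 8) + truncate(Fraction(2, 3), 10, 8)
decimal_gap = 1 - decimal_sum

# Base two, seven bits: 1/10 + 1/5 falls short of 3/10.
binary_sum = truncate(Fraction(1, 10), 2, 7) + truncate(Fraction(1, 5), 2, 7)
binary_gap = truncate(Fraction(3, 10), 2, 7) - binary_sum

print(decimal_sum, decimal_gap)
print(binary_sum, binary_gap)
```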

The exact situation is slightly more subtle because these numbers are typically stored in scientific notation. So, for instance, instead of storing 1/10 as 0.0001100 we may store it as something like 1.10011 * 2^-4, depending on how many bits we've allocated for the exponent and the mantissa. This affects how many digits of precision you get for your calculations.

The upshot is that because of these rounding errors you essentially never want to use == on floating-point numbers. Instead, you can check if the absolute value of their difference is smaller than some fixed small number.


说不完的你爱 2025-01-17 12:57:06

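It's actually pretty simple. When you have a base-10 system (like ours), it can only express fractions that use a prime factor of the base. The prime factors of 10 are 2 and 5. So 1/2, 1/4, 1/5, 1/8, and 1/10 can all be expressed cleanly because the denominators all use prime factors of 10. In contrast, 1/3, 1/6, and 1/7 are all repeating decimals because their denominators use a prime factor of 3 or 7. In binary (or base-2), the only prime factor is 2, so you can only cleanly express fractions whose denominator contains only 2 as a prime factor. In binary, 1/2, 1/4, and 1/8 can all be expressed cleanly, while 1/5 or 1/10 would be repeating fractions. So 0.1 and 0.2 (1/10 and 1/5), while clean decimals in a base-10 system, are repeating fractions in the base-2 system the computer operates in. When you do math on these repeating fractions, you end up with leftovers, which carry over when you convert the computer's base-2 (binary) number into a more human-readable base-10 number.

From https://0.30000000000000004.com/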
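
The prime-factor rule can be turned into a short check. The function name terminates is an assumption for this sketch: it strips from the denominator every factor it shares with the base and reports whether anything is left over.

```python
from fractions import Fraction
from math import gcd

def terminates(x: Fraction, base: int) -> bool:
    """True if x has a finite expansion in the given base."""
    den = x.denominator            # Fraction keeps this in lowest terms
    shared = gcd(den, base)
    while shared > 1:
        den //= shared             # remove factors that also divide the base
        shared = gcd(den, base)
    return den == 1                # leftover factors mean a repeating expansion

for d in (2, 3, 4, 5, 6, 7, 8, 10):
    frac = Fraction(1, d)
    print(frac, terminates(frac, 10), terminates(frac, 2))
```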


花之痕靓丽 2025-01-17 12:57:06

Decimal numbers such as 0.1, 0.2, and 0.3 are not represented exactly in binary encoded floating point types. The sum of the approximations for 0.1 and 0.2 differs from the approximation used for 0.3, hence the falsehood of 0.1 + 0.2 == 0.3 as can be seen more clearly here:

#include <stdio.h>

int main() {
    printf("0.1 + 0.2 == 0.3 is %s\n", 0.1 + 0.2 == 0.3 ? "true" : "false");
    printf("0.1 is %.23f\n", 0.1);
    printf("0.2 is %.23f\n", 0.2);
    printf("0.1 + 0.2 is %.23f\n", 0.1 + 0.2);
    printf("0.3 is %.23f\n", 0.3);
    printf("0.3 - (0.1 + 0.2) is %g\n", 0.3 - (0.1 + 0.2));
    return 0;
}

Output:

0.1 + 0.2 == 0.3 is false
0.1 is 0.10000000000000000555112
0.2 is 0.20000000000000001110223
0.1 + 0.2 is 0.30000000000000004440892
0.3 is 0.29999999999999998889777
0.3 - (0.1 + 0.2) is -5.55112e-17

For these computations to be evaluated more reliably, you would need to use a decimal-based representation for floating-point values. The C Standard does not specify such types by default, but as an extension described in a Technical Report.

The _Decimal32, _Decimal64 and _Decimal128 types might be available on your system (for example, GCC supports them on selected targets, but Clang does not support them on OS X).
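
Where those C types are unavailable, the same idea can be tried out in another language. The sketch below uses Python's decimal module as a stand-in for a decimal floating-point type; it is not equivalent to _Decimal64 in precision or performance.

```python
from decimal import Decimal

binary_equal = (0.1 + 0.2 == 0.3)

# Decimal values built from strings hold 0.1, 0.2 and 0.3 exactly.
decimal_equal = (Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))

# Decimal arithmetic is still finite: 1/3 has no exact base-10 form.
third = Decimal(1) / Decimal(3)
thirds_equal = (third * 3 == Decimal(1))

print(binary_equal, decimal_equal, thirds_equal)
```

A decimal type removes the surprise for decimal inputs, but it does not make arithmetic exact in general.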


和我恋爱吧 2025-01-17 12:57:06

Normal arithmetic is base-10, so decimals represent tenths, hundredths, etc. When you try to represent a floating-point number in binary (base-2) arithmetic, you are dealing with halves, fourths, eighths, etc.

In the hardware, floating-point numbers are stored as a mantissa and an exponent. The mantissa represents the significant digits. The exponent works as in scientific notation, but it uses a base of 2 instead of 10. For example, 64.0 would be represented with a mantissa of 1 and an exponent of 6, and 0.125 with a mantissa of 1 and an exponent of -3.
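
The mantissa and exponent of a given value can be inspected with math.frexp(). The wrapper mantissa_and_exponent is a hypothetical helper that rescales the result to the "1.xxx times a power of two" convention used in the examples above:

```python
import math

def mantissa_and_exponent(x: float):
    """Split positive x into (m, e) with x == m * 2**e and 1 <= m < 2."""
    m, e = math.frexp(x)   # frexp returns 0.5 <= m < 1
    return m * 2, e - 1    # rescale so the mantissa starts with 1

for value in (64.0, 0.125, 0.1):
    print(value, mantissa_and_exponent(value))
```

For 0.1 the mantissa comes out as 1.6 with exponent -4, that is, a value that is itself only an approximation.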

Fractional values have to be built up as sums of negative powers of 2:

0.1b = 0.5d
0.01b = 0.25d
0.001b = 0.125d
0.0001b = 0.0625d
0.00001b = 0.03125d

and so on.
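
To see which negative powers of 2 a given fraction needs, the binary digits can be generated one at a time by repeated doubling. The helper name binary_fraction_digits is made up for this sketch, which expects a value between 0 and 1:

```python
from fractions import Fraction

def binary_fraction_digits(x: Fraction, count: int) -> str:
    """First `count` binary digits after the point of x, for 0 <= x < 1."""
    digits = []
    for _ in range(count):
        x *= 2                 # shift the next binary digit in front of the point
        if x >= 1:
            digits.append("1")
            x -= 1
        else:
            digits.append("0")
    return "".join(digits)

eighth_bits = binary_fraction_digits(Fraction(1, 8), 5)
tenth_bits = binary_fraction_digits(Fraction(1, 10), 20)

print("0." + eighth_bits)
print("0." + tenth_bits)
```

1/8 stops after three digits, while 1/10 keeps repeating the pattern 0011 however many digits are requested.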

It is common to use an error delta instead of equality operators when dealing with floating-point arithmetic. Instead of

if(a==b) ...

you would use

delta = 0.0001; // or some arbitrarily small amount
if(a - b > -delta && a - b < delta) ...
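
A fixed delta only suits values of a known magnitude. The sketch below contrasts it with a tolerance scaled to the operands; both function names and the tolerance values are illustrative assumptions, not part of the original answer.

```python
def within_delta(a: float, b: float, delta: float = 0.0001) -> bool:
    """Absolute comparison, as in the snippet above."""
    return abs(a - b) < delta

def within_relative(a: float, b: float, rel: float = 1e-9) -> bool:
    """Tolerance scaled by the larger magnitude of the two operands."""
    return abs(a - b) <= rel * max(abs(a), abs(b))

typical = within_delta(0.1 + 0.2, 0.3)             # fine at this scale
tiny_abs = within_delta(0.00001, 0.00002)          # "equal" despite a factor of two
tiny_rel = within_relative(0.00001, 0.00002)
huge_abs = within_delta(1e15, 1e15 + 0.125)        # adjacent doubles, yet "different"
huge_rel = within_relative(1e15, 1e15 + 0.125)

print(typical, tiny_abs, tiny_rel, huge_abs, huge_rel)
```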


真心难拥有 2025-01-17 12:57:06

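There are projects working on fixing the issues with floating-point implementations.

Take a look at Unum & Posit, for example, which showcases a number type called posit (and its predecessor, unum) that promises better accuracy with fewer bits. If my understanding is correct, it also fixes the kind of problem in the question. It is quite an interesting project; the person behind it is a mathematician, Dr. John Gustafson.

The whole thing is open source, with many actual implementations in C/C++, Python, Julia and C# (https://hastlayer.com/arithmetics).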
