夏雨凉 2025-02-11 12:11:08

Binary floating point math works like this. In most programming languages, it is based on the IEEE 754 standard. The crux of the problem is that numbers are represented in this format as a whole number times a power of two; rational numbers (such as 0.1, which is 1/10) whose denominator is not a power of two cannot be exactly represented.

For 0.1 in the standard binary64 format, the representation can be written exactly as

0.1000000000000000055511151231257827021181583404541015625 in decimal, or
0x1.999999999999ap-4 in C99 hexfloat notation.

In contrast, the rational number 0.1, which is 1/10, can be written exactly as

0.1 in decimal, or
0x1.99999999999999...p-4 in an analog of C99 hexfloat notation, where the ... represents an unending sequence of 9's.

The constants 0.2 and 0.3 in your program will also be approximations to their true values. It happens that the closest double to 0.2 is larger than the rational number 0.2 but that the closest double to 0.3 is smaller than the rational number 0.3. The sum of 0.1 and 0.2 winds up being larger than the rational number 0.3 and hence disagreeing with the constant in your code.
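
The claims above are easy to check. What follows is a minimal JavaScript sketch (JavaScript numbers are binary64 doubles). It assumes that toFixed, asked for enough digits, prints the exact decimal expansion of the stored value; the helper name exactDecimal is made up for illustration.

```javascript
// Print the exact decimal expansion of the double nearest to a literal.
// 55 fractional digits are enough for 0.1; toFixed accepts 0 to 100 digits.
function exactDecimal(x, digits = 55) {
  return x.toFixed(digits);
}

console.log(exactDecimal(0.1)); // the double nearest 0.1 lies slightly above 1/10
console.log(exactDecimal(0.2)); // likewise slightly above 2/10
console.log(exactDecimal(0.3)); // the double nearest 0.3 lies slightly below 3/10

console.log(0.1 + 0.2 === 0.3); // false
console.log(0.1 + 0.2 > 0.3);   // true
```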

A fairly comprehensive treatment of floating-point arithmetic issues is What Every Computer Scientist Should Know About Floating-Point Arithmetic. For an easier-to-digest explanation, see floating-point-gui.de.

Side Note: All positional (base-N) number systems share this problem with precision

Plain old decimal (base 10) numbers have the same issues, which is why numbers like 1/3 end up as 0.333333333...

You've just stumbled on a number (3/10) that happens to be easy to represent with the decimal system but doesn't fit the binary system. It goes both ways (to some small degree) as well: 1/16 is an ugly number in decimal (0.0625), but in binary it looks as neat as a 10,000th does in decimal (0.0001) - if we were in the habit of using a base-2 number system in our daily lives, you'd even look at that number and instinctively understand you could arrive there by halving something, halving it again, and again and again.

Of course, that's not exactly how floating-point numbers are stored in memory (they use a form of scientific notation). However, it does illustrate the point that binary floating-point precision errors tend to crop up because the "real world" numbers we are usually interested in working with are so often powers of ten - but only because we use a decimal number system day-to-day. This is also why we'll say things like 71% instead of "5 out of every 7" (71% is an approximation since 5/7 can't be represented exactly with any decimal number).

So, no: binary floating point numbers are not broken, they just happen to be as imperfect as every other base-N number system :)

Side Note: Working with Floats in Programming

In practice, this problem of precision means you need to use rounding functions to round your floating point numbers off to however many decimal places you're interested in before you display them.

You also need to replace equality tests with comparisons that allow some amount of tolerance, which means:

Do not do if (x == y) { ... }

Instead do if (abs(x - y) < myToleranceValue) { ... }.

where abs is the absolute value. myToleranceValue needs to be chosen for your particular application - and it will have a lot to do with how much "wiggle room" you are prepared to allow, and what the largest number you are going to be comparing may be (due to loss of precision issues). Beware of "epsilon" style constants in your language of choice. These can be used as tolerance values but their effectiveness depends on the magnitude (size) of the numbers you're working with, since calculations with large numbers may exceed the epsilon threshold.
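
As a rough sketch of the tolerance comparison described above, the JavaScript below combines a relative and an absolute tolerance. The function name nearlyEqual and both default tolerances are assumptions chosen for illustration; pick values that suit your own application.

```javascript
// Compare two doubles with a tolerance instead of ==.
// relTol scales with the magnitude of the inputs; absTol covers values near zero.
// Both defaults are illustrative assumptions, not recommendations.
function nearlyEqual(x, y, relTol = 1e-9, absTol = 1e-12) {
  const diff = Math.abs(x - y);
  const scale = Math.max(Math.abs(x), Math.abs(y));
  return diff <= Math.max(absTol, relTol * scale);
}

console.log(0.1 + 0.2 === 0.3);           // false
console.log(nearlyEqual(0.1 + 0.2, 0.3)); // true

// A fixed epsilon stops working as magnitudes grow: adjacent doubles near
// 1e16 are 2 apart, far more than Number.EPSILON (about 2.2e-16).
const big = 1e16;
console.log(Math.abs((big + 2) - big) < Number.EPSILON); // false
console.log(nearlyEqual(big + 2, big));                  // true
```

Scaling the tolerance by the larger operand is one common design choice; it keeps the test meaningful for both small and large magnitudes, which a bare "epsilon" constant does not.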

栖竹 2025-02-11 12:11:08

A Hardware Designer's Perspective

I believe I should add a hardware designer’s perspective to this since I design and build floating point hardware. Knowing the origin of the error may help in understanding what is happening in the software, and ultimately, I hope this helps explain the reasons for why floating point errors happen and seem to accumulate over time.

1. Overview

From an engineering perspective, most floating point operations will have some element of error since the hardware that does the floating point computations is only required to have an error of less than one half of one unit in the last place. Therefore, much hardware will stop at a precision that's only necessary to yield an error of less than one half of one unit in the last place for a single operation which is especially problematic in floating point division. What constitutes a single operation depends upon how many operands the unit takes. For most, it is two, but some units take 3 or more operands. Because of this, there is no guarantee that repeated operations will result in a desirable error since the errors add up over time.

2. Standards

Most processors follow the IEEE-754 standard, but some use denormalized or different standards. For example, there is a denormalized mode in IEEE-754 which allows representation of very small floating point numbers at the expense of precision. The following, however, will cover the normalized mode of IEEE-754 which is the typical mode of operation.

In the IEEE-754 standard, hardware designers are allowed any value of error/epsilon as long as it's less than one half of one unit in the last place, and the result only has to be less than one half of one unit in the last place for one operation. This explains why when there are repeated operations, the errors add up. For IEEE-754 double precision, this is the 54th bit, since 53 bits are used to represent the numeric part (normalized), also called the mantissa, of the floating point number (e.g. the 5.3 in 5.3e5). The next sections go into more detail on the causes of hardware error on various floating point operations.

3. Cause of Rounding Error in Division

The main cause of the error in floating point division is the division algorithms used to calculate the quotient. Most computer systems calculate division using multiplication by an inverse, mainly in Z=X/Y, Z = X * (1/Y). A division is computed iteratively i.e. each cycle computes some bits of the quotient until the desired precision is reached, which for IEEE-754 is anything with an error of less than one unit in the last place. The table of reciprocals of Y (1/Y) is known as the quotient selection table (QST) in the slow division, and the size in bits of the quotient selection table is usually the width of the radix, or a number of bits of the quotient computed in each iteration, plus a few guard bits. For the IEEE-754 standard, double precision (64-bit), it would be the size of the radix of the divider, plus a few guard bits k, where k>=2. So for example, a typical Quotient Selection Table for a divider that computes 2 bits of the quotient at a time (radix 4) would be 2+2= 4 bits (plus a few optional bits).

3.1 Division Rounding Error: Approximation of Reciprocal

What reciprocals are in the quotient selection table depend on the division method: slow division such as SRT division, or fast division such as Goldschmidt division; each entry is modified according to the division algorithm in an attempt to yield the lowest possible error. In any case, though, all reciprocals are approximations of the actual reciprocal and introduce some element of error. Both slow division and fast division methods calculate the quotient iteratively, i.e. some number of bits of the quotient are calculated each step, then the result is subtracted from the dividend, and the divider repeats the steps until the error is less than one half of one unit in the last place. Slow division methods calculate a fixed number of digits of the quotient in each step and are usually less expensive to build, and fast division methods calculate a variable number of digits per step and are usually more expensive to build. The most important part of the division methods is that most of them rely upon repeated multiplication by an approximation of a reciprocal, so they are prone to error.
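
To make "repeated multiplication by an approximation" concrete, here is a software sketch of the Goldschmidt iteration in JavaScript. It is only an illustration of the idea, not a model of any real divider: hardware uses table lookups and fixed-point datapaths, whereas this version uses ordinary doubles. The function name, the iteration count and the restriction to positive finite operands are all assumptions of the sketch.

```javascript
// Goldschmidt-style division sketch for positive finite inputs.
// Numerator and denominator are multiplied by the same factor F = 2 - D
// until D is (nearly) 1, at which point N is (nearly) the quotient.
function goldschmidtDivide(n, d, iterations = 8) {
  if (!(n > 0 && d > 0 && Number.isFinite(n) && Number.isFinite(d))) {
    throw new RangeError("sketch handles positive finite operands only");
  }
  // Scale D into [0.5, 1) by powers of two; these scalings are exact
  // barring overflow or underflow, which this sketch ignores.
  while (d >= 1) { d /= 2; n /= 2; }
  while (d < 0.5) { d *= 2; n *= 2; }

  for (let i = 0; i < iterations; i++) {
    const f = 2 - d; // correction factor
    n *= f;          // each multiplication is rounded
    d *= f;          // d converges quadratically towards 1
  }
  return n;
}

console.log(goldschmidtDivide(1, 3), 1 / 3);
console.log(goldschmidtDivide(10, 4), 10 / 4);
```

Because every multiplication is rounded, the result may differ from the correctly rounded quotient in its last bits, which is the kind of error the section describes.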

4. Rounding Errors in Other Operations: Truncation

Another cause of the rounding errors in all operations is the different modes of truncation of the final answer that IEEE-754 allows. There's truncate, round-towards-zero, round-to-nearest (default), round-down, and round-up. All methods introduce an element of error of less than one unit in the last place for a single operation. Over time and repeated operations, truncation also adds cumulatively to the resultant error. This truncation error is especially problematic in exponentiation, which involves some form of repeated multiplication.

5. Repeated Operations

Since the hardware that does the floating point calculations only needs to yield a result with an error of less than one half of one unit in the last place for a single operation, the error will grow over repeated operations if not watched. This is why, in computations that require a bounded error, mathematicians use methods such as rounding to the nearest even digit in the last place (over time, the errors are more likely to cancel each other out) and Interval Arithmetic combined with variations of the IEEE 754 rounding modes to predict rounding errors and correct them. Because of its low relative error compared to other rounding modes, round-to-nearest-even (in the last place) is the default rounding mode of IEEE-754.

Note that the default rounding mode, round-to-nearest even digit in the last place, guarantees an error of less than one half of one unit in the last place for one operation. Using the truncation, round-up, and round down alone may result in an error that is greater than one half of one unit in the last place, but less than one unit in the last place, so these modes are not recommended unless they are used in Interval Arithmetic.
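
A small JavaScript illustration of errors building up over repeated operations, under the default rounding mode. The helper name repeatedSum is hypothetical.

```javascript
// Add `step` to an accumulator `count` times; every addition is rounded.
function repeatedSum(step, count) {
  let total = 0;
  for (let i = 0; i < count; i++) {
    total += step;
  }
  return total;
}

const tenTenths = repeatedSum(0.1, 10);
console.log(tenTenths === 1);         // false
console.log(Math.abs(tenTenths - 1)); // tiny, but not zero
console.log(0.1 * 10 === 1);          // true: a single rounded operation
```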

6. Summary

In short, the fundamental reason for the errors in floating point operations is a combination of the truncation in hardware, and the truncation of a reciprocal in the case of division. Since the IEEE-754 standard only requires an error of less than one half of one unit in the last place for a single operation, the floating point errors over repeated operations will add up unless corrected.

挽手叙旧 2025-02-11 12:11:08

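Floating point notation is broken in exactly the same way as the decimal (base-10) notation you learned in grade school and use every day, just for base 2.

To understand, think about representing 2/3 as a decimal value. It's impossible to do exactly! The world will end before you finish writing the 6's after the decimal point, so instead we write to some number of places, round the last one to a 7, and consider it sufficiently accurate.

In the same way, 1/10 (decimal 0.1) cannot be represented exactly in base 2 (binary) as a "decimal" value; a repeating pattern after the point goes on forever. The value is not exact, and therefore you can't do exact math with it using normal floating point methods. Just like with base 10, there are other values that exhibit this problem as well.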

吹梦到西洲 2025-02-11 12:11:08

Most answers here address this question in very dry, technical terms. I'd like to address this in terms that normal human beings can understand.

Imagine that you are trying to slice up pizzas. You have a robotic pizza cutter that can cut pizza slices exactly in half. It can halve a whole pizza, or it can halve an existing slice, but in any case, the halving is always exact.

That pizza cutter has very fine movements, and if you start with a whole pizza, then halve that, and continue halving the smallest slice each time, you can do the halving 53 times before the slice is too small for even its high-precision abilities. At that point, you can no longer halve that very thin slice, but must either include or exclude it as is.

Now, how would you piece all the slices in such a way that would add up to one-tenth (0.1) or one-fifth (0.2) of a pizza? Really think about it, and try working it out. You can even try to use a real pizza, if you have a mythical precision pizza cutter at hand. :-)

Most experienced programmers, of course, know the real answer, which is that there is no way to piece together an exact tenth or fifth of the pizza using those slices, no matter how finely you slice them. You can do a pretty good approximation, and if you add up the approximation of 0.1 with the approximation of 0.2, you get a pretty good approximation of 0.3, but it's still just that, an approximation.

For double-precision numbers (which is the precision that allows you to halve your pizza 53 times), the numbers immediately less and greater than 0.1 are 0.09999999999999999167332731531132594682276248931884765625 and 0.1000000000000000055511151231257827021181583404541015625. The latter is quite a bit closer to 0.1 than the former, so a numeric parser will, given an input of 0.1, favour the latter.

(The difference between those two numbers is the "smallest slice" that we must decide to either include, which introduces an upward bias, or exclude, which introduces a downward bias. The technical term for that smallest slice is an ulp.)

In the case of 0.2, the numbers are all the same, just scaled up by a factor of 2. Again, we favour the value that's slightly higher than 0.2.

Notice that in both cases, the approximations for 0.1 and 0.2 have a slight upward bias. If we add enough of these biases in, they will push the number further and further away from what we want, and in fact, in the case of 0.1 + 0.2, the bias is high enough that the resulting number is no longer the closest number to 0.3.

In particular, 0.1 + 0.2 is really 0.1000000000000000055511151231257827021181583404541015625 + 0.200000000000000011102230246251565404236316680908203125 = 0.3000000000000000444089209850062616169452667236328125, whereas the number closest to 0.3 is actually 0.299999999999999988897769753748434595763683319091796875.
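
The neighbouring doubles mentioned above can be inspected directly. The JavaScript sketch below steps to an adjacent double by adding or subtracting 1 from the raw 64-bit pattern; the helper name neighbor is made up, and the trick as written is only valid for positive, finite, nonzero doubles.

```javascript
// Step to the adjacent double by adding or subtracting 1 from the raw
// 64-bit pattern. Valid for positive, finite, nonzero doubles only.
function neighbor(x, direction) {
  const view = new DataView(new ArrayBuffer(8));
  view.setFloat64(0, x);
  const bits = view.getBigUint64(0);
  view.setBigUint64(0, bits + BigInt(direction)); // direction: +1 or -1
  return view.getFloat64(0);
}

const below = neighbor(0.1, -1);
console.log(below.toFixed(56));        // the double just under 0.1
console.log(0.1 - below === 2 ** -56); // true: one ulp at this magnitude

// 0.1 + 0.2 lands exactly one ulp above the double nearest to 0.3.
console.log(neighbor(0.3, +1) === 0.1 + 0.2); // true
```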

P.S. Some programming languages also provide pizza cutters that can split slices into exact tenths. Although such pizza cutters are uncommon, if you do have access to one, you should use it when it's important to be able to get exactly one-tenth or one-fifth of a slice.

(Originally posted on Quora.)

江南烟雨〆相思醉 2025-02-11 12:11:08

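Floating point rounding errors. 0.1 cannot be represented as accurately in base 2 as in base 10 because base 2 lacks the prime factor 5. Just as 1/3 takes an infinite number of digits to represent in decimal but is "0.1" in base 3, 0.1 takes an infinite number of digits in base 2 where it does not in base 10. And computers don't have an infinite amount of memory.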

鹿港小镇 2025-02-11 12:11:08

My answer is quite long, so I've split it into three sections. Since the question is about floating point mathematics, I've put the emphasis on what the machine actually does. I've also made it specific to double (64 bit) precision, but the argument applies equally to any floating point arithmetic.

Preamble

An IEEE 754 double-precision binary floating-point format (binary64) number represents a number of the form

value = (-1)^s * (1.m₅₁m₅₀...m₂m₁m₀)₂ * 2^(e-1023)

in 64 bits:

The first bit is the sign bit: 1 if the number is negative, 0 otherwise¹.
The next 11 bits are the exponent, which is offset by 1023. In other words, after reading the exponent bits from a double-precision number, 1023 must be subtracted to obtain the power of two.
The remaining 52 bits are the significand (or mantissa). In the mantissa, an 'implied' 1. is always² omitted since the most significant bit of any binary value is 1.

¹ - IEEE 754 allows for the concept of a signed zero - +0 and -0 are treated differently: 1 / (+0) is positive infinity; 1 / (-0) is negative infinity. For zero values, the mantissa and exponent bits are all zero. Note: zero values (+0 and -0) are explicitly not classed as denormal².

² - This is not the case for denormal numbers, which have an offset exponent of zero (and an implied 0.). The range of denormal double precision numbers is d_min ≤ |x| ≤ d_max, where d_min (the smallest representable nonzero number) is 2^{-1023 - 51} (≈ 4.94 * 10^-324) and d_max (the largest denormal number, for which the mantissa consists entirely of 1s) is 2^{-1023 + 1} - 2^{-1023 - 51} (≈ 2.225 * 10^-308).

Turning a double precision number to binary

Many online converters exist to convert a double precision floating point number to binary (e.g. at binaryconvert.com), but here is some sample C# code to obtain the IEEE 754 representation for a double precision number (I separate the three parts with colons (:)):

public static string BinaryRepresentation(double value)
{
    long valueInLongType = BitConverter.DoubleToInt64Bits(value);
    string bits = Convert.ToString(valueInLongType, 2);
    string leadingZeros = new string('0', 64 - bits.Length);
    string binaryRepresentation = leadingZeros + bits;

    string sign = binaryRepresentation[0].ToString();
    string exponent = binaryRepresentation.Substring(1, 11);
    string mantissa = binaryRepresentation.Substring(12);

    return string.Format("{0}:{1}:{2}", sign, exponent, mantissa);
}
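
For readers following along in JavaScript, roughly the same decomposition can be sketched with a DataView. This is an illustrative equivalent rather than part of the original answer; the function name simply mirrors the C# one.

```javascript
// Split a double into sign, exponent and mantissa bit strings,
// separated by colons like the C# version above.
function binaryRepresentation(value) {
  const view = new DataView(new ArrayBuffer(8));
  view.setFloat64(0, value);
  // Read the same 8 bytes back as an unsigned 64-bit integer.
  const bits = view.getBigUint64(0).toString(2).padStart(64, "0");
  return `${bits[0]}:${bits.slice(1, 12)}:${bits.slice(12)}`;
}

console.log(binaryRepresentation(0.1));
console.log(binaryRepresentation(0.2));
console.log(binaryRepresentation(0.1 + 0.2));
```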

Getting to the point: the original question

(Skip to the bottom for the TL;DR version)

Cato Johnston (the question asker) asked why 0.1 + 0.2 != 0.3.

Written in binary (with colons separating the three parts), the IEEE 754 representations of the values are:

0.1 => 0:01111111011:1001100110011001100110011001100110011001100110011010
0.2 => 0:01111111100:1001100110011001100110011001100110011001100110011010

Note that the mantissa is composed of recurring digits of 0011. This is key to why there is any error to the calculations - 0.1, 0.2 and 0.3 cannot be represented in binary precisely in a finite number of binary bits any more than 1/9, 1/3 or 1/7 can be represented precisely in decimal digits.

Also note that we can decrease the power in the exponent by 52 and shift the point in the binary representation to the right by 52 places (much like 10^-3 * 1.23 == 10^-5 * 123). This then enables us to represent the binary representation as the exact value that it represents in the form a * 2^p, where 'a' is an integer.

Converting the exponents to decimal, removing the offset, and re-adding the implied 1 (in square brackets), 0.1 and 0.2 are:

0.1 => 2^-4 * [1].1001100110011001100110011001100110011001100110011010
0.2 => 2^-3 * [1].1001100110011001100110011001100110011001100110011010
or
0.1 => 2^-56 * 7205759403792794 = 0.1000000000000000055511151231257827021181583404541015625
0.2 => 2^-55 * 7205759403792794 = 0.200000000000000011102230246251565404236316680908203125

To add two numbers, the exponent needs to be the same, i.e.:

0.1 => 2^-3 *  0.1100110011001100110011001100110011001100110011001101(0)
0.2 => 2^-3 *  1.1001100110011001100110011001100110011001100110011010
sum =  2^-3 * 10.0110011001100110011001100110011001100110011001100111
or
0.1 => 2^-55 * 3602879701896397  = 0.1000000000000000055511151231257827021181583404541015625
0.2 => 2^-55 * 7205759403792794  = 0.200000000000000011102230246251565404236316680908203125
sum =  2^-55 * 10808639105689191 = 0.3000000000000000166533453693773481063544750213623046875

Since the sum is not of the form 2ⁿ * 1.{bbb} we increase the exponent by one and shift the decimal (binary) point to get:

sum = 2^-2  * 1.0011001100110011001100110011001100110011001100110011(1)
    = 2^-54 * 5404319552844595.5 = 0.3000000000000000166533453693773481063544750213623046875

There are now 53 bits in the mantissa (the 53rd is in parentheses in the line above). The default rounding mode for IEEE 754 is 'Round to Nearest' - i.e. the representable value closest to x is chosen, and if x falls exactly halfway between two values a and b, as it does here, the value where the least significant bit is zero is chosen.

a = 2^-54 * 5404319552844595 = 0.299999999999999988897769753748434595763683319091796875
  = 2^-2  * 1.0011001100110011001100110011001100110011001100110011

x = 2^-2  * 1.0011001100110011001100110011001100110011001100110011(1)

b = 2^-2  * 1.0011001100110011001100110011001100110011001100110100
  = 2^-54 * 5404319552844596 = 0.3000000000000000444089209850062616169452667236328125

Note that a and b differ only in the last bit; ...0011 + 1 = ...0100. In this case, the value with the least significant bit of zero is b, so the sum is:

sum = 2^-2  * 1.0011001100110011001100110011001100110011001100110100
    = 2^-54 * 5404319552844596 = 0.3000000000000000444089209850062616169452667236328125

whereas the binary representation of 0.3 is:

0.3 => 2^-2  * 1.0011001100110011001100110011001100110011001100110011
    =  2^-54 * 5404319552844595 = 0.299999999999999988897769753748434595763683319091796875

which only differs from the binary representation of the sum of 0.1 and 0.2 by 2^-54.

The binary representations of 0.1 and 0.2 are the most accurate representations of the numbers allowable by IEEE 754. The addition of these representations, due to the default rounding mode, results in a value which differs from the representation of 0.3 only in the least-significant-bit.
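
The integer form of the derivation can be replayed with BigInt arithmetic. The JavaScript sketch below uses the integer values quoted in this answer; the helper halveTiesToEven is a made-up name and only models the single rounding step needed here (dropping one bit from a positive integer), not IEEE 754 rounding in general.

```javascript
// 0.1 and 0.2 as exact integer multiples of 2^-55 (values from the text).
const a = 3602879701896397n;
const b = 7205759403792794n;
const exactSum = a + b; // 10808639105689191n, which needs 54 bits

// Halve the integer to get back to 53 bits (the scale becomes 2^-54).
// An odd input is an exact tie, which goes to the even neighbour.
function halveTiesToEven(n) {
  const half = n / 2n;            // BigInt division truncates
  if (n % 2n === 0n) return half; // exact, nothing to round
  return half % 2n === 0n ? half : half + 1n;
}

const rounded = halveTiesToEven(exactSum);
console.log(rounded);                                  // 5404319552844596n
console.log(Number(rounded) * 2 ** -54 === 0.1 + 0.2); // true
console.log(5404319552844595 * 2 ** -54 === 0.3);      // true
```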

TL;DR

Writing 0.1 + 0.2 in an IEEE 754 binary representation (with colons separating the three parts) and comparing it to 0.3 gives (I've put the distinct bits in square brackets):

0.1 + 0.2 => 0:01111111101:0011001100110011001100110011001100110011001100110[100]
0.3       => 0:01111111101:0011001100110011001100110011001100110011001100110[011]

Converted back to decimal, these values are:

0.1 + 0.2 => 0.300000000000000044408920985006...
0.3       => 0.299999999999999988897769753748...

The difference is exactly 2^-54, which is ~5.5511151231258 × 10^-17 - insignificant (for many applications) when compared to the original values.

Comparing the last few bits of a floating point number is inherently dangerous, as anyone who reads the famous "What Every Computer Scientist Should Know About Floating-Point Arithmetic" (which covers all the major parts of this answer) will know.

Most calculators use additional guard digits to get around this problem, which is how 0.1 + 0.2 would give 0.3: the final few bits are rounded.

最后的乘客 2025-02-11 12:11:08

In addition to the other correct answers, you may want to consider scaling your values to avoid problems with floating-point arithmetic.

For example:

var result = 1.0 + 2.0;     // result === 3.0 returns true

... instead of:

var result = 0.1 + 0.2;     // result === 0.3 returns false

The expression 0.1 + 0.2 === 0.3 returns false in JavaScript, but fortunately integer arithmetic in floating-point is exact, so decimal representation errors can be avoided by scaling.

As a practical example, to avoid floating-point problems where accuracy is paramount, it is recommended¹ to handle money as an integer representing the number of cents: 2550 cents instead of 25.50 dollars.

¹ Douglas Crockford: JavaScript: The Good Parts: Appendix A - Awful Parts (page 105).
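
A minimal sketch of the cents approach in JavaScript follows. The function name formatCents, the example amounts and the two-decimal dollar format are assumptions for illustration; real currency handling usually needs more (locales, other minor units, rounding rules for division).

```javascript
// Keep money as an integer number of cents; convert only for display.
function formatCents(cents) {
  if (!Number.isSafeInteger(cents)) {
    throw new RangeError("expected a safe integer number of cents");
  }
  const sign = cents < 0 ? "-" : "";
  const abs = Math.abs(cents);
  const dollars = Math.trunc(abs / 100);
  const remainder = String(abs % 100).padStart(2, "0");
  return `${sign}${dollars}.${remainder}`;
}

const price = 2550;   // 25.50 dollars
const shipping = 499; // 4.99 dollars
console.log(formatCents(price + shipping)); // "30.49"
console.log(10 + 20 === 30);                // true: integer sums are exact
```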

缪败 2025-02-11 12:11:08

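A floating-point number stored in a computer consists of two parts: an integer, and an exponent that the base is raised to and then multiplied by the integer part.

If the computer worked in base 10, 0.1 would be 1 × 10⁻¹, 0.2 would be 2 × 10⁻¹, and 0.3 would be 3 × 10⁻¹. Integer math is easy and exact, so adding 0.1 + 0.2 would obviously result in 0.3.

Computers don't usually work in base 10; they work in base 2. You can still get exact results for some values: for example, 0.5 is 1 × 2⁻¹ and 0.25 is 1 × 2⁻², and adding them gives 3 × 2⁻², or 0.75. Exactly.

The problem comes with numbers that can be represented exactly in base 10 but not in base 2. Those numbers have to be rounded to the closest representable value. Assuming the very common IEEE 64-bit floating-point format, the closest number to 0.1 is 3602879701896397 × 2⁻⁵⁵, and the closest number to 0.2 is 7205759403792794 × 2⁻⁵⁵; adding them together gives 10808639105689191 × 2⁻⁵⁵, which needs 54 bits and is therefore rounded to 5404319552844596 × 2⁻⁵⁴, an exact decimal value of 0.3000000000000000444089209850062616169452667236328125. Floating-point numbers are generally rounded for display.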
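
The integer-times-power-of-two view can be reproduced with Python's `fractions` module. This is a sketch, assuming binary64 floats: `Fraction(x)` converts a float to its exact rational value, so no rounding is hidden.

```python
from fractions import Fraction

a = Fraction(0.1)                   # exact value of the double nearest 0.1
b = Fraction(0.2)
exact_sum = a + b                   # exact rational sum, no rounding
stored_sum = Fraction(0.1 + 0.2)    # what the hardware stores after rounding

print(a == Fraction(3602879701896397, 2**55))            # True
print(b == Fraction(7205759403792794, 2**55))            # True
print(exact_sum == Fraction(10808639105689191, 2**55))   # True
print(stored_sum == Fraction(5404319552844596, 2**54))   # True
print(stored_sum > exact_sum)                            # True: the sum was rounded up
```

The exact sum has an odd 54-bit integer part, so it cannot be stored in 53 bits and the hardware has to round it.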

太阳公公是暖光 2025-02-11 12:11:08

In short it's because:

Floating point numbers cannot represent all decimals precisely in binary

So just like 10/3 which does not exist in base 10 precisely (it will be 3.33... recurring), in the same way 1/10 doesn't exist in binary.

So what? How to deal with it? Is there any workaround?

The best solution I have found is the following method:

parseFloat((0.1 + 0.2).toFixed(10)) => Will return 0.3

Let me explain why it's the best solution.
As others mentioned in the answers above, JavaScript's built-in toFixed() function is a good way to tackle the problem. But you will most likely run into some problems.

Imagine you are adding two floats such as 0.2 and 0.7: 0.2 + 0.7 = 0.8999999999999999.

Your expected result was 0.9, which means you need a result with 1 digit of precision in this case.
So you should have used (0.2 + 0.7).toFixed(1),
but you can't just pass a fixed parameter to toFixed() since it depends on the given number, for instance

0.22 + 0.7 = 0.9199999999999999

In this example you need 2 digits of precision, so it should be toFixed(2). So what should the parameter be to fit every given float?

You might say let it be 10 in every situation then:

(0.2 + 0.7).toFixed(10) => Result will be "0.9000000000"

Damn! What are you going to do with those unwanted zeros after the 9?
This is the time to convert it back to a float to get what you want:

parseFloat((0.2 + 0.7).toFixed(10)) => Result will be 0.9

Now that you found the solution, it's better to offer it as a function like this:

function floatify(number){
    return parseFloat((number).toFixed(10));
}
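
For comparison, here is a rough Python analogue of the same idea. It is a sketch, not the answer's own code: the `digits` parameter is an addition for illustration, and the fixed-point round trip discards anything beyond that many decimal places.

```python
def floatify(number: float, digits: int = 10) -> float:
    """Round-trip through a fixed-point string, like parseFloat(x.toFixed(10))."""
    return float(f"{number:.{digits}f}")

print(floatify(0.1 + 0.2))   # 0.3
print(floatify(0.2 + 0.7))   # 0.9
print(floatify(0.3 / 0.1))   # 3.0
print(floatify(1e-11))       # 0.0, values below 10 decimal places are lost
```

The last line shows the trade-off: the approach only suits values whose meaningful digits fit within the chosen number of decimal places.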

Try it yourself:

function floatify(number){
    return parseFloat((number).toFixed(10));
}
 
function addUp(){
  var number1 = +$("#number1").val();
  var number2 = +$("#number2").val();
  var unexpectedResult = number1 + number2;
  var expectedResult = floatify(number1 + number2);
  $("#unexpectedResult").text(unexpectedResult);
  $("#expectedResult").text(expectedResult);
}
addUp();

input{
  width: 50px;
}
#expectedResult{
color: green;
}
#unexpectedResult{
color: red;
}

<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<input id="number1" value="0.2" onclick="addUp()" onkeyup="addUp()"/> +
<input id="number2" value="0.7" onclick="addUp()" onkeyup="addUp()"/> =
<p>Expected Result: <span id="expectedResult"></span></p>
<p>Unexpected Result: <span id="unexpectedResult"></span></p>

You can use it this way:

var x = 0.2 + 0.7;
floatify(x);  => Result: 0.9

As W3Schools suggests, there is another solution too: you can multiply and divide to solve the problem above:

var x = (0.2 * 10 + 0.1 * 10) / 10;       // x will be 0.3

Keep in mind that (0.2 + 0.1) * 10 / 10 won't work at all, although it looks the same!
I prefer the first solution since I can apply it as a function that converts the input float to an accurate output float.

FYI, the same problem exists for:

Multiplication: for instance, 0.09 * 10 returns 0.8999999999999999. Apply the floatify function as a workaround: floatify(0.09 * 10) returns 0.9

Division: 0.3 / 0.1 = 2.9999999999999996 but floatify(0.3 / 0.1) returns 3

Subtraction: 1 - 0.8 = 0.19999999999999996 but floatify(1 - 0.8) returns 0.2

英雄似剑 2025-02-11 12:11:08

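Floating-point rounding errors. From "What Every Computer Scientist Should Know About Floating-Point Arithmetic":

Squeezing infinitely many real numbers into a finite number of bits requires an approximate representation. Although there are infinitely many integers, in most programs the result of integer computations can be stored in 32 bits. In contrast, given any fixed number of bits, most calculations with real numbers will produce quantities that cannot be exactly represented using that many bits. Therefore the result of a floating-point calculation must often be rounded in order to fit back into its finite representation. This rounding error is the characteristic feature of floating-point computation.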

贵在坚持 2025-02-11 12:11:08

My workaround:

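function add(a, b, precision) {
    // Scale factor 10^precision; falls back to 2 decimal places when precision is omitted (or 0).
    var x = Math.pow(10, precision || 2);
    // Snap each scaled operand to an integer, add the integers exactly, then scale back.
    return (Math.round(a * x) + Math.round(b * x)) / x;
}

precision is the number of digits to preserve after the decimal point during the addition.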
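
The same workaround as a short Python sketch. One difference to keep in mind: Python's `round()` sends exact ties to the nearest even integer, whereas JavaScript's `Math.round` sends them upward, so the two versions can differ on halfway cases.

```python
def add(a: float, b: float, precision: int = 2) -> float:
    scale = 10 ** precision
    # round() snaps each scaled operand to an integer, so the addition itself is exact.
    return (round(a * scale) + round(b * scale)) / scale

print(add(0.1, 0.2))          # 0.3
print(add(0.1, 0.2) == 0.3)   # True
```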

埖埖迣鎅 2025-02-11 12:11:08

No, not broken, but most decimal fractions must be approximated

Summary

Floating-point arithmetic is exact. Unfortunately, it doesn't match up well with our usual base-10 number representation, so we often end up giving it input that is slightly off from what we wrote.

Even simple numbers like 0.01, 0.02, 0.03, 0.04 ... 0.24 are not representable exactly as binary fractions. If you count up 0.01, 0.02, 0.03 ..., not until you reach 0.25 do you get the first fraction representable in base 2. If you tried that using FP, your 0.01 would have been slightly off, so the only way to add 25 of them up to a nice exact 0.25 would have required a long chain of causality involving guard bits and rounding. It's hard to predict, so we throw up our hands and say "FP is inexact", but that's not really true.

We constantly give the FP hardware something that seems simple in base 10 but is a repeating fraction in base 2.

How did this happen?

When we write in decimal, every fraction (specifically, every terminating decimal) is a rational number of the form

a / (2ⁿ × 5ᵐ)

In binary, we only get the 2ⁿ term, that is:

a / 2ⁿ

So in decimal, we can't represent 1/3. Because base 10 includes 2 as a prime factor, every number we can write as a binary fraction can also be written as a base 10 fraction. However, hardly anything we write as a base 10 fraction is representable in binary. In the range 0.01, 0.02, 0.03 ... 0.99, only three numbers can be represented in our FP format: 0.25, 0.50, and 0.75, because they are 1/4, 1/2, and 3/4, all numbers whose denominator uses only the 2ⁿ term.

In base 10 we can't represent 1/3. But in binary, we can't do 1/10 or 1/3.

So while every binary fraction can be written in decimal, the reverse is not true. And in fact most decimal fractions repeat in binary.
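
The claim about the hundredths can be checked mechanically. A small Python sketch, assuming binary64 floats: a value k/100 is exactly representable when the double nearest to it has exactly that rational value.

```python
from fractions import Fraction

# Fraction(float) is the exact value of the stored double;
# Fraction(k, 100) is the exact decimal we meant to write.
exact = [k / 100 for k in range(1, 100) if Fraction(k / 100) == Fraction(k, 100)]
print(exact)   # [0.25, 0.5, 0.75]
```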

Dealing with it

Developers are usually instructed to do < epsilon comparisons. Better advice might be to round to integral values (in the C library: round() and roundf(), i.e., stay in the FP format) and then compare. Rounding to a specific decimal fraction length solves most problems with output.

Also, on real number-crunching problems (the problems that FP was invented for on early, frightfully expensive computers) the physical constants of the universe and all other measurements are only known to a relatively small number of significant figures, so the entire problem space was "inexact" anyway. FP "accuracy" isn't a problem in this kind of application.

The whole issue really arises when people try to use FP for bean counting. It does work for that, but only if you stick to integral values, which kind of defeats the point of using it. This is why we have all those decimal fraction software libraries.

I love the Pizza answer by Chris, because it describes the actual problem, not just the usual handwaving about "inaccuracy". If FP were simply "inaccurate", we could fix that and would have done it decades ago. The reason we haven't is because the FP format is compact and fast and it's the best way to crunch a lot of numbers. Also, it's a legacy from the space age and arms race and early attempts to solve big problems with very slow computers using small memory systems. (Sometimes, individual magnetic cores for 1-bit storage, but that's another story.)

Conclusion

If you are just counting beans at a bank, software solutions that use decimal string representations in the first place work perfectly well. But you can't do quantum chromodynamics or aerodynamics that way.

一个人的旅程 2025-02-11 12:11:08

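Not all numbers can be represented by floats/doubles.
For example, the number "0.2" is represented as "0.200000003" in single precision under the IEEE 754 floating-point standard.

The model used to store real numbers under the hood represents a floating-point number as …

Even though you can type 0.2 easily, FLT_RADIX and DBL_RADIX are 2, not 10, for a computer with …

So it is a bit hard to represent such numbers exactly, even if you specify the variable explicitly without any intermediate calculation.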
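
The single-precision value quoted for 0.2 can be reproduced in Python with the `struct` module. This is a sketch that assumes the platform's `"f"` format is the 4-byte IEEE 754 single, which is the common case.

```python
import struct
from fractions import Fraction

# Round-trip 0.2 through a 4-byte single-precision value.
(single,) = struct.unpack("<f", struct.pack("<f", 0.2))

print(single)                                          # 0.20000000298023224
print(Fraction(single) == Fraction(13421773, 2**26))   # True: the exact stored value
print(single == 0.2)                                   # False: the double 0.2 is a closer approximation
```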

相思故 2025-02-11 12:11:08

Some statistics related to this famous double precision question.

When adding all values (a + b) using a step of 0.1 (from 0.1 to 100) we have ~15% chance of precision error. Note that the error could result in slightly bigger or smaller values.
Here are some examples:

0.1 + 0.2 = 0.30000000000000004 (BIGGER)
0.1 + 0.7 = 0.7999999999999999 (SMALLER)
...
1.7 + 1.9 = 3.5999999999999996 (SMALLER)
1.7 + 2.2 = 3.9000000000000004 (BIGGER)
...
3.2 + 3.6 = 6.800000000000001 (BIGGER)
3.2 + 4.4 = 7.6000000000000005 (BIGGER)

When subtracting all values (a - b where a > b) using a step of 0.1 (from 100 to 0.1) we have ~34% chance of precision error.
Here are some examples:

0.6 - 0.2 = 0.39999999999999997 (SMALLER)
0.5 - 0.4 = 0.09999999999999998 (SMALLER)
...
2.1 - 0.2 = 1.9000000000000001 (BIGGER)
2.0 - 1.9 = 0.10000000000000009 (BIGGER)
...
100 - 99.9 = 0.09999999999999432 (SMALLER)
100 - 99.8 = 0.20000000000000284 (BIGGER)

15% and 34% are indeed huge, so always use BigDecimal when precision is of great importance. With 2 decimal digits (step 0.01) the situation worsens a bit more (18% and 36%).
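
How such percentages might be measured can be sketched in Python. The exact method behind the figures above is not stated, so this is one plausible reading: a pair counts as an error when the float result differs from the float nearest the exact decimal answer. The names `has_error` and `error_rate`, and the range bound `n`, are assumptions for illustration.

```python
import operator

def has_error(i, j, op, scale=10):
    """True when op(i/scale, j/scale) is not the float nearest the exact decimal result."""
    return op(i / scale, j / scale) != op(i, j) / scale

def error_rate(op, pairs, scale=10):
    """Fraction of (i, j) pairs for which the float result is off."""
    pairs = list(pairs)
    return sum(has_error(i, j, op, scale) for i, j in pairs) / len(pairs)

n = 200  # hypothetical upper bound in tenths (0.1 .. 20.0); raise it to widen the range
add_pairs = [(i, j) for i in range(1, n + 1) for j in range(1, n + 1)]
sub_pairs = [(i, j) for i in range(1, n + 1) for j in range(1, i)]

print(f"addition:    {error_rate(operator.add, add_pairs):.1%}")
print(f"subtraction: {error_rate(operator.sub, sub_pairs):.1%}")
```

The rates depend on the range and step chosen, so the output of this sketch need not match the percentages quoted above.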

说好的呢 2025-02-11 12:11:08

Some high level languages such as Python and Java come with tools to overcome binary floating point limitations. For example:

Python's decimal module and Java's BigDecimal class, which represent numbers internally in decimal notation (as opposed to binary notation). Both have limited precision, so they are still error prone; however, they solve the most common problems with binary floating-point arithmetic.

Decimals are very nice when dealing with money: ten cents plus twenty cents are always exactly thirty cents:
```
  >>> from decimal import Decimal
  >>> 0.1 + 0.2 == 0.3
  False
  >>> Decimal('0.1') + Decimal('0.2') == Decimal('0.3')
  True
```
Python's decimal module is based on IEEE standard 854-1987.

Python's fractions module and Apache Commons' BigFraction class. Both represent rational numbers as (numerator, denominator) pairs, and they may give more accurate results than decimal floating-point arithmetic.

Neither of these solutions is perfect (especially if we look at performances, or if we require a very high precision), but still they solve a great number of problems with binary floating point arithmetic.
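
A brief sketch of the rational-number approach using Python's standard `fractions` module; the particular values are chosen only for illustration.

```python
from fractions import Fraction

tenth, fifth = Fraction(1, 10), Fraction(2, 10)
print(tenth + fifth == Fraction(3, 10))    # True
print(Fraction(1, 3) * 3 == 1)             # True, which no finite decimal can do
print(Fraction("0.1") + Fraction("0.2"))   # 3/10
```

Building a `Fraction` from a string (or from integers) keeps the value exact; building it from a float literal such as `0.1` would capture the binary approximation instead.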

一指流沙 2025-02-11 12:11:08

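People always assume this is a computer problem, but if you count by hand (base 10), you can't get (1/3 + 1/3 = 2/3) = true unless you have infinity to add 0.333... to 0.333... So, just as with the (1/10 + 2/10) !== 3/10 problem in base 2, you truncate it to 0.333 + 0.333 = 0.666 and probably round it to 0.667, which would also be technically inaccurate.

Count in ternary, though, and thirds are not a problem. Maybe some race with 15 fingers on each hand would ask why your decimal math was broken...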

左耳近心 2025-02-11 12:11:08

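Those weird numbers appear because computers use the binary (base 2) number system for calculation, while we use decimal (base 10).

Most fractional numbers cannot be represented exactly in binary, in decimal, or in both. The result is a rounded (but precise) number.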

深府石板幽径 2025-02-11 12:11:08

Did you try the duct tape solution?

Try to determine when errors occur and fix them with short if statements. It's not pretty, but for some problems it is the only solution and this is one of them.

 if( (n * 0.1) < 100.0 ) { return n * 0.1 - 0.000000000000001 ;}
                    else { return n * 0.1 + 0.000000000000001 ;}

I had the same problem in a scientific simulation project in C#, and I can tell you that if you ignore the butterfly effect, it's going to turn to a big fat dragon and bite you in the a**.

￠蛋碎的人ぎ生 2025-02-11 12:11:08

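Many of this question's numerous duplicates ask about the effects of floating-point rounding on specific numbers. In practice, it is easier to get a feeling for how it works by looking at the exact results of calculations of interest than by just reading about it. Some languages provide ways of doing that, such as converting a float or double to BigDecimal in Java.

Since this is a language-agnostic question, it needs language-agnostic tools, such as a decimal to floating-point converter.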

Applying it to the numbers in the question, treated as doubles:

0.1 converts to 0.1000000000000000055511151231257827021181583404541015625,

0.2 converts to 0.200000000000000011102230246251565404236316680908203125,

0.3 converts to 0.299999999999999988897769753748434595763683319091796875, and

0.30000000000000004 converts to 0.3000000000000000444089209850062616169452667236328125.

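Adding the first two numbers manually or in a decimal calculator such as Full Precision Calculator shows that the exact sum of the actual inputs is 0.3000000000000000166533453693773481063544750213623046875.

If it were rounded down to the equivalent of 0.3, the rounding error would be 0.0000000000000000277555756156289135105907917022705078125. Rounding up to the equivalent of 0.30000000000000004 also gives a rounding error of 0.0000000000000000277555756156289135105907917022705078125. The round-to-even tie breaker applies.

Returning to the floating-point converter, the raw hexadecimal for 0.30000000000000004 is 3FD3333333333334, which ends in an even digit and is therefore the correct result.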
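
The same exact values can be obtained without an online converter. This Python sketch uses the standard `decimal` module, which converts a float to its exact decimal expansion; the context precision of 60 digits is an assumption chosen so that the decimal additions themselves do not round.

```python
import struct
from decimal import Decimal, getcontext

getcontext().prec = 60   # enough digits that Decimal arithmetic here is exact

exact_sum = Decimal(0.1) + Decimal(0.2)   # exact sum of the two stored inputs
below = Decimal(0.3)                      # candidate result: the double nearest 0.3
above = Decimal(0.1 + 0.2)                # candidate result: what the hardware returns

print(exact_sum)
print(exact_sum - below == above - exact_sum)   # True: an exact tie between the candidates
print(struct.pack(">d", 0.1 + 0.2).hex())       # 3fd3333333333334
```

Because the two candidates are equally far from the exact sum, the tie-breaking rule decides, and the candidate whose last bit is even wins.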

醉生梦死 2025-02-11 12:11:08

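The kind of floating-point math that can be implemented in a digital computer necessarily uses an approximation of the real numbers and of the operations on them. (The standard version runs to over fifty pages of documentation and has a committee to deal with its errata and further refinement.)

This approximation is a mixture of approximations of different kinds, each of which can either be ignored or carefully accounted for, depending on its specific manner of deviating from exactitude. It also involves a number of explicit exceptional cases at both the hardware and software levels that most people walk right past while pretending not to notice.

If you need infinite precision (using the number π, for example, instead of one of its many shorter stand-ins), you should write or use a symbolic math program instead.

But if you are okay with the idea that floating-point math is sometimes fuzzy in value and logic and that errors can accumulate quickly, and you can write your requirements and tests to allow for that, then your code can frequently get by with what's in your FPU.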

能怎样 2025-02-11 12:11:08

Just for fun, I played with the representation of floats, following the definitions from the C99 standard, and wrote the code below.

The code prints the binary representation of a float in 3 separate groups

SIGN EXPONENT FRACTION

and after that it prints a sum that, when evaluated with enough precision, shows the value that really exists in hardware.

So when you write float x = 999..., the compiler transforms that number into the bit representation printed by the function xx, such that the sum printed by the function yy equals the stored number.

In reality, the stored value is only an approximation of what you wrote. For the number 999,999,999, the compiler inserts the bit representation of the float 1,000,000,000.

After the code I attach a console session in which I compute the sum of terms for both constants (-3.14, standing in for minus pi, and 999999999) as they really exist in hardware, inserted there by the compiler.

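#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <limits.h>

/* Copy the raw bits of a float into a 32-bit integer without breaking aliasing rules. */
static uint32_t
float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Print the bits of *x from most to least significant, with group labels. */
void
xx(float *x)
{
    uint32_t bits = float_bits(*x);
    unsigned char i = sizeof(*x)*CHAR_BIT-1;
    do {
        switch (i) {
        case 31:
             printf("sign: ");
             break;
        case 30:
             printf("exponent: ");
             break;
        case 23:
             /* Note: bit 23 is the lowest exponent bit, so the first bit printed
                after this label belongs to the exponent; the fraction is bits 22..0. */
             printf("fraction: ");
             break;

        }
        int b = (bits & ((uint32_t)1 << i)) != 0;
        printf("%d ", b);
    } while (i--);
    printf("\n");
}

/* Print the value of a as sign, a sum of negative powers of two, and a power-of-two scale. */
void
yy(float a)
{
    uint32_t bits = float_bits(a);
    int sign = !(bits & ((uint32_t)1 << 31));        /* nonzero means positive */
    uint32_t fraction = bits & (((uint32_t)1 << 23) - 1);
    int exponent = (int)((bits >> 23) & 255) - 127;  /* remove the exponent bias */

    printf(sign ? "positive" " ( 1+" : "negative" " ( 1+");
    unsigned int i = 1 << 22;
    unsigned int j = 1;
    do {
        /* Emit 1/2^j for every set fraction bit; close the parenthesis after the last one. */
        if (fraction & i)
            printf("1/(%d) %c", 1 << j, (fraction & (i-1)) ? '+' : ')');
    } while (j++, i >>= 1);

    printf("*2^%d", exponent);
    printf("\n");
}

int
main(void)
{
    float x = -3.14;
    float y = 999999999;
    printf("%zu\n", sizeof(x));
    xx(&x);
    xx(&y);
    yy(x);
    yy(y);
    return 0;
}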

Here is a console session in which I compute the real value of the float that exists in hardware. I used bc to print the sum of terms output by the main program. One can also paste that sum into a Python REPL or something similar.

-- .../terra1/stub
@ qemacs f.c
-- .../terra1/stub
@ gcc f.c
-- .../terra1/stub
@ ./a.out
sign: 1 exponent: 1 0 0 0 0 0 0 fraction: 0 1 0 0 1 0 0 0 1 1 1 1 0 1 0 1 1 1 0 0 0 0 1 1
sign: 0 exponent: 1 0 0 1 1 1 0 fraction: 0 1 1 0 1 1 1 0 0 1 1 0 1 0 1 1 0 0 1 0 1 0 0 0
negative ( 1+1/(2) +1/(16) +1/(256) +1/(512) +1/(1024) +1/(2048) +1/(8192) +1/(32768) +1/(65536) +1/(131072) +1/(4194304) +1/(8388608) )*2^1
positive ( 1+1/(2) +1/(4) +1/(16) +1/(32) +1/(64) +1/(512) +1/(1024) +1/(4096) +1/(16384) +1/(32768) +1/(262144) +1/(1048576) )*2^29
-- .../terra1/stub
@ bc
scale=15
( 1+1/(2) +1/(4) +1/(16) +1/(32) +1/(64) +1/(512) +1/(1024) +1/(4096) +1/(16384) +1/(32768) +1/(262144) +1/(1048576) )*2^29
999999999.999999446351872

That's it. With scale=15, bc reports the stored value of 999999999 as

999999999.999999446351872

You can also check with bc that -3.14 is perturbed. Do not forget to set a scale factor in bc.

The displayed sum is what is inside the hardware. The value you obtain by computing it depends on the scale you set, because bc truncates each division to that many decimal places; I set the scale factor to 15. Mathematically, with infinite precision, the sum is exactly 1,000,000,000.
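
The truncation introduced by bc's scale can be avoided by evaluating the sum with exact rational arithmetic. The following Python sketch is an illustration, not a port of the C program: the function names `float32_fields` and `float32_exact` are hypothetical, and it handles normal (non-zero, non-subnormal, finite) values only.

```python
import struct
from fractions import Fraction

def float32_fields(value):
    """Split a value, rounded to IEEE 754 single precision, into its three bit fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", float(value)))
    sign = bits >> 31
    exponent = ((bits >> 23) & 0xFF) - 127   # remove the bias
    fraction = bits & 0x7FFFFF               # low 23 bits
    return sign, exponent, fraction

def float32_exact(value):
    """Exact rational value of the stored single (normal numbers only)."""
    sign, exponent, fraction = float32_fields(value)
    significand = Fraction(2**23 + fraction, 2**23)   # implicit leading 1
    return (-1) ** sign * significand * Fraction(2) ** exponent

print(float32_fields(999999999))     # (0, 29, 7236392)
print(float32_exact(999999999))      # 1000000000
print(float(float32_exact(-3.14)))   # the value actually stored for -3.14
```

With exact fractions, the stored single for 999999999 comes out as exactly 1,000,000,000.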

三月梨花 2025-02-11 12:11:08

Floating-point numbers are represented, at the hardware level, as binary (base 2) fractions. For example, the decimal fraction:

0.125

has the value 1/10 + 2/100 + 5/1000 and, in the same way, the binary fraction:

0.001

has the value 0/2 + 0/4 + 1/8. These two fractions have the same value, the only difference is that the first is a decimal fraction, the second is a binary fraction.

Unfortunately, most decimal fractions cannot be represented exactly as binary fractions. Therefore, in general, the floating-point numbers you enter are only approximated by the binary fractions actually stored in the machine.

The problem is easier to approach in base 10. Take for example, the fraction 1/3. You can approximate it to a decimal fraction:

0.3

or better,

0.33

or better,

0.333

etc. No matter how many decimal places you write, the result is never exactly 1/3, but it is an estimate that always comes closer.

Likewise, no matter how many base-2 digits you use, the decimal value 0.1 cannot be represented exactly as a binary fraction. In base 2, 1/10 is the following periodic number:

0.0001100110011001100110011001100110011001100110011 ...

Stop at any finite number of bits, and you get an approximation.

For Python, on a typical machine, 53 bits are used for the precision of a float, so the value stored when you enter the decimal 0.1 is the binary fraction

0.00011001100110011001100110011001100110011001100110011010

which is close, but not exactly equal, to 1/10.

It's easy to forget that the stored value is an approximation of the original decimal fraction, due to the way floats are displayed in the interpreter. Python only displays a decimal approximation of the value stored in binary. If Python were to output the true decimal value of the binary approximation stored for 0.1, it would output:

>>> 0.1
0.1000000000000000055511151231257827021181583404541015625

This is a lot more decimal places than most people would expect, so Python displays a rounded value to improve readability:

>>> 0.1
0.1

It is important to understand that this is, in reality, an illusion: the stored value is not exactly 1/10; it is merely rounded for display. This becomes evident as soon as you perform arithmetic operations with these values:

>>> 0.1 + 0.2
0.30000000000000004

This behavior is inherent to the very nature of the machine's floating-point representation: it is not a bug in Python, nor is it a bug in your code. You can observe the same type of behavior in all other languages that use hardware support for calculating floating point numbers (although some languages do not make the difference visible by default, or not in all display modes).

Another surprise follows from this one. For example, if you try to round the value 2.675 to two decimal places, you will get

>>> round(2.675, 2)
2.67

The documentation for the round() primitive indicates that it rounds to the nearest value, with ties going away from zero (Python 3 instead rounds exact ties to the nearest even digit). Since the decimal fraction is exactly halfway between 2.67 and 2.68, you should expect to get (a binary approximation of) 2.68 under either rule. This is not the case, however, because when the decimal fraction 2.675 is converted to a float, it is stored as an approximation whose exact value is:

2.67499999999999982236431605997495353221893310546875

Since the approximation is slightly closer to 2.67 than 2.68, the rounding is down.

If you are in a situation where it matters which way decimal halfway cases are rounded, you should use the decimal module. By the way, the decimal module also provides a convenient way to "see" the exact value stored for any float.

>>> from decimal import Decimal
>>> Decimal(2.675)
Decimal('2.67499999999999982236431605997495353221893310546875')
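
A short sketch of how the decimal module handles the halfway case. The rounding mode `ROUND_HALF_UP` is chosen here for illustration; the module offers other modes as well.

```python
from decimal import Decimal, ROUND_HALF_UP

print(round(2.675, 2))     # 2.67: the stored float is slightly below the halfway point

exact = Decimal("2.675")   # built from a string, so the value is exactly 2.675
print(exact.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))   # 2.68
```

The key detail is constructing the `Decimal` from a string; `Decimal(2.675)` would inherit the binary approximation of the float.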

Another consequence of the fact that 0.1 is not stored as exactly 1/10 is that the sum of ten values of 0.1 does not give 1.0 either:

>>> sum = 0.0
>>> for i in range(10):
...     sum += 0.1
...
>>> sum
0.9999999999999999
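
Two common ways to tame this accumulated error in Python are sketched below: `math.fsum`, which tracks the lost low-order bits while summing, and rounding the final result. The choice of 10 decimal places is an assumption for illustration.

```python
import math

values = [0.1] * 10
print(sum(values))             # 0.9999999999999999
print(math.fsum(values))       # 1.0
print(round(sum(values), 10))  # 1.0
```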

The arithmetic of binary floating-point numbers holds many such surprises. The problem with "0.1" is explained in detail below, in the section "Representation errors". See The Perils of Floating Point for a more complete list of such surprises.

It is true that there is no simple answer; however, do not be overly suspicious of floating-point numbers! Errors in floating-point operations in Python come from the underlying hardware, and on most machines they are no more than 1 part in 2 ** 53 per operation. This is more than adequate for most tasks, but you should keep in mind that these are not decimal operations, and every operation on floating-point numbers may suffer a new rounding error.

Although pathological cases exist, for most common use cases you will get the expected result in the end by simply rounding the display to the number of decimal places you want. For fine control over how floats are displayed, see String Formatting Syntax for the formatting specifications of the str.format() method.

This part of the answer explains in detail the example of "0.1" and shows how you can perform an exact analysis of this type of case on your own. We assume that you are familiar with the binary representation of floating-point numbers. The term representation error means that most decimal fractions cannot be represented exactly in binary. This is the main reason why Python (or Perl, C, C++, Java, Fortran, and many others) usually doesn't display the exact result in decimal:

>>> 0.1 + 0.2
0.30000000000000004

Why? 1/10 and 2/10 are not exactly representable as binary fractions. Almost all machines today (July 2010) follow the IEEE-754 standard for floating point arithmetic, and most platforms use IEEE-754 "double precision" to represent Python floats. IEEE-754 double precision has 53 bits of precision, so on input the computer tries to convert 0.1 to the nearest fraction of the form J / 2 ** N, where J is an integer of exactly 53 bits. Rewrite:

1/10 ~= J / (2 ** N)

as:

J ~= 2 ** N / 10

Remembering that J has exactly 53 bits (so >= 2 ** 52 but < 2 ** 53), the best possible value for N is 56:

>>> 2 ** 52
4503599627370496
>>> 2 ** 53
9007199254740992
>>> 2 ** 56 // 10
7205759403792793

So 56 is the only possible value for N which leaves exactly 53 bits for J. The best possible value for J is therefore this quotient, rounded:

>>> q, r = divmod(2 ** 56, 10)
>>> r
6

Since the remainder is greater than half of 10, the best approximation is obtained by rounding up:

>>> q + 1
7205759403792794
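The steps above can be sketched as a small helper. The function name best_double_fraction is hypothetical, and the sketch assumes a positive input well inside the normal double range; it ignores overflow, subnormals, and the rare case where rounding pushes J up to 2 ** 53.

```python
from fractions import Fraction

def best_double_fraction(value):
    """Return (J, N) with 2**52 <= J < 2**53 and J / 2**N nearest to value.

    Assumes value is a positive Fraction (illustrative helper, not a
    full IEEE-754 conversion).
    """
    two = Fraction(2)
    n = 0
    while value * two**n < 2**52:    # scale up until J has 53 bits
        n += 1
    while value * two**n >= 2**53:   # scale down for large values
        n -= 1
    j = round(value * two**n)        # nearest integer, ties to even
    return j, n

j, n = best_double_fraction(Fraction(1, 10))
print(j, n)  # 7205759403792794 56
```

Comparing Fraction(j, 2**n) with Fraction(0.1) is a quick way to confirm that the hand derivation matches what the interpreter stores.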

Therefore the best possible approximation to 1/10 in IEEE-754 double precision is that value divided by 2 ** 56, that is:

7205759403792794/72057594037927936

Note that since the rounding was done upward, the result is actually slightly greater than 1/10; if we hadn't rounded up, the quotient would have been slightly less than 1/10. But in no case is it exactly 1/10!

So the computer never "sees" 1/10: what it sees is the exact fraction given above, the best approximation available among IEEE-754 double precision floating point numbers:

>>> .1 * 2 ** 56
7205759403792794.0

If we multiply this fraction by 10 ** 30, we can see the value of its 30 most significant decimal digits:

>>> 7205759403792794 * 10 ** 30 // 2 ** 56
100000000000000005551115123125

meaning that the exact value stored in the computer is approximately equal to the decimal value 0.100000000000000005551115123125. In versions prior to Python 2.7 and Python 3.1, Python rounded this value to 17 significant digits, displaying “0.10000000000000001”. Current versions of Python display the shortest decimal fraction that converts back to exactly the same binary value, which is simply “0.1”.
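The same analysis can be cross-checked with the standard library. This is a sketch with illustrative variable names, assuming IEEE-754 doubles: float.as_integer_ratio() returns the stored fraction in lowest terms, and Decimal(0.1) gives its full decimal expansion.

```python
from decimal import Decimal
from fractions import Fraction

# The stored fraction in lowest terms (the derived J / 2**56 halved).
numerator, denominator = (0.1).as_integer_ratio()

# The fraction derived by hand equals the stored value exactly.
matches_derivation = Fraction(0.1) == Fraction(7205759403792794, 2**56)

# Full decimal expansion of the stored double.
exact_decimal = Decimal(0.1)

print(numerator, denominator)  # 3602879701896397 36028797018963968
print(matches_derivation)      # True
print(exact_decimal)
print(repr(0.1))               # 0.1
```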


无风消散 2025-02-11 12:11:08

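The trap with floating point numbers is that they look like decimal but they work in binary.

The only prime factor of 2 is 2, while 10 has prime factors of 2 and 5. The result is that every number that can be written exactly as a binary fraction can also be written exactly as a decimal fraction, but only a subset of the numbers that can be written as decimal fractions can be written as binary fractions.

A floating point number is essentially a binary fraction with a limited number of significant digits. If you go past those significant digits, the result is rounded.

When you type a literal in your code or call a function to parse a string into a floating point number, it expects a decimal number and stores a binary approximation of that decimal number in the variable.

When you print a floating point number or call a function to convert one to a string, it prints a decimal approximation of the floating point number. It is possible to convert a binary number to decimal exactly, but no language I am aware of does that by default when converting to a string*. Some languages use a fixed number of significant digits; others use the shortest string that will "round trip" back to the same floating point value.

* Python does convert exactly when converting a floating point number to a decimal.Decimal. This is the easiest way I know of to get the exact decimal equivalent of a floating point number.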
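The three kinds of conversion described above can be compared side by side in Python. This is a sketch with illustrative names; repr() is Python's shortest round-trip form, format() with a precision gives a fixed number of significant digits, and Decimal() converts exactly.

```python
from decimal import Decimal

x = 0.1 + 0.2

shortest = repr(x)                   # shortest string that round-trips
round_trips = float(shortest) == x   # parsing it gives back the same double
fixed = format(x, ".6g")             # fixed number of significant digits
exact = Decimal(x)                   # exact decimal value of the double

print(shortest)     # 0.30000000000000004
print(round_trips)  # True
print(fixed)        # 0.3
print(exact)
```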


忆梦 2025-02-11 12:11:08

Since Python 3.5, you have been able to use the math.isclose() function for testing approximate equality:

>>> import math
>>> math.isclose(0.1 + 0.2, 0.3)
True
>>> 0.1 + 0.2 == 0.3
False
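math.isclose() uses a relative tolerance by default (rel_tol=1e-09, abs_tol=0.0), which matters when one of the values is zero or very small. The sketch below, with illustrative variable names, shows the two keyword arguments in use.

```python
import math

# Relative tolerance alone can never match a nonzero value against 0.0.
near_zero_default = math.isclose(1e-10, 0.0)

# An absolute tolerance handles comparisons around zero.
near_zero_abs = math.isclose(1e-10, 0.0, abs_tol=1e-9)

# A looser relative tolerance accepts values within 1% of each other.
loose_relative = math.isclose(1.0, 1.001, rel_tol=1e-2)

print(near_zero_default)  # False
print(near_zero_abs)      # True
print(loose_relative)     # True
```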


我纯我任性 2025-02-11 12:11:08

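Another way to look at this: 64 bits are used to represent numbers. As a consequence, no more than 2 ** 64 = 18,446,744,073,709,551,616 different numbers can be represented precisely.

However, mathematics tells us there are already infinitely many decimals between 0 and 1. IEEE 754 defines an encoding that uses these 64 bits efficiently for a much larger number space, plus NaN and +/- Infinity, so there are gaps between the exactly represented numbers, filled with numbers that can only be approximated.

Unfortunately, 0.3 sits in a gap.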
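The gap around 0.3 can be inspected directly. This sketch assumes Python 3.9 or later, where math.nextafter() and math.ulp() are available, and IEEE-754 doubles; the variable names are illustrative.

```python
import math
from fractions import Fraction

stored = 0.3                                 # nearest double below 3/10
next_above = math.nextafter(0.3, math.inf)   # the adjacent double above it
gap = math.ulp(0.3)                          # spacing of doubles near 0.3

# The rational number 3/10 lies strictly between the two neighbours.
in_gap = Fraction(stored) < Fraction(3, 10) < Fraction(next_above)

print(in_gap)                   # True
print(next_above == 0.1 + 0.2)  # True
print(next_above - stored == gap)  # True
```

So 0.1 + 0.2 lands on the double just above the gap, while the literal 0.3 is rounded to the double just below it.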


猫瑾少女 2025-02-11 12:11:08

Imagine working in base ten with, say, 8 digits of accuracy. You check whether

1/3 + 2/3 == 1

and learn that this returns false. Why? Well, as real numbers we have

1/3 = 0.333.... and 2/3 = 0.666....

Truncating at eight decimal places, we get

0.33333333 + 0.66666666 = 0.99999999

which is, of course, different from 1.00000000 by exactly 0.00000001.
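This eight-digit thought experiment can be reproduced with Python's decimal module. The sketch assumes truncation, so the context rounding is set to ROUND_DOWN (the default rounding mode would turn 2/3 into 0.66666667 and hide the effect); the names are illustrative.

```python
from decimal import Decimal, localcontext, ROUND_DOWN

with localcontext() as ctx:
    ctx.prec = 8               # eight significant digits
    ctx.rounding = ROUND_DOWN  # truncate instead of rounding to nearest
    third = Decimal(1) / Decimal(3)
    two_thirds = Decimal(2) / Decimal(3)
    total = third + two_thirds
    equals_one = total == Decimal(1)

print(third, two_thirds)  # 0.33333333 0.66666666
print(total)              # 0.99999999
print(equals_one)         # False
```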

The situation for binary numbers with a fixed number of bits is exactly analogous. As real numbers, we have

1/10 = 0.0001100110011001100... (base 2)

and

1/5 = 0.0011001100110011001... (base 2)

If we truncated these to, say, seven bits, then we'd get

0.0001100 + 0.0011001 = 0.0100101

while on the other hand,

3/10 = 0.01001100110011... (base 2)

which, truncated to seven bits, is 0.0100110, and these differ by exactly 0.0000001.
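The seven-bit arithmetic above can be verified with exact fractions. The helpers truncate and to_binary are hypothetical names introduced for this sketch, which assumes values between 0 and 1.

```python
from fractions import Fraction
from math import floor

BITS = 7

def truncate(value, bits=BITS):
    """Drop every binary digit after the first `bits` fractional bits."""
    scale = 2**bits
    return Fraction(floor(value * scale), scale)

def to_binary(value, bits=BITS):
    """Render a truncated value in [0, 1) as a binary fraction string."""
    return "0." + format(int(value * 2**bits), f"0{bits}b")

tenth = truncate(Fraction(1, 10))
fifth = truncate(Fraction(1, 5))
three_tenths = truncate(Fraction(3, 10))
total = tenth + fifth

print(to_binary(tenth), to_binary(fifth))  # 0.0001100 0.0011001
print(to_binary(total))                    # 0.0100101
print(to_binary(three_tenths))             # 0.0100110
print(to_binary(three_tenths - total))     # 0.0000001
```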

The exact situation is slightly more subtle because these numbers are typically stored in scientific notation. So, for instance, instead of storing 1/10 as 0.0001100 we may store it as something like 1.10011 * 2^-4, depending on how many bits we've allocated for the exponent and the mantissa. This affects how many digits of precision you get for your calculations.

The upshot is that because of these rounding errors you essentially never want to use == on floating-point numbers. Instead, you can check if the absolute value of their difference is smaller than some fixed small number.


尴尬癌患者 2025-02-11 12:11:08

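It's actually pretty simple. When you have a base-10 system (like ours), it can only cleanly express fractions whose denominators use the prime factors of the base. The prime factors of 10 are 2 and 5. So 1/2, 1/4, 1/5, 1/8, and 1/10 can all be expressed cleanly, because their denominators use only prime factors of 10. In contrast, 1/3, 1/6, and 1/7 are all repeating decimals, because their denominators contain a prime factor of 3 or 7. In binary (base 2), the only prime factor is 2, so you can only cleanly express fractions whose denominators contain only 2 as a prime factor. In binary, 1/2, 1/4, and 1/8 would all be expressed cleanly, while 1/5 or 1/10 would be repeating fractions. So 0.1 and 0.2 (1/10 and 1/5), while clean decimals in a base-10 system, are repeating fractions in the base-2 system the computer operates in. When you do math on these repeating fractions, you end up with leftovers that carry over when you convert the computer's base-2 (binary) number into a more human-readable base-10 number.

From https://0.30000000000000004.com/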
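The prime-factor rule can be turned into a short test: a fraction in lowest terms has a finite expansion in a given base exactly when every prime factor of its denominator also divides the base. The function name terminates_in_base is hypothetical, introduced only for this sketch.

```python
from fractions import Fraction
from math import gcd

def terminates_in_base(value, base):
    """Return True if `value` has a finite expansion in `base`."""
    d = value.denominator          # Fraction is always in lowest terms
    g = gcd(d, base)
    while g > 1:
        while d % g == 0:          # strip factors shared with the base
            d //= g
        g = gcd(d, base)
    return d == 1                  # nothing left: expansion terminates

for denom in (2, 3, 4, 5, 6, 7, 8, 10):
    frac = Fraction(1, denom)
    print(frac, terminates_in_base(frac, 10), terminates_in_base(frac, 2))
```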


去了角落 2025-02-11 12:11:08

Decimal numbers such as 0.1, 0.2, and 0.3 are not represented exactly in binary encoded floating point types. The sum of the approximations for 0.1 and 0.2 differs from the approximation used for 0.3, hence the falsehood of 0.1 + 0.2 == 0.3 as can be seen more clearly here:

#include <stdio.h>

int main() {
    printf("0.1 + 0.2 == 0.3 is %s\n", 0.1 + 0.2 == 0.3 ? "true" : "false");
    printf("0.1 is %.23f\n", 0.1);
    printf("0.2 is %.23f\n", 0.2);
    printf("0.1 + 0.2 is %.23f\n", 0.1 + 0.2);
    printf("0.3 is %.23f\n", 0.3);
    printf("0.3 - (0.1 + 0.2) is %g\n", 0.3 - (0.1 + 0.2));
    return 0;
}

Output:

0.1 + 0.2 == 0.3 is false
0.1 is 0.10000000000000000555112
0.2 is 0.20000000000000001110223
0.1 + 0.2 is 0.30000000000000004440892
0.3 is 0.29999999999999998889777
0.3 - (0.1 + 0.2) is -5.55112e-17

For these computations to be evaluated more reliably, you would need to use a decimal-based representation for floating point values. The C Standard does not specify such types by default, but describes them as an extension in a Technical Report.

The _Decimal32, _Decimal64 and _Decimal128 types might be available on your system (for example, GCC supports them on selected targets, but Clang does not support them on OS X).
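Where the C decimal types are not available, the same idea can be tried out in Python, whose standard decimal module provides a decimal-based floating point type. This is an analogue for illustration, not the C _Decimal64 API; the variable names are illustrative, and the decimals are built from strings so no binary rounding sneaks in.

```python
from decimal import Decimal

binary_equal = 0.1 + 0.2 == 0.3
decimal_equal = Decimal("0.1") + Decimal("0.2") == Decimal("0.3")
difference = Decimal("0.3") - (Decimal("0.1") + Decimal("0.2"))

print(binary_equal)   # False
print(decimal_equal)  # True
print(difference)     # 0.0
```

A decimal type still rounds values such as 1/3, so it moves the representation problem to different fractions rather than removing it.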


还如梦归 2025-02-11 12:11:08

Normal arithmetic is base-10, so decimals represent tenths, hundredths, etc. When you try to represent a floating-point number in binary base-2 arithmetic, you are dealing with halves, fourths, eighths, etc.

In the hardware, floating points are stored as integer mantissas and exponents. Mantissa represents the significant digits. Exponent is like scientific notation but it uses a base of 2 instead of 10. For example 64.0 would be represented with a mantissa of 1 and exponent of 6. 0.125 would be represented with a mantissa of 1 and an exponent of -3.
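Python can show the mantissa and exponent of a float directly. In this sketch (illustrative names, assuming IEEE-754 doubles), float.hex() prints the mantissa in hexadecimal followed by the power-of-two exponent, and math.frexp() returns a mantissa normalized to the range [0.5, 1), so its exponent is one higher than in the hex form.

```python
import math

hex_64 = (64.0).hex()       # mantissa 1, exponent 6
hex_eighth = (0.125).hex()  # mantissa 1, exponent -3
hex_tenth = (0.1).hex()     # repeating pattern, rounded in the last digit

mantissa, exponent = math.frexp(64.0)

print(hex_64)              # 0x1.0000000000000p+6
print(hex_eighth)          # 0x1.0000000000000p-3
print(hex_tenth)           # 0x1.999999999999ap-4
print(mantissa, exponent)  # 0.5 7
```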

The fractional part of a binary floating point number has to be built by adding up negative powers of 2:

0.1b = 0.5d
0.01b = 0.25d
0.001b = 0.125d
0.0001b = 0.0625d
0.00001b = 0.03125d

and so on.

It is common to use an error delta instead of equality operators when dealing with floating point arithmetic. Instead of

if(a==b) ...

you would use

delta = 0.0001; // or some arbitrarily small amount
if(a - b > -delta && a - b < delta) ...
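A Python sketch of the same fixed-delta comparison, together with two cases where a fixed delta gives a questionable answer. The helper name nearly_equal and the chosen test values are illustrative assumptions; math.isclose() is shown for contrast because it scales the tolerance with the operands.

```python
import math

def nearly_equal(a, b, delta=0.0001):
    """Fixed absolute tolerance, as in the snippet above."""
    return abs(a - b) < delta

small_ok = nearly_equal(0.1 + 0.2, 0.3)

# Near 1e16 adjacent doubles are 2 apart, so a delta of 0.0001 rejects
# neighbours even though they agree to about 16 significant digits.
big = 1e16
fixed_rejects_neighbours = nearly_equal(big, big + 2.0)
relative_accepts_neighbours = math.isclose(big, big + 2.0)

# For tiny values the fixed delta accepts numbers that differ by a factor of 2.
tiny_false_positive = nearly_equal(1e-9, 2e-9)

print(small_ok)                     # True
print(fixed_rejects_neighbours)     # False
print(relative_accepts_neighbours)  # True
print(tiny_false_positive)          # True
```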


墨洒年华 2025-02-11 12:11:08

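There are projects that aim to fix floating point implementation issues.

Take a look at Unum & Posit, for example, which showcases a number type called posit (and its predecessor, unum) that promises better accuracy with fewer bits. If my understanding is correct, it also fixes the kind of problem described in the question. It is a very interesting project, and the person behind it is a mathematician, Dr. John Gustafson.

The whole thing is open source, with many actual implementations in C/C++, Python, Julia and C# (https://hastlayer.com/arithmetics).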


Is floating point math broken?
