Why do some floating-point calculations turn out the way they do? (e.g. 123456789f + 1 = 123456792)
I'm trying to get a better understanding of floating point arithmetic, the attending errors that occur and accrue, as well as why exactly the results turn out the way they do. Here are 3 Examples in particular I'm currently working on:
1.) 0.1+0.1+0.1+0.1+0.1+0.1+0.1+0.1+0.1+0.1-1.0 = -1.1102230246251565E-16, i.e. adding 0.1 ten times gives me a number slightly less than 1.0. However, 0.1 is represented (as a double) as slightly larger than 0.1. Also, 0.1*3 is slightly larger than 0.3, but 0.1*8 is slightly smaller than 0.8.
2.) 123456789f+1 = 123456792 and 123456789f +4 = 123456800.
What's up with those results? It's all still a bit mysterious to me.

Typical modern processors and programming languages use IEEE-754 arithmetic (more or less) with 32-bit binary floating-point for `float` and 64-bit binary floating-point for `double`. In `double`, a 53-bit significand is used. This means that, when a decimal numeral is converted to `double`, it is converted to some number s•f•2^e, where s is a sign (+1 or −1), f is an unsigned integer that can be represented in 53 bits, and e is an integer between −1074 and 971, inclusive. (Or, if the number being converted is too large, the result can be +infinity or −infinity.) (Those who know the floating-point format may complain that the exponent is properly between −1022 and 1023, but I have shifted the significand to make it an integer. I am describing the mathematical value, not the encoding.)

Converting .1 to `double` yields 3602879701896397 / 36028797018963968, because, of all the numbers in the required form, that one is closest to .1. The denominator is 2^55, so e is −55.

When we add two of these, we get 7205759403792794 / 36028797018963968. That is fine; the numerator is still less than 2^53, so it fits in the format.
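As a quick check of these two values, here is a minimal Python sketch. It assumes that Python's `float` is an IEEE-754 double (true on typical platforms) and uses the standard `fractions.Fraction` to read back the exact stored value; the variable names are only illustrative.

```python
from fractions import Fraction

x = 0.1                      # stored as the double nearest to 0.1
exact = Fraction(x)          # exact rational value of that double
print(exact)                 # 3602879701896397/36028797018963968
print(exact.denominator == 2**55)   # True: the denominator is 2^55
print(exact > Fraction(1, 10))      # True: slightly larger than 0.1

two = Fraction(x + x)        # the sum of two copies, as stored
print(two == 2 * exact)      # True: this addition needs no rounding
```

`Fraction` prints values in lowest terms, so the sum of two copies shows up as 3602879701896397/18014398509481984 rather than with the 2^55 denominator used in the text.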
When we add a third 3602879701896397 / 36028797018963968, the mathematical result is 10808639105689191 / 36028797018963968. Unfortunately, the numerator is too large; it is larger than 2^53 (9007199254740992). So the floating-point hardware cannot return that number. It has to make it fit somehow.
If we divide the numerator and the denominator by two, we have 5404319552844595.5 / 18014398509481984. This has the same value, but the numerator is not an integer. To make it fit, the hardware rounds it to an integer. When the fraction is exactly 1/2, the rule is to round to make the result even, so the hardware returns 5404319552844596 / 18014398509481984.
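The halving step with its tie rule can be sketched in a few lines of Python. The helper `halve_round_half_even` is a hypothetical name introduced for illustration, not a library function, and the final comparison assumes Python floats are IEEE-754 doubles using the default round-to-nearest-even mode.

```python
from fractions import Fraction

def halve_round_half_even(numerator: int) -> int:
    """Divide an integer numerator by two; on a .5 tie, pick the even neighbour."""
    quotient, remainder = divmod(numerator, 2)
    if remainder == 1 and quotient % 2 == 1:
        quotient += 1        # tie and the lower neighbour is odd: go up to even
    return quotient

exact_numerator = 3 * 3602879701896397             # 10808639105689191, over 2^55
rounded = halve_round_half_even(exact_numerator)   # numerator over 2^54
print(rounded)                                     # 5404319552844596

# Compare with what the floating-point addition actually stores.
print(Fraction(0.1 + 0.1 + 0.1) == Fraction(rounded, 2**54))   # True
```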
Next, we take the current sum, 5404319552844596 / 18014398509481984, and add 3602879701896397 / 36028797018963968 again. This time, the sum is 7205759403792794.5 / 18014398509481984. In this case, the hardware rounds down, returning 7205759403792794 / 18014398509481984.
Then we add 7205759403792794 / 18014398509481984 and 3602879701896397 / 36028797018963968, and the sum is 9007199254740992.5 / 18014398509481984. Note that the numerator not only has a fraction but is larger than 2^53. So we have to reduce it again, which produces 4503599627370496.25 / 9007199254740992. Rounding the numerator to an integer produces 4503599627370496 / 9007199254740992.
That is exactly 1/2. At this point, the rounding errors have coincidentally canceled; adding .1 five times yields exactly .5.
When we add 4503599627370496 / 9007199254740992 and 3602879701896397 / 36028797018963968, the result is exactly 5404319552844595.25 / 9007199254740992. The hardware rounds down and returns 5404319552844595 / 9007199254740992.
Now you can see we are going to round down repeatedly. To add 3602879701896397 / 36028797018963968 to the accumulating sum, the hardware has to divide its numerator by four to make it match. That means the fraction part is always going to be .25, and it will be rounded down. So the next four sums are also rounded down. We end up with 9007199254740991 / 9007199254740992, which is just less than 1.
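The whole sequence of ten additions can be replayed with a short Python sketch that compares each stored sum against the exact rational sum. It assumes Python floats are IEEE-754 doubles with the default rounding mode; `directions` and the other names are illustrative only.

```python
from fractions import Fraction

tenth = Fraction(0.1)        # exact value of the double nearest to 0.1
total = 0.0
directions = []

for step in range(1, 11):
    exact_sum = Fraction(total) + tenth   # what exact arithmetic would give
    total = total + 0.1                   # what the hardware returns (rounded)
    stored = Fraction(total)
    if stored == exact_sum:
        directions.append("exact")
    elif stored > exact_sum:
        directions.append("up")
    else:
        directions.append("down")
    print(step, repr(total), directions[-1])

print(total - 1.0)           # -1.1102230246251565e-16
```

Only the third addition rounds up; every later one rounds down, which is why the final sum lands just below 1 even though each 0.1 is stored slightly too large.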
With `float` instead of `double`, the numerator has to fit in 24 bits, so it has to be less than 2^24 (16777216). So 123456789 is too big even before any arithmetic is done. It has to be expressed as 15432099 • 2^3, which is 123456792. The exact mathematical result of adding 1 is 15432099.125 • 2^3, and rounding that significand to an integer yields 15432099 • 2^3, so there is no change. But, if you add four, the result is 15432099.5 • 2^3, and that rounds to 15432100 • 2^3, which is 123456800.
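Python has no built-in 32-bit float type, so the sketch below imitates one by round-tripping a value through the standard `struct` module. The helper `to_float32` is a hypothetical name for illustration. The approach relies on an assumption worth stating: the intermediate sums here are exact in double precision, so rounding them once to single precision gives the same result as a true `float` addition would.

```python
import struct

def to_float32(x: float) -> float:
    """Round a double to the nearest IEEE-754 single, then widen it back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

a = to_float32(123456789.0)
print(int(a))                      # 123456792 = 15432099 * 2^3
print(int(a) // 8 < 2**24)         # True: the significand fits in 24 bits

print(int(to_float32(a + 1.0)))    # 123456792: 15432099.125 rounds down
print(int(to_float32(a + 4.0)))    # 123456800: 15432099.5 ties to even 15432100
```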