Is there a workaround for floating-point arithmetic problems in C++?
I am doing some floating-point arithmetic and having precision problems. The resulting value differs between two machines for the same input. I read the post "Why can't I multiply a float?" and other material on the web, and understood that this has to do with the binary representation of floating-point numbers and with machine epsilon. However, I wanted to check whether there is a way to solve this problem, or some workaround for floating-point arithmetic in C++. I am converting a float to an unsigned short for storage and converting back when necessary. However, when I convert it back from the unsigned short, the precision (to 6 decimal places) remains correct on one machine but fails on the other.
// Convert float to unsigned short (the cast truncates toward zero)
unsigned short sConst = 0xFFFF;
unsigned short shortValue = (unsigned short)(floatValue * sConst);
// Convert unsigned short back to float
float floatValue = ((float)shortValue / sConst);
Comments (4)
A short must be at least 16 bits, and in a whole lot of implementations that's exactly what it is. An unsigned 16-bit short will hold values from 0 to 65535. That means that a short will not hold a full five digits of precision, and certainly not six. If you want six digits, you need 20 bits.
Therefore, any loss of precision is likely due to the fact that you're trying to pack six digits of precision into something that holds fewer than five. There is no solution to this, other than using an integral type that probably takes as much storage as a float.
I don't know why it would seem to work on one given system. Were you using the same numbers on both? Did one use an older floating-point system that coincidentally gave the results you were expecting on the samples you tried? Was it possibly using a larger short than the other?
If you want to use native floating-point types, the best you can do is to assert that the values output by your program do not differ too much from a set of reference values.
The precise definition of "too much" depends entirely on your application. For example, if you compute a + b on different platforms, you should find the two results to be within machine precision of each other. On the other hand, if you're doing something more complicated like matrix inversion, the results will most likely differ by more than machine precision. Determining precisely how close you can expect the results to be to each other is a very subtle and complicated process. Unless you know exactly what you are doing, it is probably safer (and saner) to determine the amount of precision you need downstream in your application and verify that the result is sufficiently precise.
To get an idea of how to compute the relative error between two floating-point values robustly, see this answer and the floating-point guide linked therein: "Floating point comparison functions for C#".
Are you looking for a standard like this?
Programming Languages C++ - Technical Report of Type 2 on Extensions for the programming language C++ to support decimal floating-point arithmetic (draft)
Instead of using 0xFFFF, use roughly half of it, i.e. 32768, for the conversion. 32768 (0x8000) has the binary representation 1000000000000000, whereas 0xFFFF has the binary representation 1111111111111111. Because 0x8000 is a power of two, the multiplication and division performed during conversion (to short, or back to float) only shift the binary exponent and do not disturb the significant bits of the value. For a one-way conversion, however, 0xFFFF is preferable, as it gives a more accurate result.