Is there a solution to floating-point arithmetic problems in C++?

Posted on 2024-09-29 08:44:55

I am doing some floating-point arithmetic and having precision problems. The resulting value differs between two machines for the same input. I read the post "Why can't I multiply a float?" and other material on the web, and understood that it has to do with the binary representation of floating point and with machine epsilon. However, I wanted to check whether there is a way to solve this problem, or some workaround for floating-point arithmetic in C++. I am converting a float to unsigned short for storage and converting back when necessary. However, when I convert it back, the precision (to 6 decimal places) remains correct on one machine but fails on the other.

// Convert float to unsigned short
unsigned short sConst = 0xFFFF;
unsigned short shortValue = (unsigned short)(floatValue * sConst);

// Convert unsigned short back to float
float floatValue = ((float)shortValue / sConst);


意中人 2024-10-06 08:44:55

A short must be at least 16 bits, and in a whole lot of implementations that's exactly what it is. An unsigned 16-bit short will hold values from 0 to 65535. That means that a short will not hold a full five digits of precision, and certainly not six. If you want six digits, you need 20 bits, since 2^20 = 1,048,576 is the first power of two that reaches 10^6.

Therefore, any loss of precision is likely due to the fact that you're trying to pack six digits of precision into something less than five digits. There is no solution to this, other than using an integral type that probably takes as much storage as a float.

I don't know why it would seem to work on one given system. Were you using the same numbers on both? Did one use an older floating-point system, and one that coincidentally gave the results you were expecting on the samples you tried? Was it possibly using a larger short than the other?
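The digit counts in this answer can be checked with a little arithmetic. The sketch below is only an illustration: the helper names `bitsForDecimalDigits` and `decimalDigitsForBits` are made up here and do not come from the thread. It computes the smallest bit width that covers a given number of decimal digits, and the number of full decimal digits a given bit width can always hold.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: smallest b such that 2^b >= 10^digits,
// i.e. the bits needed to store every value with `digits` decimal digits.
int bitsForDecimalDigits(int digits) {
    assert(digits > 0);
    return static_cast<int>(std::ceil(digits * std::log2(10.0)));
}

// Hypothetical helper: largest d such that 10^d <= 2^bits,
// i.e. the decimal digits that `bits` bits can always represent in full.
int decimalDigitsForBits(int bits) {
    assert(bits > 0);
    return static_cast<int>(std::floor(bits * std::log10(2.0)));
}
```

With these definitions, `bitsForDecimalDigits(6)` evaluates to 20 and `decimalDigitsForBits(16)` evaluates to 4, which matches the point that a 16-bit short falls short of five full digits.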

当梦初醒 2024-10-06 08:44:55

If you want to use native floating point types, the best you can do is to assert that the values output by your program do not differ too much from a set of reference values.

The precise definition of "too much" depends entirely on your application. For example, if you compute a + b on different platforms, you should find the two results to be within machine precision of each other. On the other hand, if you're doing something more complicated like matrix inversion, the results will most likely differ by more than machine precision. Determining precisely how close you can expect the results to be to each other is a very subtle and complicated process. Unless you know exactly what you are doing, it is probably safer (and saner) to determine the amount of precision you need downstream in your application and verify that the result is sufficiently precise.

To get an idea about how to compute the relative error between two floating point values robustly, see this answer and the floating point guide linked therein:

Floating point comparison functions for C#
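Since the question is about C++, a comparable check can be sketched in that language. This is one common way to combine a relative and an absolute tolerance, not the code from the linked answer; the function name `nearlyEqual` and both tolerance parameters are assumptions chosen for illustration, and suitable tolerance values depend on the application.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical helper: true when a and b agree to within relTol relative to
// the larger magnitude, or to within absTol when both are close to zero.
bool nearlyEqual(double a, double b, double relTol, double absTol) {
    assert(relTol >= 0.0 && absTol >= 0.0);
    if (a == b) return true;                 // also covers equal infinities
    const double diff = std::fabs(a - b);
    if (!std::isfinite(diff)) return false;  // NaN or mismatched infinities
    const double scale = std::max(std::fabs(a), std::fabs(b));
    return diff <= std::max(absTol, relTol * scale);
}
```

A test harness could then compare each program output against its reference value with, say, `nearlyEqual(result, reference, 1e-6, 1e-12)`, instead of requiring bit-identical results on every machine.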

孤者何惧 2024-10-06 08:44:55

Instead of using 0xFFFF, use roughly half of it, i.e. 32768, for the conversion. 32768 (0x8000) has the binary representation 1000000000000000, whereas 0xFFFF is 1111111111111111. Because 0x8000 is a power of two, multiplying or dividing by it during conversion (to short, or back to float) only changes the float's exponent and leaves the significand bits untouched, so no rounding is introduced. For a one-way conversion, however, 0xFFFF is preferable, as it leads to a more accurate result.
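The power-of-two argument can be tried out with a small round-trip counter. This is a sketch under assumptions: `countRoundTripMismatches` is a hypothetical helper, not code from the thread, and it keeps the truncating cast from the question's snippet on purpose so the two scale constants can be compared.

```cpp
#include <cassert>

// Hypothetical helper: counts the codes in [0, maxCode] that do not survive
// code -> float -> code when `scale` is the conversion constant.
// The narrowing cast truncates, as in the question's snippet.
int countRoundTripMismatches(float scale, unsigned maxCode) {
    assert(scale > 0.0f && maxCode <= 0xFFFFu);
    int mismatches = 0;
    for (unsigned code = 0; code <= maxCode; ++code) {
        const float asFloat = static_cast<float>(code) / scale;  // stored form -> float
        const unsigned short back =
            static_cast<unsigned short>(asFloat * scale);        // float -> stored form
        if (back != code) ++mismatches;
    }
    return mismatches;
}
```

Calling `countRoundTripMismatches(32768.0f, 32768u)` returns 0, because scaling by a power of two is exact. The count for `countRoundTripMismatches(65535.0f, 65535u)` may be nonzero and may vary with how a given compiler and CPU evaluate intermediate results, which is consistent with the machine-dependent behavior described in the question.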
