For a given precision, what is the maximum value for which float32 gives the same result as float64?



With NumPy, I'm trying to understand what is the maximum value that can be downcast from float64 to float32 with a loss of accuracy less than or equal to 0.001.

Since I could not find a simple explanation online, I quickly came up with this piece of code to test it:

import numpy as np

result = {}
for j in range(1, 1000):                   # fractional part: j/1000
    for i in range(1, 1_000_000):          # integer part
        num = i + j / 1000
        x = np.array([num], dtype=np.float32)
        y = np.array([num], dtype=np.float64)
        if abs(x[0] - y[0]) > 0.001:       # error introduced by the float32 cast
            result[j] = i                  # first integer part where the error gets too large
            break
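
A vectorized sketch of the same check, scanning only the multiples of 0.001 around the suspected threshold (it assumes NumPy is imported as np, and the scan range is just an illustrative choice):

vals = np.arange(32_000_000, 33_000_000, dtype=np.float64) / 1000.0   # 32000.000 ... 32999.999
err = np.abs(vals.astype(np.float32).astype(np.float64) - vals)       # round-trip error of the float32 cast
print(vals[err > 0.001][0])   # first value whose error exceeds 0.001, just above 32768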

Based on the results, it seems any positive value < 32768 can be safely downcast from float64 to float32 with an acceptable loss of accuracy (given the criterion of <= 0.001).

Is this correct?
Could someone explain the math behind it?

Thanks a lot


深居我梦 2025-01-24 09:30:44


Assuming IEEE 754 representation, float32 has a 24-bit significand precision, while float64 has a 53-bit significand precision (except for “denormal” numbers).
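
As a quick sanity check, NumPy's type metadata exposes these widths (nmant counts the stored fraction bits, i.e. excluding the implicit leading 1 of normal numbers):

import numpy as np

print(np.finfo(np.float32).nmant + 1)   # 24 significant bits
print(np.finfo(np.float64).nmant + 1)   # 53 significant bits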

In order to represent a number with an absolute error of at most 0.001, you need at least 9 bits to the right of the binary point, which means the numbers are rounded off to the nearest multiple of 1/512, thus having a maximum representation error of 1/1024 = 0.0009765625 < 0.001.
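
You can see that granularity directly with np.spacing; 20000.0 is just an arbitrary sample point in the range [2^14, 2^15):

import numpy as np

gap = np.spacing(np.float32(20000.0))   # spacing of float32 values in [16384, 32768)
print(gap)       # 0.001953125  == 1/512
print(gap / 2)   # 0.0009765625 == 1/1024, the worst-case rounding error in this range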

With 24 significant bits in total, and 9 to the right of the binary point, that leaves 15 bits to the left of the binary point, which can represent all integers less than 2^15 = 32768, as you have experimentally determined.
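
A quick check on either side of that threshold (the two sample values are arbitrary) confirms it:

import numpy as np

for x in (32767.999, 32768.002):
    err = abs(float(np.float32(x)) - x)   # round-trip through float32, measured in float64
    print(x, err, err <= 0.001)           # ~0.00095 (OK) below the threshold, ~0.00191 above it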

However, there are some numbers higher than this threshold that still have an error less than 0.001. As Eric Postpischil pointed out in his comment, all float64 values between 32768.0 and 32768.001 (the largest being exactly 32768 + 137438953/2^37), which the float32 conversion rounds down to exactly 32768.0, meet your accuracy requirement. And of course, any number that happens to be exactly representable in a float32 will have no representation error.
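
A sketch of that boundary case, using the constant quoted above:

import numpy as np

x = 32768 + 137438953 / 2**37     # largest float64 strictly below 32768.001
print(float(np.float32(x)))       # 32768.0 -- float32 rounds it back down
print(x - float(np.float32(x)))   # ~0.0009999999966, still within the 0.001 budget

y = 1048576.5                     # 2**20 + 1/2 fits in 24 significant bits
print(float(np.float32(y)) == y)  # True: exactly representable, so no error at all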
