For a given precision, what is the maximum Float32 value that gives the same result as Float64?
With numpy, I'm trying to understand the maximum value that can be downcast from float64 to float32 with a loss of accuracy less than or equal to 0.001.
Since I could not find a simple explanation online, I quickly came up with this piece of code to test:
import numpy as np

result = {}
for j in range(1, 1000):
    for i in range(1, 1_000_000):
        num = i + j / 1000
        x = np.array([num], dtype=np.float32)
        y = np.array([num], dtype=np.float64)
        if abs(x[0] - y[0]) > 0.001:
            result[j] = i
            break
Based on the results, it seems any positive value < 32768 can be safely downcast from float64 to float32 with an acceptable loss of accuracy (given the criterion of <= 0.001).
Is this correct? Could someone explain the math behind it?
Thanks a lot
Assuming IEEE 754 representation, float32 has a 24-bit significand precision, while float64 has a 53-bit significand precision (except for "denormal" numbers). To represent a number with an absolute error of at most 0.001, you need at least 9 bits to the right of the binary point, which means the numbers are rounded to the nearest multiple of 1/512, giving a maximum representation error of 1/1024 = 0.0009765625 < 0.001.
With 24 significant bits in total, and 9 to the right of the binary point, that leaves 15 bits to the left of the binary point, which can represent all integers less than 2^15 = 32768, as you have experimentally determined.
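This can be checked directly with numpy's `spacing`, which returns the gap (ULP) between a value and its nearest representable neighbor. A small sketch of the argument above:

import numpy as np

# The float32 ULP just below 32768 is 2**-9 = 1/512, so the
# worst-case rounding error there is 1/1024 < 0.001.
assert np.spacing(np.float32(32767.0)) == 2.0 ** -9

# At 32768 the exponent increases and the ULP doubles to 1/256,
# so the worst-case rounding error (1/512) exceeds the 0.001 budget.
assert np.spacing(np.float32(32768.0)) == 2.0 ** -8

# Integers below 2**15 are exactly representable in float32.
for n in (1, 32767):
    assert float(np.float32(n)) == float(n)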
However, there are some numbers higher than this threshold that still have an error less than 0.001. As Eric Postpischil pointed out in his comment, all float64 values between 32768.0 and 32768.001 (the largest being exactly 32768 + 137438953/2^37), which the float32 conversion rounds down to exactly 32768.0, meet your accuracy requirement. And of course, any number that happens to be exactly representable in a float32 will have no representation error.
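That boundary case can be verified numerically; note that 2^-37 is the float64 ULP at 32768, so 32768 + 137438953/2^37 is exactly representable in float64:

import numpy as np

# Largest float64 below 32768.001.
edge = 32768.0 + 137438953 / 2.0 ** 37

# It rounds down to exactly 32768.0 in float32, so the error
# still satisfies the <= 0.001 criterion.
assert float(np.float32(edge)) == 32768.0
assert edge - 32768.0 <= 0.001

# The very next float64 also rounds to 32768.0, but its distance
# from 32768.0 already exceeds the 0.001 budget.
above = np.nextafter(edge, np.inf)
assert float(np.float32(above)) == 32768.0
assert above - 32768.0 > 0.001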