Packing 32-bit floats into 30 bits (C++)
Here are the goals I'm trying to achieve:
- I need to pack 32-bit IEEE floats into 30 bits.
- I want to do this by reducing the size of the mantissa by 2 bits.
- The operation itself should be as fast as possible.
- I'm aware that some precision will be lost, and this is acceptable.
- It would be an advantage if this operation did not ruin special cases like SNaN, QNaN, infinities, etc., but I'm ready to sacrifice this for speed.
I guess this question consists of two parts:
1) Can I simply clear the least significant bits of the mantissa? I've tried this, and so far it works, but maybe I'm asking for trouble... Something like:
float f;
int packed = (*(int*)&f) & ~3;
// later
f = *(float*)&packed;
2) If there are cases where 1) will fail, then what would be the fastest way to achieve this?
Thanks in advance
Comments (5)
These reinterpreting casts actually violate the strict aliasing rules (section 3.10 of the C++ standard). This will probably blow up in your face when you turn on compiler optimizations.
Section 3.10 paragraph 15 of the C++ standard lists the types of lvalue through which an object's stored value may be accessed; access through any other type is undefined behavior.
Specifically, 3.10/15 doesn't allow us to access a float object via an lvalue of type unsigned int. I actually got bitten by this myself. A program I wrote stopped working after I turned on optimizations. Apparently, GCC didn't expect an lvalue of type float to alias an lvalue of type int, which is a fair assumption under 3.10/15. The optimizer, exploiting 3.10/15 under the as-if rule, shuffled the instructions around, and the program stopped working.
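Assuming that float and unsigned int are both 32 bits wide, that they share the same byte order, and that float uses the IEEE-754 encoding, you should be able to do it by copying the float's bits into an unsigned int with memcpy and shifting them right by two (and doing the reverse to unpack).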
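A minimal sketch of a memcpy-based pack/unpack pair, assuming float is a 32-bit IEEE-754 value and std::uint32_t has the same byte order. The names pack_float and unpack_float are illustrative, not taken from the answer:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits wide");

// Copy the float's bits into an integer and drop the two lowest mantissa bits.
// The result always fits in 30 bits.
std::uint32_t pack_float(float f) {
    std::uint32_t r;
    std::memcpy(&r, &f, sizeof r);  // well-defined, unlike *(int*)&f
    return r >> 2;
}

// Rebuild a float from its 30-bit packed form; the dropped bits come back as zero.
float unpack_float(std::uint32_t p) {
    assert(p < (1u << 30));  // only the low 30 bits may be set
    p <<= 2;
    float f;
    std::memcpy(&f, &p, sizeof f);
    return f;
}
```

Values whose two lowest mantissa bits are already zero, such as 1.0f or -2.5f, round-trip exactly; everything else is truncated toward zero by at most 3 units in the last place.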
This doesn't suffer from the "3.10-violation" and is typically very fast. At least GCC treats memcpy as an intrinsic function. In case you don't need the functions to work with NaNs, infinities or numbers with extremely high magnitude you can even improve accuracy by replacing "r >> 2" with "(r+1) >> 2":
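A sketch of that rounding variant, under the same assumptions (32-bit IEEE-754 float, matching byte order). The helper names are hypothetical:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits wide");

std::uint32_t bits_from_float(float f) {
    std::uint32_t r;
    std::memcpy(&r, &f, sizeof r);
    return r;
}

float float_from_bits(std::uint32_t bits) {
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Adding 1 before the shift rounds up when the two dropped bits are 11,
// which lowers the worst-case error from 3 to 2 units in the last place.
// As noted in the text, this is not meant for NaNs, infinities, or values
// near the top of the float range, where the carry can corrupt the result.
std::uint32_t pack_float_rounded(float f) {
    return (bits_from_float(f) + 1) >> 2;
}

float unpack_float(std::uint32_t p) {
    assert(p < (1u << 30));  // only the low 30 bits may be set
    return float_from_bits(p << 2);
}
```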
This works even if it changes the exponent due to a mantissa overflow because the IEEE-754 coding maps consecutive floating point values to consecutive integers (ignoring +/- zero). This mapping actually approximates a logarithm quite well.
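The consecutive-integer property can be checked directly. The sketch below assumes a 32-bit IEEE-754 float; next_up_via_bits is a hypothetical helper that steps to the next float by incrementing the bit pattern:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits wide");

std::uint32_t bits_from_float(float f) {
    std::uint32_t r;
    std::memcpy(&r, &f, sizeof r);
    return r;
}

float float_from_bits(std::uint32_t bits) {
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// For a non-negative finite float, incrementing the bit pattern gives the
// next larger float, even when the mantissa overflows into the exponent.
float next_up_via_bits(float f) {
    assert(std::isfinite(f) && !std::signbit(f));
    return float_from_bits(bits_from_float(f) + 1);
}
```

For instance, the bit pattern 0x3FFFFFFF is the largest float below 2.0f; incrementing it carries out of the mantissa and yields 0x40000000, which is exactly 2.0f.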
Blindly dropping the 2 LSBs of the float may fail for a small number of unusual NaN encodings.
A NaN is encoded as exponent=255, mantissa!=0, but IEEE-754 doesn't say anything about which mantissa values should be used. If the mantissa value is <= 3, you could turn a NaN into an infinity!
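If preserving NaNs matters, one possible guard is sketched below. It assumes a 32-bit IEEE-754 float, costs one extra test per value, and uses hypothetical function names:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits wide");

float float_from_bits(std::uint32_t bits) {
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Truncating pack that keeps every NaN a NaN.
std::uint32_t pack_float_keep_nan(float f) {
    std::uint32_t r;
    std::memcpy(&r, &f, sizeof r);
    const bool is_nan = (r & 0x7F800000u) == 0x7F800000u   // exponent == 255
                     && (r & 0x007FFFFFu) != 0;            // mantissa != 0
    std::uint32_t packed = r >> 2;
    // In the packed layout the mantissa occupies the low 21 bits.
    if (is_nan && (packed & 0x001FFFFFu) == 0) {
        packed |= 0x00100000u;  // set the top mantissa bit so the value stays a NaN
    }
    return packed;
}

float unpack_float(std::uint32_t p) {
    assert(p < (1u << 30));  // only the low 30 bits may be set
    return float_from_bits(p << 2);
}
```

With this guard, the bit pattern 0x7F800001 (a NaN with mantissa 1) unpacks to a NaN, whereas plain truncation would turn it into 0x7F800000, which is +infinity.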
You should encapsulate it in a struct, so that you don't accidentally mix the usage of the tagged float with regular "unsigned int":
I can't guarantee its portability though.
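One way such a wrapper might look is sketched here. It assumes a 32-bit IEEE-754 float and that the two freed bits carry a small tag; the name TaggedFloat and its interface are hypothetical:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits wide");

// A 30-bit float and a 2-bit tag stored in one 32-bit word.
// Being a distinct type, it cannot be confused with a plain unsigned int.
struct TaggedFloat {
    TaggedFloat(float f, unsigned tag) {
        assert(tag < 4);  // the tag must fit in two bits
        std::uint32_t r;
        std::memcpy(&r, &f, sizeof r);
        bits_ = (r & ~3u) | (tag & 3u);  // tag replaces the two lowest mantissa bits
    }

    // The stored float with the two lowest mantissa bits cleared.
    float value() const {
        const std::uint32_t r = bits_ & ~3u;
        float f;
        std::memcpy(&f, &r, sizeof f);
        return f;
    }

    unsigned tag() const { return bits_ & 3u; }

private:
    std::uint32_t bits_;
};
```

For example, TaggedFloat(1.5f, 2) reports value() == 1.5f and tag() == 2 while occupying the same four bytes as a float.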
I can't select any of the answers as the definitive one, because most of them contain valid information but are not quite what I was looking for. So I'll just summarize my conclusions.
The conversion method I posted in part 1) of my question is clearly wrong according to the C++ standard, so other methods of extracting the float's bits should be used.
And most important: as far as I understand from reading the responses and other sources about IEEE-754 floats, it's OK to drop the least significant bits from the mantissa. It will mostly affect only precision, with one exception: sNaN. Since a NaN is represented by an exponent of 255 and a mantissa != 0, an sNaN can have a mantissa <= 3, and dropping the last two bits would then convert it to +/-Infinity. But since sNaNs are not generated by floating-point operations on the CPU, it's safe in a controlled environment.
How much precision do you need? If a 16-bit float is enough (sufficient for some types of graphics), then ILM's 16-bit float ("half"), part of OpenEXR, is great and obeys all kinds of rules (http://www.openexr.com/), and you'll have plenty of space left over after you pack it into a struct.
On the other hand, if you know the approximate range of values they're going to take, you should consider fixed point. Fixed-point numbers are more useful than most people realize.
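As a rough illustration of the fixed-point idea, the sketch below maps a value from a known range [lo, hi] onto a 30-bit unsigned code. The function names and the choice of a linear mapping are assumptions made for this example, not something the answer specifies:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Largest 30-bit code.
const std::uint32_t kMaxCode = (1u << 30) - 1;

// Quantize x from [lo, hi] to a 30-bit code; out-of-range inputs are clamped.
// NaN inputs are not handled in this sketch.
std::uint32_t to_fixed30(float x, float lo, float hi) {
    assert(lo < hi);
    const double range = static_cast<double>(hi) - lo;
    const double clamped = std::min(std::max(static_cast<double>(x),
                                             static_cast<double>(lo)),
                                    static_cast<double>(hi));
    return static_cast<std::uint32_t>(std::llround((clamped - lo) / range * kMaxCode));
}

// Map a 30-bit code back to a float in [lo, hi].
float from_fixed30(std::uint32_t q, float lo, float hi) {
    assert(lo < hi && q <= kMaxCode);
    const double range = static_cast<double>(hi) - lo;
    return static_cast<float>(lo + range * q / kMaxCode);
}
```

Unlike a truncated float, this gives uniform absolute precision across the whole range: the step size is (hi - lo) / (2^30 - 1) everywhere.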