Packing 32-bit floats into 30 bits (C++)

Posted 2024-09-26 00:15:07

Here are the goals I'm trying to achieve:

  • I need to pack 32 bit IEEE floats into 30 bits.
  • I want to do this by decreasing the size of mantissa by 2 bits.
  • The operation itself should be as fast as possible.
  • I'm aware that some precision will be lost, and this is acceptable.
  • It would be an advantage, if this operation would not ruin special cases like SNaN, QNaN, infinities, etc. But I'm ready to sacrifice this over speed.

I guess this question consists of two parts:

1) Can I just simply clear the least significant bits of mantissa? I've tried this, and so far it works, but maybe I'm asking for trouble... Something like:

float f;
int packed = (*(int*)&f) & ~3;
// later
f = *(float*)&packed;

2) If there are cases where 1) will fail, then what would be the fastest way to achieve this?

Thanks in advance

Comments (5)

北方的巷 2024-10-03 00:15:07

You actually violate the strict aliasing rules (section 3.10 of the C++ standard) with these reinterpret casts. This will probably blow up in your face when you turn on the compiler optimizations.

C++ standard, section 3.10 paragraph 15 says:

If a program attempts to access the stored value of an object through an lvalue of other than one of the following types the behavior is undefined

  • the dynamic type of the object,
  • a cv-qualified version of the dynamic type of the object,
  • a type similar to the dynamic type of the object,
  • a type that is the signed or unsigned type corresponding to the dynamic type of the object,
  • a type that is the signed or unsigned type corresponding to a cv-qualified version of the dynamic type of the object,
  • an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union),
  • a type that is a (possibly cv-qualified) base class type of the dynamic type of the object,
  • a char or unsigned char type.

Specifically, 3.10/15 doesn't allow us to access a float object via an lvalue of type unsigned int. I actually got bitten by this myself: a program I wrote stopped working once I turned on optimizations. Apparently GCC did not expect an lvalue of type float to alias an lvalue of type int, which is a fair assumption under 3.10/15. Relying on that, the optimizer reordered the instructions under the as-if rule, and the program stopped working.

Under the following assumptions

  • float really corresponds to a 32-bit IEEE float,
  • sizeof(float)==sizeof(int),
  • unsigned int has no padding bits or trap representations,

you should be able to do it like this:

/// returns a 30 bit number
unsigned int pack_float(float x) {
    unsigned r;
    std::memcpy(&r,&x,sizeof r);
    return r >> 2;
}

float unpack_float(unsigned int x) {
    x <<= 2;
    float r;
    std::memcpy(&r,&x,sizeof r);
    return r;
}

This doesn't suffer from the "3.10 violation" and is typically very fast; GCC, at least, treats memcpy as an intrinsic function. If you don't need the functions to work with NaNs, infinities, or numbers of extremely high magnitude, you can even improve accuracy by replacing "r >> 2" with "(r+1) >> 2":

unsigned int pack_float(float x) {
    unsigned r;
    std::memcpy(&r,&x,sizeof r);
    return (r+1) >> 2;
}

This works even if it changes the exponent due to a mantissa overflow because the IEEE-754 coding maps consecutive floating point values to consecutive integers (ignoring +/- zero). This mapping actually approximates a logarithm quite well.
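
As a quick check of the carry behaviour described above, here is a minimal sketch. It assumes float is a 32-bit IEEE-754 type and uses std::uint32_t instead of unsigned int; pack_float_rounded and float_from_bits are names made up for this example, not part of the answer's code.

```cpp
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes a 32-bit float");

// Rounding variant from above: not safe for NaNs, infinities, or values
// close to the largest finite float.
std::uint32_t pack_float_rounded(float x) {
    std::uint32_t r;
    std::memcpy(&r, &x, sizeof r);
    return (r + 1) >> 2;
}

float unpack_float(std::uint32_t x) {
    x <<= 2;
    float r;
    std::memcpy(&r, &x, sizeof r);
    return r;
}

// Helper for the example: build a float from a raw bit pattern.
float float_from_bits(std::uint32_t bits) {
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}
```

The bit pattern 0x3FFFFFFF is the largest float below 2.0. Adding 1 turns it into 0x40000000, so the packed value is 0x10000000 and unpack_float returns exactly 2.0f: the mantissa overflow simply carries into the exponent. Note that "+1" only rounds up when both dropped bits are set; adding 2 instead would round to the nearest representable value, under the same restrictions on NaNs, infinities, and very large magnitudes.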

剪不断理还乱 2024-10-03 00:15:07

Blindly dropping the 2 LSBs of the float may fail for a small number of unusual NaN encodings.

A NaN is encoded as exponent=255, mantissa!=0, but IEEE-754 doesn't say anything about which mantissa values should be used. If the mantissa value is <= 3, you could turn a NaN into an infinity!
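
If keeping NaNs intact matters more than the last bit of speed, one possible workaround is to force a non-zero mantissa whenever truncation would erase it. The sketch below works directly on the raw 32-bit pattern (obtained with memcpy as in the other answer); is_nan_bits, pack_bits_keep_nan and unpack_bits are hypothetical names, and the NaN payload is not preserved.

```cpp
#include <cstdint>

// Returns true if the 32-bit pattern encodes a NaN:
// exponent field all ones and a non-zero mantissa.
bool is_nan_bits(std::uint32_t bits) {
    return (bits & 0x7F800000u) == 0x7F800000u && (bits & 0x007FFFFFu) != 0;
}

// Truncating pack that keeps NaNs as NaNs (hypothetical helper).
std::uint32_t pack_bits_keep_nan(std::uint32_t bits) {
    std::uint32_t packed = bits >> 2;
    // 0x001FFFFF covers the 21 mantissa bits that survive the shift.
    if (is_nan_bits(bits) && (packed & 0x001FFFFFu) == 0) {
        packed |= 1u;  // force a non-zero mantissa so the value stays a NaN
    }
    return packed;
}

// Restores the 32-bit pattern; the two dropped bits come back as zero.
std::uint32_t unpack_bits(std::uint32_t packed) {
    return packed << 2;
}
```

For example, 0x7F800001 (a NaN with mantissa 1) becomes 0x7F800000, positive infinity, under plain truncation, but comes back as 0x7F800004 with this variant. The price is an extra comparison on every pack.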

隱形的亼 2024-10-03 00:15:07

You should encapsulate it in a struct, so that you don't accidentally mix the usage of the tagged float with regular "unsigned int":

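#include <cstring>
#include <iostream>
using namespace std;

struct TypedFloat {
    private:
        // 30 bits of float payload plus a 2-bit tag; the exact bit-field
        // layout is implementation-defined.
        unsigned int num  : 30;
        unsigned int type : 2;
    public:

        TypedFloat(unsigned int type=0) : num(0), type(type) {}

        operator float() const {
            // Cast first: a 30-bit bit-field promotes to int, and shifting
            // that left by 2 could overflow.
            unsigned int tmp = static_cast<unsigned int>(num) << 2;
            float f;
            memcpy(&f, &tmp, sizeof f);  // memcpy instead of reinterpret_cast avoids the aliasing violation
            return f;
        }
        void operator=(float newnum) {
            unsigned int bits;
            memcpy(&bits, &newnum, sizeof bits);
            num = bits >> 2;  // drop the two least significant mantissa bits
        }
        unsigned int getType() const {
            return type;
        }
        void setType(unsigned int type) {
            this->type = type;
        }
};

int main() {
    const unsigned int TYPE_A = 1;
    TypedFloat a(TYPE_A);

    a = 3.4;
    cout << a + 5.4 << endl;
    float b = a;
    cout << a << endl;
    cout << b << endl;
    cout << a.getType() << endl;
    return 0;
}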

I can't guarantee its portability though.

菩提树下叶撕阳。 2024-10-03 00:15:07

I can't select any of the answers as the definite one, because most of them have valid information, but not quite what I was looking for. So I'll just summarize my conclusions.

The conversion method I posted in part 1) of my question is clearly wrong according to the C++ standard, so other methods should be used to extract a float's bits.

And most important: as far as I understand from the responses and other sources on IEEE 754 floats, it's OK to drop the least significant bits of the mantissa. It mostly affects only precision, with one exception: sNaN. A NaN is represented by an exponent of 255 and a mantissa != 0, and on most hardware a quiet NaN has its top mantissa bit set, so only an sNaN can have a mantissa <= 3; dropping the last two bits would then convert it to +/-Infinity. But since sNaNs are not generated by floating-point operations on the CPU, it's safe in a controlled environment.

流绪微梦 2024-10-03 00:15:07

How much precision do you need? If a 16-bit float is enough (it is sufficient for some types of graphics), then ILM's 16-bit float ("half"), part of OpenEXR (http://www.openexr.com/), is great: it obeys all kinds of rules, and you'll have plenty of space left over after you pack it into a struct.

On the other hand, if you know the approximate range of values they're going to take, you should consider fixed point. They're more useful than most people realize.
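
To illustrate the fixed-point idea, here is a minimal sketch that maps a known range onto 30-bit codes. The range [-1000, 1000] and the names kLo, kHi, kMaxCode, to_fixed30 and from_fixed30 are assumptions made up for this example; NaN inputs are not handled.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical range; pick bounds that match your data.
const double kLo = -1000.0;
const double kHi = 1000.0;
const std::uint32_t kMaxCode = (1u << 30) - 1;  // largest 30-bit value

// Quantize x to a 30-bit code, clamping values outside [kLo, kHi].
std::uint32_t to_fixed30(double x) {
    if (x < kLo) x = kLo;
    if (x > kHi) x = kHi;
    double scaled = (x - kLo) / (kHi - kLo) * kMaxCode;
    return static_cast<std::uint32_t>(std::lround(scaled));
}

// Map a 30-bit code back to a value in [kLo, kHi].
double from_fixed30(std::uint32_t code) {
    return kLo + (kHi - kLo) * code / kMaxCode;
}
```

With this range the step between adjacent codes is about 1.9e-6 everywhere, whereas a truncated float gets coarser as the magnitude grows. Whether that trade-off is better depends on how the values are distributed.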
