How to convert a packed integer (16.16) fixed-point value to a float?

Posted 2024-12-23 02:27:49

How to convert a "32-bit signed fixed-point number (16.16)" to a float?

Is (fixed >> 16) + (fixed & 0xffff) / 65536.0 ok? What about -2.5? And -0.5?

Or is fixed / 65536.0 the right way?

(PS: What does signed fixed-point "-0.5" look like in memory anyway?)

Comments (4)

秋日私语 2024-12-30 02:27:49

I assume two's complement 32-bit integers, with operators working as in C#.

How to do the conversion?

fixed / 65536.0

is correct and easy to understand.


(fixed >> 16) + (fixed & 0xffff) / 65536.0

is equivalent to the above for positive integers, but slower and harder to read. You're basically using the distributive law to separate a single division into two divisions, and writing the first one as a bit shift.

For negative integers, fixed & 0xffff doesn't give you the fractional bits, so it's not correct for negative numbers.

Look at the raw integer -1 which should map to -1/65536. This code returns 65535/65536 instead.


Depending on your compiler it might be faster to do:

fixed * (1/65536.0)

But I assume most modern compilers already do that optimization.
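
Whatever the speed difference turns out to be, the two forms should give identical results, because 1/65536.0 is 2^-16 and a double represents it exactly. A minimal Java sketch of that check (Java rather than C# is an assumption here, and the class and method names are made up for illustration):

```java
class ReciprocalCheck {
    // 1/65536.0 is 2^-16, which a double represents exactly.
    static final double INV_ONE = 1 / 65536.0;

    // Conversion by a single floating-point division.
    static double byDivision(int fixed) {
        return fixed / 65536.0;
    }

    // Conversion by multiplying with the precomputed reciprocal.
    static double byMultiplication(int fixed) {
        return fixed * INV_ONE;
    }

    public static void main(String[] args) {
        int[] samples = { 0, 1, -1, -32768, 98304, Integer.MIN_VALUE, Integer.MAX_VALUE };
        for (int raw : samples) {
            // Prints the raw value and whether both conversions agree.
            System.out.println(raw + ": " + (byDivision(raw) == byMultiplication(raw)));
        }
    }
}
```

This only compares results on a few sample values; it says nothing about which form is faster on a given compiler or runtime.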

What does signed fixed-point "-0.5" look like in memory anyway?

Inverting the conversion gives us:

RoundToInt(float*65536)

Setting float=-0.5 gives us: -32768.
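
To make the bit pattern concrete, here is a minimal Java sketch. Java is used as a stand-in because its int is a 32-bit two's complement type, matching the assumption at the top of this answer; the class FixedBits and its methods are hypothetical names, with Math.round playing the role of RoundToInt.

```java
class FixedBits {
    // Convert a double to 16.16 fixed point, rounding to the nearest raw integer.
    static int toFixed(double value) {
        return (int) Math.round(value * 65536.0);
    }

    // Convert a 16.16 fixed-point raw integer back to a double.
    static double toDouble(int fixed) {
        return fixed / 65536.0;
    }

    public static void main(String[] args) {
        int raw = toFixed(-0.5);
        System.out.println(raw);                       // -32768
        System.out.println(Integer.toHexString(raw));  // ffff8000
        System.out.println(toDouble(raw));             // -0.5
    }
}
```

So the 32-bit pattern is 0xFFFF8000: the upper 16 bits are 0xFFFF and the lower 16 bits are 0x8000. On a little-endian machine the bytes would sit in memory as 00 80 FF FF.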

如此安好 2024-12-30 02:27:49
class FixedPointUtils {
  public static final int ONE = 0x10000;

  /**
   * Convert an array of floats to 16.16 fixed-point
   * @param arr The array
   * @return A newly allocated array of fixed-point values.
   */
  public static int[] toFixed(float[] arr) {
    int[] res = new int[arr.length];
    toFixed(arr, res);
    return res;
  }

  /**
   * Convert a float to  16.16 fixed-point representation
   * @param val The value to convert
   * @return The resulting fixed-point representation
   */
  public static int toFixed(float val) {
    return (int)(val * 65536F);
  }

  /**
   * Convert an array of floats to 16.16 fixed-point
   * @param arr The array of floats
   * @param storage The location to store the fixed-point values.
   */
  public static void toFixed(float[] arr, int[] storage)
  {
    // iterate over the input, so a larger storage array does not cause an out-of-bounds read
    for (int i=0;i<arr.length;i++) {
      storage[i] = toFixed(arr[i]);
    }
  }

  /**
   * Convert a 16.16 fixed-point value to floating point
   * @param val The fixed-point value
   * @return The equivalent floating-point value.
   */
  public static float toFloat(int val) {
    return ((float)val)/65536.0f;
  }

  /**
   * Convert an array of 16.16 fixed-point values to floating point
   * @param arr The array to convert
   * @return A newly allocated array of floats.
   */
  public static float[] toFloat(int[] arr) {
    float[] res = new float[arr.length];
    toFloat(arr, res);
    return res;
  }

  /**
   * Convert an array of 16.16 fixed-point values to floating point
   * @param arr The array to convert
   * @param storage Pre-allocated storage for the result.
   */
  public static void toFloat(int[] arr, float[] storage)
  {
    // iterate over the input, so a larger storage array does not cause an out-of-bounds read
    for (int i=0;i<arr.length;i++) {
      storage[i] = toFloat(arr[i]);
    }
  }

}
倥絔 2024-12-30 02:27:49

After reading the answer by CodesInChaos, I wrote a C++ function template, which is very convenient. You can pass the length of the fractional part (for example, the BMP file format uses 2.30 fixed-point numbers). If the fractional part length is omitted, the function assumes that the fractional and integer parts have the same length.

#include <math.h> // for NaN
#include <limits.h> // for CHAR_BIT = 8

template<class T> inline double fixed_point2double(const T& x, int frac_digits = (CHAR_BIT * sizeof(T)) / 2 )
{
  if (frac_digits >= CHAR_BIT * sizeof(T)) return NAN;
  return double(x) / double( T(1) << frac_digits );
}

And if you want to read such a number from memory, I wrote this function template:

#include <math.h> // for NaN
#include <limits.h> // for CHAR_BIT = 8

template<class T> inline double read_little_endian_fixed_point(const unsigned char *x, int frac_digits = (CHAR_BIT * sizeof(T)) / 2)
// ! do not use for single byte types 'T'
{
  if (frac_digits >= CHAR_BIT * sizeof(T)) return NAN;

  T res = 0;

  for (int i = 0, shift = 0; i < sizeof(T); ++i, shift += CHAR_BIT)
    res |= ((T)x[i]) << shift;

  return double(res) / double( T(1) << frac_digits );
}
假扮的天使 2024-12-30 02:27:49

CodesInChaos was actually wrong in saying that

(fixed >> 16) + (fixed & 0xffff) / 65536.0

does not work.
If fixed is a 32-bit signed integer, then a negative number is stored as its magnitude subtracted from 0, or equivalently from 0x1_0000_0000, i.e. a 33-bit number. That's just how two's complement works. As long as >> is an arithmetic (sign-extending) shift, as it is for int in C# and Java, fixed >> 16 rounds down to the next smaller integer, and the low 16 bits are exactly what has to be added to that integer to read the correct value!

Thus (fixed >> 16) for the integer -1 produces -1, and adding (fixed & 0xffff) / 65536.0 = 65535/65536 to -1 produces the correct value -1/65536, because -65536/65536 + 65535/65536 = -1/65536.
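
The arithmetic above can be checked mechanically. The following is a minimal Java sketch, assuming Java's int (32-bit two's complement, with >> as an arithmetic shift) behaves like the C# int discussed in this thread; the class and method names are made up for illustration. It compares the split form against plain division on a few raw values, including -1, -0.5 (raw -32768) and -2.5 (raw -163840).

```java
class SplitVsDivide {
    // Direct conversion: one floating-point division.
    static double byDivision(int fixed) {
        return fixed / 65536.0;
    }

    // Split conversion: integer part via arithmetic shift, fraction via mask.
    static double bySplit(int fixed) {
        return (fixed >> 16) + (fixed & 0xffff) / 65536.0;
    }

    public static void main(String[] args) {
        int[] samples = { -1, -32768, -163840, 163840, Integer.MIN_VALUE, Integer.MAX_VALUE };
        for (int raw : samples) {
            // Prints the raw value followed by both conversion results.
            System.out.println(raw + ": " + byDivision(raw) + " vs " + bySplit(raw));
        }
    }
}
```

Both methods return the same double for every sample. The agreement depends on the sign-extending shift: with Java's unsigned shift >>> the split form would no longer match for negative inputs.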
