c floating-point bit-manipulation ieee-754

Converting a double to float without relying on the FPU rounding mode

Posted on 2024-08-17 03:10:53

Does anyone have a handy snippet of code to convert an IEEE 754 double to the immediately inferior (resp. superior) float, without changing or assuming anything about the FPU's current rounding mode?

Note: this constraint probably implies not using the FPU at all. I expect the simplest way to do it in these conditions is to read the bits of the double in a 64-bit long and to work with that.

You can assume the endianness of your choice for simplicity, and that the double in question is available through the d field of the union below:

union double_bits
{
  long i;
  double d;
};

I would try to do it myself but I am certain I would introduce hard-to-notice bugs for denormalized or negative numbers.

绝影如岚 2024-08-24 03:10:53

I think the following works, but I will state my assumptions first:

Floating-point numbers are stored in IEEE-754 format on your implementation.
There is no overflow.
nextafterf() is available (it is specified in C99).

Also, most likely, this method is not very efficient.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(int argc, char *argv[])
{
    /* Change to non-zero for superior, otherwise inferior */
    int superior = 0;

    /* double value to convert */
    double d = 0.1;

    float f;
    double tmp = d;

    if (argc > 1)
        d = strtod(argv[1], NULL);

    /* First, get an approximation of the double value */
    f = d;

    /* Now, convert that back to double */
    tmp = f;

    /* Print the numbers. %a is C99 */
    printf("Double: %.20f (%a)\n", d, d);
    printf("Float: %.20f (%a)\n", f, f);
    printf("tmp: %.20f (%a)\n", tmp, tmp);

    if (superior) {
        /* If we wanted superior, and got a smaller value,
           get the next value */
        if (tmp < d)
            f = nextafterf(f, INFINITY);
    } else {
        if (tmp > d)
            f = nextafterf(f, -INFINITY);
    }
    printf("converted: %.20f (%a)\n", f, f);

    return 0;
}

On my machine, it prints:

Double: 0.10000000000000000555 (0x1.999999999999ap-4)
Float: 0.10000000149011611938 (0x1.99999ap-4)
tmp: 0.10000000149011611938 (0x1.99999ap-4)
converted: 0.09999999403953552246 (0x1.999998p-4)

The idea is to convert the double value to a float value first; the result may be less than or greater than the double, depending on the rounding mode. Converting it back to double lets us check whether it is smaller or greater than the original value. If the float landed on the wrong side of the original, we take the next float from the converted number in the direction of the original number.

有木有妳兜一样 2024-08-24 03:10:53

To do this more accurately than by just recombining the mantissa and exponent bits, take a look at this:

http://www.mathworks.com/matlabcentral/fileexchange/23173

Regards

终遇你 2024-08-24 03:10:53

I posted code to do this here: https://stackoverflow.com/q/19644895/364818 and copied it below for your convenience.

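    #include <string.h> // memcpy

    // d points to an IEEE binary64 value, but double is not natively supported.
    // The mantissa is truncated (rounded toward zero). The result is only
    // correct when it is a normal float: zero, denormals, overflow,
    // infinities and NaN are not handled.
    static float ConvertDoubleToFloat(void* d)
    {
        unsigned long long x;
        unsigned int bits; // assumed to be 32 bits wide
        float f; // assumed to be IEEE float
        unsigned long long sign;
        unsigned long long exponent;
        unsigned long long mantissa;

        memcpy(&x,d,8);

        // IEEE binary64 format (unsupported)
        sign     = (x >> 63) & 1; // 1 bit
        exponent = ((x >> 52) & 0x7FF); // 11 bits, biased by 1023
        mantissa = (x >> 0) & 0x000FFFFFFFFFFFFFULL; // 52 bits
        exponent -= 1023; // may wrap around; the mask below undoes that

        // IEEE binary32 format (supported)
        exponent += 127; // rebase
        exponent &= 0xFF;
        mantissa >>= (52-23); // keep the top 23 bits, drop the low 29

        // Copy from a 32-bit variable so the result does not depend on
        // which half of x the first four bytes hold.
        bits = (unsigned int)(mantissa | (exponent << 23) | (sign << 31));
        memcpy(&f,&bits,4);

        return f;
    }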
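
The function above always truncates. A directed version, as asked for in the question, could look like the sketch below. The names double_bits_to_float_bits and double_to_float_directed are hypothetical, and the code assumes that uint64_t and uint32_t exist and that integers and floating-point values share the same byte order. It uses integer operations only, so the FPU rounding mode is never consulted.

```c
#include <stdint.h>
#include <string.h>

/* Convert a binary64 bit pattern to the binary32 bit pattern of the nearest
   float in one direction: toward -infinity if up == 0, else toward +infinity.
   NaN inputs map to a quiet NaN. */
static uint32_t double_bits_to_float_bits(uint64_t x, int up)
{
    uint32_t sign = (uint32_t)(x >> 63);
    int      exp  = (int)((x >> 52) & 0x7FF);    /* biased by 1023 */
    uint64_t man  = x & 0x000FFFFFFFFFFFFFULL;   /* 52 fraction bits */
    uint32_t mag;                                /* result without the sign */
    int      inexact;                            /* were nonzero bits dropped? */
    /* The magnitude grows when rounding a positive up or a negative down. */
    int      away = (up != 0) != (sign != 0);

    if (exp == 0x7FF)                            /* infinity or NaN */
        return (sign << 31) | (man ? 0x7FC00000u : 0x7F800000u);

    if (exp == 0) {                              /* zero or double denormal */
        mag = 0;
        inexact = (man != 0);
    } else if (exp - 896 >= 255) {               /* too large for a float */
        mag = 0x7F7FFFFFu;                       /* FLT_MAX */
        inexact = 1;
    } else if (exp - 896 >= 1) {                 /* normal float */
        mag = ((uint32_t)(exp - 896) << 23) | (uint32_t)(man >> 29);
        inexact = (man & 0x1FFFFFFFu) != 0;
    } else {                                     /* float denormal or underflow */
        uint64_t sig = man | (1ULL << 52);       /* restore the implicit bit */
        int shift = 926 - exp;                   /* at least 30 here */
        if (shift > 52) {                        /* below the smallest denormal */
            mag = 0;
            inexact = 1;
        } else {
            mag = (uint32_t)(sig >> shift);
            inexact = (sig & ((1ULL << shift) - 1)) != 0;
        }
    }

    /* Adding 1 gives the next float away from zero; the carry turns the
       largest denormal into the smallest normal and FLT_MAX into infinity. */
    if (inexact && away)
        mag += 1;
    return (sign << 31) | mag;
}

/* Convenience wrapper: only copies bytes, no floating-point arithmetic. */
static float double_to_float_directed(double d, int up)
{
    uint64_t x;
    uint32_t y;
    float f;
    memcpy(&x, &d, sizeof x);
    y = double_bits_to_float_bits(x, up);
    memcpy(&f, &y, sizeof f);
    return f;
}
```

With this sketch, double_to_float_directed(0.1, 0) yields 0x1.999998p-4 and double_to_float_directed(0.1, 1) yields 0x1.99999ap-4. Working on the magnitude and the sign separately is what keeps negative numbers simple: rounding a negative value down is the same as rounding its magnitude away from zero.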
