Converting from unsigned long long to float with round-to-nearest-even

Posted on 2024-10-07 07:43:40

I need to write a function that converts an unsigned long long to a float, and the rounding should be to nearest, with ties going to even.
I cannot just do a C++ type-cast, since AFAIK the standard does not specify the rounding.
I was thinking of using boost::numeric, but I could not find any useful lead after reading the documentation. Can this be done using that library?
Of course, if there is an alternative, I would be glad to use it.

Any help would be much appreciated.

EDIT: Adding an example to make things a bit clearer.
Suppose I want to convert 0xffffff7fffffffff to its floating-point representation. The C++ standard permits either one of:

  1. 0x5f7fffff ~ 1.9999999*2^63
  2. 0x5f800000 = 2^64

Now if you add the restriction of round to nearest even, only the first result is acceptable.
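
The two candidates in the example can be checked mechanically. The following is only a sketch, assuming a 32-bit IEEE 754 float and a 64-bit unsigned long long; the helper float_bits and the constant names are made up for illustration and do not come from any library.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helper: return the raw bit pattern of a float
// (assumes float is a 32-bit IEEE 754 binary32).
inline std::uint32_t float_bits(float f)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t), "expects a 32-bit float");
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

// The value from the example and its two neighbouring floats.
const unsigned long long example = 0xffffff7fffffffffull;
const float below = std::ldexp(16777215.0f, 40);  // (2^24 - 1) * 2^40
const float above = std::ldexp(1.0f, 64);         // 2^64

// Distances to each neighbour, computed exactly in 64-bit integers.
// 2^64 - example is the unsigned (wrap-around) negation of example.
const unsigned long long dist_below = example - 0xffffff0000000000ull;
const unsigned long long dist_above = 0ull - example;
```

Here dist_below is smaller than dist_above, so the value is not an exact tie and round-to-nearest alone already selects the lower candidate.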


Comments (3)

最美的太阳 2024-10-14 07:43:40

Since you have so many bits in the source that can't be represented in the float and you can't (apparently) rely on the language's conversion, you'll have to do it yourself.

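I devised a scheme that may or may not help you. Basically, a float has a 24-bit significand (FLT_MANT_DIG, counting the implicit leading bit), so I keep the 24 most significant bits of the source number. Then I save off and mask away all the lower bits. Based on the value of those lower bits I round the "new" LSB up or down, breaking exact ties toward an even LSB, and finally use static_cast to create a float. At that point the value fits in the significand, so the cast is exact.

I left in some couts that you can remove as desired.

#include <cfloat>    // FLT_MANT_DIG
#include <climits>   // CHAR_BIT
#include <cmath>     // std::ldexp
#include <iostream>

// Number of significand bits a float holds exactly (24 for IEEE 754 binary32).
const int mask_bit_count = FLT_MANT_DIG;
const int total_bits = sizeof(unsigned long long) * CHAR_BIT;

float ull_to_float2(unsigned long long val)
{
    // How many bits are needed? b is the index of the highest set bit (-1 for 0).
    int b = total_bits - 1;
    for(; b >= 0; --b)
    {
        if(val & (1ull << b))
        {
            break;
        }
    }

    std::cout << "Need " << (b + 1) << " bits." << std::endl;

    // If the value fits in the significand, the normal cast is exact.
    if(b < mask_bit_count)
    {
        return static_cast<float>(val);
    }

    // Weight of the new LSB and of half of it:
    const unsigned long long lsb = 1ull << (b + 1 - mask_bit_count);
    const unsigned long long half = lsb >> 1;

    // Save off the low-order bits that do not fit:
    unsigned long long low_bits = val & (lsb - 1);
    std::cout << "Saved low bits=" << low_bits << std::endl;

    std::cout << val << "->mask->";
    // Now mask away those low bits:
    val &= ~(lsb - 1);
    std::cout << val << std::endl;

    // Finally, decide how to round the new LSB:
    // up when above halfway, or exactly halfway with an odd LSB.
    if(low_bits > half || (low_bits == half && (val & lsb)))
    {
        std::cout << "Rounding up " << val;
        val += lsb;  // the carry may ripple into higher bits
        std::cout << " to " << val << std::endl;
        if(val == 0)
        {
            // Wrapped around: the correctly rounded result is 2^total_bits.
            return std::ldexp(1.0f, total_bits);
        }
    }
    // Otherwise round down: the masked value is already the answer.

    // val now has at most mask_bit_count significant bits, so this is exact.
    return static_cast<float>(val);
}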
把时间冻结 2024-10-14 07:43:40

I did this in Smalltalk for arbitrary-precision integers (LargeInteger), implemented and tested in Squeak/Pharo/Visualworks/Gnu Smalltalk/Dolphin Smalltalk, and even blogged about it, if you can read Smalltalk code: http://smallissimo.blogspot.fr/2011/09/clarifying-and-optimizing.html .
The trick for accelerating the algorithm is this: an IEEE 754 compliant FPU correctly rounds the result of an inexact operation. So we can afford one inexact operation and let the hardware round correctly for us. That lets us easily handle the first 48 bits. But we cannot afford two inexact operations, so we sometimes have to take care of the lowest bits differently...
I hope the code is documented enough:

#include <math.h>
#include <float.h>
float ull_to_float3(unsigned long long val)
{
    int prec=FLT_MANT_DIG ;             // 24 bits, the float precision
    unsigned long long high=val>>prec;  // the high bits above float precision
    unsigned long long mask=(1ull<<prec) - 1 ;      // 0xFFFFFFull a mask for extracting significant bits
    unsigned long long tmsk=(1ull<<(prec - 1)) - 1; // 0x7FFFFFull the same mask without the tie (top) bit
    // handle trivial cases, 48 bits or less,
    // let FPU apply correct rounding after exactly 1 inexact operation
    if( high <= mask )
        return ldexpf((float) high,prec) + (float) (val & mask);
    // more than 48 bits,
    // what scaling s is needed to isolate highest 48 bits of val?
    int s = 0;
    for( ; high > mask ; high >>= 1) ++s;
    // high now contains highest 24 bits
    float f_high = ldexpf( (float) high , prec + s );
    // store next 24 bits in mid
    unsigned long long mid = (val >> s) & mask;
    // take care of the rare case where trailing low bits can change the rounding:
    // are the mid bits exactly a tie (0x800000) or exactly zero?
    if( (mid & tmsk) == 0ull )
    {
        // if the low bits are zero, mid really is an exact tie or an exact zero;
        // otherwise increment mid so it no longer looks like either case
        unsigned long long low = val & ((1ull << s) - 1);
        if(low > 0ull) mid++;
    }
    return f_high + ldexpf( (float) mid , s );
}
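
To illustrate why two inexact operations are not affordable, here is a small sketch of the classic double-rounding trap: converting through double first and then to float. The names via_double, just_above_tie and correctly_rounded are made up for this example. It assumes IEEE 754 float and double and a round-to-nearest conversion, which the C++ standard does not guarantee.

```cpp
#include <cmath>

// Hypothetical helper: converts through double first, so it rounds twice.
inline float via_double(unsigned long long val)
{
    double d = static_cast<double>(val);  // first rounding: to 53 bits
    return static_cast<float>(d);         // second rounding: to 24 bits
}

// 2^63 + 2^39 + 1 lies just above the midpoint between the neighbouring
// floats 2^63 and 2^63 + 2^40, so the correct result is 2^63 + 2^40.
const unsigned long long just_above_tie = (1ull << 63) + (1ull << 39) + 1ull;

// Both terms and their sum are exactly representable as floats.
const float correctly_rounded = std::ldexp(1.0f, 63) + std::ldexp(1.0f, 40);
```

On a typical IEEE 754 platform the first conversion drops the trailing 1, which turns the value into an exact tie, and the second conversion then rounds that tie to even, down to 2^63. The result differs from correctly_rounded, which is why the code above nudges mid when the low bits are non-zero.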

Bonus: this code should round according to your FPU rounding mode, whatever it may be, since we implicitly use the FPU to perform the rounding in the + operation.
However, beware of aggressive optimizations in standards older than C99; who knows when the compiler will use extended precision (unless you force something like -ffloat-store).
If you always want to round to nearest even, whatever the current rounding mode, then you'll have to increment the high bits when:

  • mid bits > tie, where tie=1ull<<(prec-1);
  • mid bits == tie and (low bits > 0 or the high bits are odd).
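
The two conditions above can be sketched with integer logic only, so the result does not depend on the current FPU rounding mode. This is a hedged illustration, not the code from the blog post: the name ull_to_float_rne is made up, and the sketch assumes a 64-bit unsigned long long and a 24-bit float significand. Unlike ull_to_float3, it first normalizes the value so that the high part always holds exactly 24 significant bits.

```cpp
#include <cfloat>
#include <climits>
#include <cmath>

static_assert(FLT_MANT_DIG == 24, "sketch assumes a 24-bit float significand");
static_assert(sizeof(unsigned long long) * CHAR_BIT == 64,
              "sketch assumes a 64-bit unsigned long long");

// Hypothetical name: round to nearest, ties to even, using integer logic only.
inline float ull_to_float_rne(unsigned long long val)
{
    if (val == 0) return 0.0f;

    // Normalize: shift left until bit 63 is set, remembering the shift count.
    int lz = 0;
    while (!(val & (1ull << 63))) { val <<= 1; ++lz; }

    const unsigned long long mask = (1ull << 24) - 1;
    const unsigned long long tie  = 1ull << 23;
    unsigned long long high = val >> 40;           // top 24 bits
    unsigned long long mid  = (val >> 16) & mask;  // next 24 bits
    unsigned long long low  = val & 0xFFFFull;     // remaining 16 bits

    // Increment high when the discarded part is above the halfway point,
    // or exactly halfway while high is odd.
    if (mid > tie || (mid == tie && (low > 0 || (high & 1))))
        ++high;  // may become 2^24, which is still exactly representable

    // high fits in the significand, so the cast and the scaling are exact.
    return std::ldexp(static_cast<float>(high), 40 - lz);
}
```

For the value in the question, 0xffffff7fffffffff, mid is 0x7fffff, which is below tie, so high is left alone and the result is (2^24 - 1) * 2^40.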

EDIT:
If you stick to round-to-nearest-even tie breaking, then another solution is to use Shewchuk's EXPANSION-SUM of the non-adjacent parts (fhigh, flow) and (fmid); see http://www-2.cs.cmu.edu/afs/cs/project/quake/public/papers/robust-arithmetic.ps :

#include <math.h>
#include <float.h>
float ull_to_float4(unsigned long long val)
{
    int prec=FLT_MANT_DIG ;             // 24 bits, the float precision
    unsigned long long mask=(1ull<<prec) - 1 ; // 0xFFFFFFull a mask for extracting significant bits
    unsigned long long high=val>>(2*prec);     // the high bits
    unsigned long long mid=(val>>prec) & mask; // the mid bits
    unsigned long long low=val & mask;         // the low bits
    float fhigh = ldexpf((float) high,2*prec);
    float fmid  = ldexpf((float) mid,prec);
    float flow  = (float) low;
    float sum1 = fmid + flow;
    float residue1 = flow - (sum1 - fmid);
    float sum2 = fhigh + sum1;
    float residue2 = sum1 - (sum2 - fhigh);
    return (residue1 + residue2) + sum2;
}

This makes a branch-free algorithm with a few more operations. It may work with other rounding modes, but I will leave it to you to analyze the paper to make sure...
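
The sum/residue pairs in ull_to_float4 follow the Fast-Two-Sum pattern described in Shewchuk's paper. A minimal sketch of that building block on its own is shown below; the struct and function names are made up here. It assumes round-to-nearest float arithmetic without extended precision, and it requires |a| >= |b|.

```cpp
// Hypothetical names, for illustration only.
struct SumAndResidue
{
    float sum;      // a + b rounded to float
    float residue;  // the rounding error, so that sum + residue == a + b exactly
};

// Fast-Two-Sum: valid when |a| >= |b| and arithmetic is round-to-nearest.
inline SumAndResidue fast_two_sum(float a, float b)
{
    float sum = a + b;
    float residue = b - (sum - a);
    return {sum, residue};
}
```

For example, with a = 3 * 2^24 and b = 2^24 - 1 the exact sum 2^26 - 1 needs 26 bits, so sum rounds up to 2^26 and residue carries the missing -1.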

笑看君怀她人 2024-10-14 07:43:40

What is possible between 8-byte integers and the float format is straightforward to explain but less so to implement!

The next paragraph concerns what is representable in 8-byte signed integers.

All positive integers between 1 (2^0) and 16777215 (2^24-1) are exactly representable in IEEE 754 single precision (float). Or, to be precise, all numbers between 2^0 and 2^24-2^0 in increments of 2^0. The next range of exactly representable positive integers is 2^1 to 2^25-2^1 in increments of 2^1, and so on up to 2^39 to 2^63-2^39 in increments of 2^39.

Unsigned 8-byte integer values can be expressed up to 2^64-2^40 in increments of 2^40.

The single precision format doesn't stop here but goes on all the way up to the range 2^104 to 2^128-2^104 in increments of 2^104, whose upper end is the largest finite float.

For 4-byte integers (long) the highest float range is 2^7 to 2^31-2^7 in 2^7 increments.
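
The increments described above are the gaps between adjacent floats, and they can be queried with std::nextafter. The sketch below assumes an IEEE 754 binary32 float; the helper names gap_above and largest_float_below are invented for illustration.

```cpp
#include <cmath>

// Hypothetical helper: distance from f to the next larger float.
inline float gap_above(float f)
{
    return std::nextafter(f, INFINITY) - f;
}

// Hypothetical helper: the largest float strictly below f.
inline float largest_float_below(float f)
{
    return std::nextafter(f, -INFINITY);
}
```

With these, gap_above(2^24) is 2 and gap_above(2^63) is 2^40, while largest_float_below(2^64) is 2^64-2^40, the largest float that still fits in an unsigned 8-byte integer.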

On the x86 architecture the largest integer type supported by the floating-point instruction set is the 8-byte signed integer. 2^64-1 cannot be loaded by conventional means.

This means that, for a given range increment expressed as "2^i where i is an integer >0", all integers whose low i bits hold any pattern from 0x1 up to 2^i-1 will not be exactly representable within that range in a float.
It also means that what you call rounding upwards actually depends on the range you are working in. It is of no use to try to round up by 1 (2^0) or 16 (2^4) if the granularity of the range you are in is 2^19.

A further consequence of what you propose to do (rounding 2^63-1 up to 2^63) is that converting the result back could overflow the signed 8-byte integer format, for example: longlong_int=(long long) ((float) 2^63), with ^ denoting exponentiation.
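
One way to guard against that overflow is to check the range before converting back. The following is only a sketch, assuming a 64-bit long long; the function name float_to_ll_checked is hypothetical.

```cpp
#include <cmath>

// Hypothetical helper: convert only when f is inside [-2^63, 2^63).
// Returns false (and leaves out untouched) for NaN or out-of-range input.
inline bool float_to_ll_checked(float f, long long& out)
{
    const float two63 = std::ldexp(1.0f, 63);  // 2^63 is exactly representable

    // The negated test also rejects NaN, for which every comparison is false.
    if (!(f >= -two63 && f < two63))
        return false;

    out = static_cast<long long>(f);
    return true;
}
```

The upper bound is exclusive because 2^63 itself does not fit in a signed 8-byte integer, whereas -2^63 does.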

Check out this small program I wrote (in C) which should help illustrate what is possible and what isn't.

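#include <stdio.h>

int main (void)
{
  long long basel=1,baseh=16777215,src,dst,j;   /* 2^0 and 2^24-1 */
  float cnvl,cnvh,range;
  int i=0;

  while (i<40)
  {
    src=basel<<i;          /* lower bound of range i: 2^i */
    cnvl=(float) src;
    dst=(long long) cnvl;
    if (dst!=src) { printf("lower bound of range %d is not exact\n",i); return 1; }

    src=baseh<<i;          /* upper bound of range i: 2^(24+i)-2^i */
    cnvh=(float) src;
    dst=(long long) cnvh;
    if (dst!=src) { printf("upper bound of range %d is not exact\n",i); return 1; }

    /* every multiple of 2^i inside the range must survive the round trip */
    j=basel;
    while (j<=baseh)
    {
      src=j<<i;
      range=(float) src;
      dst=(long long) range;

      if (src!=dst) { printf("%lld is not exact\n",src); return 1; }

      j+=basel;
    }

    ++i;
  }
  printf("all %d ranges round-trip exactly\n",i);
  return 0;
}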

This program shows the representable integer value ranges. There is overlap between them: for example 2^5 is representable in all ranges with a lower boundary 2^b where 1=
