Floating point rounding when truncating

Posted 2024-07-14 14:13:14

This is probably a question for an x86 FPU expert:

I am trying to write a function which generates a random floating-point value in the range [min,max]. The problem is that my generator algorithm (the floating-point Mersenne Twister, if you're curious) only returns values in the range [1,2), i.e., I want an inclusive upper bound, but my "source" generated value comes from a range with an exclusive upper bound. The catch here is that the underlying generator returns an 8-byte double, but I only want a 4-byte float, and I am using the default FPU rounding mode of Nearest.

What I want to know is whether the truncation itself in this case will result in my return value being inclusive of max when the FPU internal 80-bit value is sufficiently close, or whether I should increment the significand of my max value before multiplying it by the intermediary random in [1,2), or whether I should change FPU modes. Or any other ideas, of course.

Here's the code I am currently using, and I did verify that 1.0f resolves to 0x3f800000:

float MersenneFloat( float min, float max )
{
    //genrand returns a double in [1,2)
    const float random = (float)genrand_close1_open2(); 
    //return in desired range
    return min + ( random - 1.0f ) * (max - min);
}

If it makes a difference, this needs to work on both Win32 MSVC++ and Linux gcc. Also, will using any versions of the SSE optimizations change the answer to this?

Edit: The answer is yes, truncation in this case from double to float is sufficient to cause the result to be inclusive of max. See Crashworks' answer for more.
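
The conclusion in the edit can be checked directly against the question's arithmetic. The following is a minimal sketch, not the poster's actual test: `worst_case_genrand` and `scale_to_range` are hypothetical names introduced here, the generator is replaced by a stub that always returns the largest double below 2.0, and the default round-to-nearest mode is assumed.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for genrand_close1_open2(): always returns the
// largest double strictly below 2.0, the worst case for the upper bound.
double worst_case_genrand() {
    return std::nextafter(2.0, 1.0);
}

// Same arithmetic as MersenneFloat, with the generator passed in so the
// worst case can be forced.
float scale_to_range(double (*gen)(), float min, float max) {
    assert(min <= max);
    // Under round-to-nearest, the double-to-float conversion rounds
    // 1.999... up to exactly 2.0f.
    const float random = static_cast<float>(gen());
    return min + (random - 1.0f) * (max - min);
}
```

With this stub, `scale_to_range(worst_case_genrand, 0.0f, 10.0f)` should return exactly `10.0f`, which is the inclusive upper bound the question asks about.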

抹茶夏天i‖ 2024-07-21 14:13:14

The SSE ops will subtly change the behavior of this algorithm because they don't have the intermediate 80-bit representation: the math truly is done in 32 or 64 bits. The good news is that you can easily test whether it changes your results by specifying the /arch:SSE2 command-line option to MSVC, which makes it use SSE scalar ops instead of x87 FPU instructions for ordinary floating-point math.
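
One portable way to see which evaluation model a given build uses is the standard `FLT_EVAL_METHOD` macro from `<cfloat>`. The sketch below is only an illustration; `eval_method_description` is a hypothetical helper name, and the value reported depends on the compiler and flags in use.

```cpp
#include <cfloat>
#include <string>

// Describes how the compiler evaluates floating-point expressions,
// based on the standard FLT_EVAL_METHOD macro.
std::string eval_method_description() {
    switch (FLT_EVAL_METHOD) {
        case 0:  return "each operation is evaluated in its own type (typical for SSE code)";
        case 1:  return "float and double operations are evaluated as double";
        case 2:  return "all operations are evaluated as long double (typical for x87 code)";
        default: return "indeterminate or implementation-defined";
    }
}
```

Printing this string from builds made with and without the SSE option is a quick way to confirm that the flag actually changed how intermediates are computed.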

I'm not sure offhand what the exact rounding behavior is around the integer boundaries, but you can test what happens when 1.999... gets rounded from 64 to 32 bits with, e.g.:

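#include <stdint.h>
#include <string.h>
// Largest double below 2.0: biased exponent 0x3FF (unbiased 0), all 52 mantissa bits set
static const uint64_t OnePointNineRepeating = 0x3FFFFFFFFFFFFFFFULL;
double asDouble;
memcpy(&asDouble, &OnePointNineRepeating, sizeof asDouble); // bit copy instead of a pointer cast
float asFloat = (float)asDouble;
return asFloat;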

Edit, result: the original poster ran this test and found that, with the double-to-float truncation, 1.999... rounds up to 2 both with and without /arch:SSE2.
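
To see where the rounding flips, note that floats in [1,2) are spaced 2^-23 apart, so the midpoint between the largest float below 2 and 2.0 itself is 2 - 2^-24. The sketch below assumes the default round-to-nearest-even mode; the helper names are hypothetical.

```cpp
#include <cmath>

// True if converting d to float gives exactly 2.0f.
bool rounds_up_to_two(double d) {
    return static_cast<float>(d) == 2.0f;
}

// Midpoint between the largest float below 2 (2 - 2^-23) and 2.0.
// This value is exactly representable as a double.
double midpoint_below_two() {
    return 2.0 - std::ldexp(1.0, -24);
}
```

Under that assumption, every double from `midpoint_below_two()` up to the largest double below 2.0 converts to `2.0f` (the exact midpoint is a tie, which goes to the neighbor with an even mantissa, 2.0), while the next double below the midpoint converts to 2 - 2^-23.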

猫腻 2024-07-21 14:13:14

If you do adjust the rounding to include both ends of the range, won't those extreme values be only half as likely as any non-extreme value?
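
The arithmetic behind this concern can be sketched as follows, assuming round-to-nearest and a source value uniformly distributed in [1,2). An interior float collects values from half a spacing on each side, while 1.0f and 2.0f can only collect values from one side. `rounding_interval_width` is a hypothetical helper written for this illustration.

```cpp
#include <cassert>
#include <cmath>

// Width of the part of [1,2) that rounds to the float f, assuming f is a
// float in [1,2], where adjacent floats are 2^-23 apart.
double rounding_interval_width(float f) {
    assert(f >= 1.0f && f <= 2.0f);
    const double half_ulp = std::ldexp(1.0, -24);
    const double lo = std::fmax(1.0, static_cast<double>(f) - half_ulp);
    const double hi = std::fmin(2.0, static_cast<double>(f) + half_ulp);
    return hi - lo;
}
```

For `1.5f` the width is 2^-23, but for `1.0f` and `2.0f` it is 2^-24, so under these assumptions each endpoint is about half as likely as an interior value.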

平定天下 2024-07-21 14:13:14

With truncation, you are never going to be inclusive of the max.

Are you sure you really need the max? There is almost zero chance that you will land on exactly the maximum.

That said, you can exploit the fact that you are giving up precision and do something like this:

float MersenneFloat( float min, float max )
{
    double random = 100000.0; // dummy value that forces the first iteration
    while ((float)random > 65535.0f)
    {
        //genrand returns a double in [1,2)
        random = genrand_close1_open2() - 1.0; // now it's [0,1); assign to the outer variable rather than shadowing it
        random *= 65536.0; // now it's [0,65536). We try again if it's > 65535.0
    }
    //return in desired range
    return min + float(random/65535.0) * (max - min);
}

Note that each call to MersenneFloat now has a slight chance of calling genrand more than once, so you have traded some potential performance for a closed interval. Since you are downcasting from double to float anyway, you end up sacrificing no precision.
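
The retry behavior can be observed with a scripted generator. This is only a sketch of the rejection loop described above: `scripted_genrand`, `kScript`, `g_calls`, and `RetryFloat` are hypothetical names, and the script is chosen so that the first draw lands in the rejected sliver near the top of the range.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical scripted stand-in for genrand_close1_open2(): the first
// value scales to just under 65536 and is rejected, the second is accepted.
static const double kScript[] = {1.99999999, 1.5};
static std::size_t g_calls = 0;
double scripted_genrand() { return kScript[g_calls++ % 2]; }

float RetryFloat(float min, float max) {
    assert(min <= max);
    double random;
    do {
        random = (scripted_genrand() - 1.0) * 65536.0;  // [0,65536)
    } while (static_cast<float>(random) > 65535.0f);     // retry on the top sliver
    return min + static_cast<float>(random / 65535.0) * (max - min);
}
```

With this script a single call to `RetryFloat` consumes two generator values. With a real uniform generator the retry should be rare, roughly one draw in 65536.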

Edit: improved algorithm
