C# floating-point directx-11 compute-shader

Double division using two floats?

Posted 2024-10-01 03:09:48

I would like to do a double division using two floats (it appears that Direct Compute does not support double division).

Is that possible?

This is what I have tried so far (C# code; it should become HLSL later):

int count = 7;
double value = 0.0073812398871474;
float f1 = (float)value;
float f2 = (float)((value - f1));
float r1 = f1 / count;
float r2 = f2 / count;
double result = (double)r1 + (double)r2;

0,00105446285765182 (result)

0,00105446284102106 (correct result)

It has to do with the rounding in f1. If value is instead:

 double value = 0.0073812344471474;

Then the result is correct.

剑心龙吟 2024-10-08 03:09:48

Calculate the reciprocal of count with float division, then improve its precision to full double using the Newton-Raphson reciprocal formula.

int count = 7;
double value = 0.0073812398871474;
double r = (double) (1.0f / count); // approximate reciprocal
r = r * (2.0 - count*r); // much better approximation
r = r * (2.0 - count*r); // should be full double precision by now.
double result = value * r;
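
The steps above can be wrapped in a small helper and compared against ordinary double division. This is only a sketch: the class and method names (`NewtonDivision`, `Divide`) are hypothetical, and it assumes that double multiplication and subtraction are available on the target, that `count` is nonzero, and that `1/count` is representable as a normal float. The result is not guaranteed to be correctly rounded; it should typically land within a few units in the last place of `value / count`.

```csharp
static class NewtonDivision
{
    // Approximates value / count without any double-precision division.
    // Only one float division plus double multiply/subtract are used.
    public static double Divide(double value, int count)
    {
        // Seed: float division gives roughly 24 correct bits of 1/count.
        // The explicit (float) cast forces rounding to float precision.
        double r = (double)(float)(1.0f / count);

        // Each Newton-Raphson step r <- r * (2 - count * r) roughly doubles
        // the number of correct bits: ~24 -> ~48 -> limited by double's 53.
        r = r * (2.0 - count * r);
        r = r * (2.0 - count * r);

        return value * r;
    }
}
```

Calling `NewtonDivision.Divide(0.0073812398871474, 7)` and comparing it with `0.0073812398871474 / 7` on the CPU is a quick way to check the approach before porting it to HLSL.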

多彩岁月 2024-10-08 03:09:48

Apparently your arithmetic error is not immediately clear to you. Let me spell it out.

Suppose a double has two parts, the big part and the little part, each with roughly 32 bits of precision. (This is not exactly how doubles work but it will do for our purposes.)

A float only has one part.

Imagine we were doing it 32 bits at a time but keeping everything in doubles:

double divisor = whatever;
double dividend = dividendbig + dividendlittle;
double bigquotient = dividendbig / divisor;

What is bigquotient? It's a double. So it has two parts. bigquotient is equal to bigquotientbig + bigquotientlittle. Continuing on:

double littlequotient = dividendlittle / divisor;

Again, littlequotient is littlequotientbig + littlequotientlittle. Now we add the quotients:

double quotient = bigquotient + littlequotient;

How do we compute that? quotient has two parts. quotientbig will be set to bigquotientbig. quotientlittle will be set to bigquotientlittle + littlequotientbig. littlequotientlittle gets discarded.

Now suppose you do it in floats. You have:

float f1 = dividendbig;
float f2 = dividendlittle;
float r1 = f1 / divisor;

OK, what is r1? It's a float. So it only has one part. r1 is bigquotientbig.

float r2 = f2 / divisor;

What is r2? It's a float. So it only has one part. r2 is littlequotientbig.

double result = (double)r1 + (double)r2;

You add them together and you get bigquotientbig + littlequotientbig. What happened to bigquotientlittle? You've lost 32 bits of precision in there, and so it should come as no surprise that you get inaccuracies 32 bits along the way. You have not come up with the right algorithm for approximating 64-bit arithmetic in 32 bits.

In order to compute (big + little)/divisor, you can't simply do (big / divisor) + (little / divisor). That rule of algebra does not apply when you are rounding during every division!

Is that now clear?
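
The lost bigquotientlittle can be made visible with a small experiment. The sketch below uses hypothetical names (`SplitDivisionDemo`, `WithFloatQuotients`, `WithDoubleQuotients`); the second method relies on double division, so it is a CPU-side diagnostic only and not something that would run on the shader. It keeps the same big/little split of the dividend but stores the two quotients in doubles, so the low part of the big quotient survives.

```csharp
static class SplitDivisionDemo
{
    // The question's approach: both partial quotients are rounded to float.
    public static double WithFloatQuotients(double value, int count)
    {
        float f1 = (float)value;          // "big" part of the dividend
        float f2 = (float)(value - f1);   // "little" part of the dividend
        float r1 = (float)(f1 / count);   // keeps only bigquotientbig
        float r2 = (float)(f2 / count);   // keeps only littlequotientbig
        return (double)r1 + (double)r2;
    }

    // Same split, but the quotients stay in double, so bigquotientlittle
    // is not thrown away. Uses double division: for comparison only.
    public static double WithDoubleQuotients(double value, int count)
    {
        float f1 = (float)value;
        float f2 = (float)(value - f1);
        double r1 = (double)f1 / count;
        double r2 = (double)f2 / count;
        return r1 + r2;
    }
}
```

With the question's inputs (`value = 0.0073812398871474`, `count = 7`), the float version should reproduce the discrepancy reported in the question, while the double version should agree with `value / count` far more closely. The difference between the two is the rounding of `r1` to float.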

旧人 2024-10-08 03:09:48

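Is that possible?

Yes, as long as you:

accept the inevitable loss of precision
keep in mind that not all doubles fit into floats

UPDATE

After reading your comment (double precision is a requirement), my updated answer is:

No.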

静若繁花 2024-10-08 03:09:48

So how about something like

result = value * (double)(1f / (float)count);
?

There you're only dividing two floats. I have more casts in there than needed, but it's the concept that counts.
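
A quick way to see how much precision this one-line version keeps is to measure its relative error on the CPU. The helper below is a sketch with hypothetical names (`FloatReciprocalDemo`, `RelativeError`); it uses double division purely as a reference. Because the reciprocal is rounded to float, the result can carry a relative error of up to roughly 6e-8 (half a float ulp), unless `1/count` happens to be exactly representable, as it is for powers of two.

```csharp
static class FloatReciprocalDemo
{
    // Relative error of value * (double)(1f / count) against double division.
    public static double RelativeError(double value, int count)
    {
        // The reciprocal carries only about 24 significant bits.
        float reciprocal = (float)(1f / (float)count);
        double approx = value * (double)reciprocal;
        double exact = value / count;   // reference only, not for the GPU
        return System.Math.Abs(approx - exact) / System.Math.Abs(exact);
    }
}
```

For example, `FloatReciprocalDemo.RelativeError(0.0073812398871474, 7)` shows an error at float level, while a `count` of 4 gives no visible error because 0.25 is exact in float.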

Edit:
Okay, so you're worried about the difference between the actual and the rounded value, right? So just do it over and over until you get it right!

double result = 0;
double difference = value;
double total = 0;
float f1 = 0;
while (difference != 0)
{
    f1 = (float)difference;
    if (f1 == 0) break; // remainder too small for a float; stop instead of looping forever
    total += f1;
    difference = value - total;
    result += (double)(f1 / count);
}

...but you know, the easy answer still is "No". This still doesn't even catch ALL the rounding errors. From my tests it lowers the inaccuracies to 1e-17 at the most, about 30% of the time.
