Double division using two floats?

Posted on 2024-10-01 03:09:48 · 482 words · 0 views · 0 comments

I would like to do a double division using two floats (it appears that Direct Compute does not support double division).

Is that possible?

This is what I tried so far (C# code; it should become HLSL later):

int count = 7;
double value = 0.0073812398871474;
float f1 = (float)value;
float f2 = (float)((value - f1));
float r1 = f1 / count;
float r2 = f2 / count;
double result = (double)r1 + (double)r2;

0,00105446285765182 (result)

0,00105446284102106 (correct result)

It has to do with the rounding in f1. If value is instead:

 double value = 0.0073812344471474;

Then the result is correct.

Comments (5)

剑心龙吟 2024-10-08 03:09:48

Calculate the reciprocal of count with a float division, then improve the precision to full double using the Newton-Raphson reciprocal formula.

int count = 7;
double value = 0.0073812398871474;
double r = (double) (1.0f / count); // approximate reciprocal
r = r * (2.0 - count*r); // much better approximation
r = r * (2.0 - count*r); // should be full double precision by now.
double result = value * r;
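
A minimal sketch of the same idea in C (C rather than C# or HLSL, since the arithmetic reads the same), with the number of refinement steps made a parameter so the convergence can be checked. The names `reciprocal_newton`, `divide_newton`, and `rel_error` are made up for this illustration. If r is off from 1/count by a relative error e, one step leaves roughly e squared, so a float-accurate starting guess should reach double's limit after two steps.

```c
#include <assert.h>

/* Approximate 1/count: float seed, then `steps` Newton-Raphson refinements
   using only double multiply and subtract (no double division). */
double reciprocal_newton(int count, int steps)
{
    assert(count != 0);                        /* 0 has no reciprocal */
    double c = (double)count;
    double r = (double)(1.0f / (float)count);  /* seed: about 24 good bits */
    for (int i = 0; i < steps; ++i)
        r = r * (2.0 - c * r);                 /* relative error roughly squares */
    return r;
}

/* value / count via the refined reciprocal (two steps, as in the answer). */
double divide_newton(double value, int count)
{
    return value * reciprocal_newton(count, 2);
}

/* Relative error of approx against a reference value (reference != 0). */
double rel_error(double approx, double reference)
{
    double d = (approx - reference) / reference;
    return d < 0 ? -d : d;
}
```

For count = 7 the float seed is off by a few parts in 10^8, and a single step should already bring that down to around 10^-15. The final product can still differ from a correctly rounded division in the last bit or so, because the reciprocal and the multiplication are each rounded separately.
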
多彩岁月 2024-10-08 03:09:48

Apparently your arithmetic error is not immediately clear to you. Let me spell it out.

Suppose a double has two parts, the big part and the little part, each with roughly 32 bits of precision. (This is not exactly how doubles work but it will do for our purposes.)

A float only has one part.

Imagine we were doing it 32 bits at a time but keeping everything in doubles:

double divisor = whatever;
double dividend = dividendbig + dividendlittle;
double bigquotient = dividendbig / divisor;

What is bigquotient? It's a double. So it has two parts. bigquotient is equal to bigquotientbig + bigquotientlittle. Continuing on:

double littlequotient = dividendlittle / divisor;

Again, littlequotient is littlequotientbig + littlequotientlittle. Now we add the quotients:

double quotient = bigquotient + littlequotient;

How do we compute that? quotient has two parts. quotientbig will be set to bigquotientbig. quotientlittle will be set to bigquotientlittle + littlequotientbig. littlequotientlittle gets discarded.

Now suppose you do it in floats. You have:

float f1 = dividendbig;
float f2 = dividendlittle;
float r1 = f1 / divisor;

OK, what is r1? It's a float. So it only has one part. r1 is bigquotientbig.

float r2 = f2 / divisor;

What is r2? It's a float. So it only has one part. r2 is littlequotientbig.

double result = (double)r1 + (double)r2;

You add them together and you get bigquotientbig + littlequotientbig. What happened to bigquotientlittle? You've lost 32 bits of precision in there, so it should come as no surprise that you get inaccuracies 32 bits along the way. You have not come up with the right algorithm for approximating 64-bit arithmetic in 32 bits.

In order to compute (big + little)/divisor, you can't simply do (big / divisor) + (little / divisor). That rule of algebra does not apply when you are rounding during every division!

Is that now clear?
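
One way to keep the bits that the float quotient drops is to work like long division: divide in float, multiply the rounded quotient back in double, and then divide whatever is left over. The following is only a sketch in C, under assumptions that are not in the original post: double multiplication and subtraction are available (only double division is missing), count is small enough that the products are exact in double, and nothing falls outside float range. `divide_naive` repeats the question's approach for comparison; `divide_by_residual` and `rel_error` are hypothetical names.

```c
#include <assert.h>

/* The question's approach: split the dividend, divide each piece in float.
   The rounding error of f1 / count is never recovered. */
double divide_naive(double value, int count)
{
    float f1 = (float)value;
    float f2 = (float)(value - f1);
    float r1 = f1 / count;
    float r2 = f2 / count;
    return (double)r1 + (double)r2;
}

/* Long-division style: each float quotient is multiplied back in double,
   and the leftover (which includes that quotient's rounding error) is
   divided again. */
double divide_by_residual(double value, int count)
{
    assert(count != 0);                       /* division by zero is undefined */
    float q[3];
    double remainder = value;
    for (int i = 0; i < 3; ++i) {
        q[i] = (float)remainder / (float)count;      /* float division only */
        remainder -= (double)q[i] * (double)count;   /* what is still owed */
    }
    /* Add the small terms first so they are not rounded away early. */
    return ((double)q[2] + (double)q[1]) + (double)q[0];
}

/* Relative error of approx against a reference value (reference != 0). */
double rel_error(double approx, double reference)
{
    double d = (approx - reference) / reference;
    return d < 0 ? -d : d;
}
```

With the question's inputs (0.0073812398871474 and 7), the naive version is off by about 1.6e-8 relative, which matches the posted result, while three residual terms should land within an ulp or so of the true double quotient. It is still not guaranteed to be correctly rounded in every case.
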

旧人 2024-10-08 03:09:48

Is that possible?

Yes, as long as you:

  • Accept the inevitable loss of precision
  • Bear in mind that not all doubles fit into floats in the first place

Update

After reading your comments (double precision is a requirement), my updated answer is:

No.

静若繁花 2024-10-08 03:09:48

So how about something like


result = value * (double)(1f / (float)count);

?

There you're only dividing two floats. I have more casts in there than needed, but it's the concept that counts.

Edit:
Okay, so you're worried about the difference between the actual and the rounded value, right? So just do it over and over until you get it right!

double result = 0;
double difference = value;
double total = 0;
float f1 = 0;
while (difference != 0)
{
    f1 = (float)difference;
    total += f1;
    difference = value - total;
    result += (double)(f1 / count);
}

...but you know, the easy answer still is "No". This still doesn't even catch ALL the rounding errors. From my tests it lowers the inaccuracies to 1e-17 at the most, about 30% of the time.

风尘浪孓 2024-10-08 03:09:48

In a comment, you say:

Of course there should not be any loss
of precision. This is why I'm using
two floats. If I would accept loss of
precision, then I could just cast two
float and do the division.

An IEEE-754 single precision value has 24 significant binary digits. A double precision value has 53 significant digits. You can't even represent a double precision value as two single precision values without loss of accuracy, much less do arithmetic with such a representation.

That said, it is possible to do a correctly rounded double precision division using only conversions between double and single, double precision subtraction/addition, and single precision operations, but it's pretty complicated if you really want to do it right. Do you need actual IEEE-754 correct rounding, or just an answer that's correct up to the last bit or two?
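
To make the 24-versus-53-bit point concrete, here is a small C sketch (C rather than C#, on the assumption that the conversions behave the same way) that splits a double into a high and a low float and reports what the pair fails to capture. The function name `split_residual` and the test value are invented for this illustration.

```c
#include <assert.h>

/* Split value into hi + lo (two floats) and return the part of value
   that the pair cannot represent. Zero means the split was lossless. */
double split_residual(double value)
{
    assert(value == value);                   /* NaN has no meaningful split */
    float hi = (float)value;                  /* top 24 significant bits */
    float lo = (float)(value - (double)hi);   /* next (up to) 24 bits */
    return value - ((double)hi + (double)lo);
}
```

For example, 1.0 + 0x1p-25 + 0x1p-52 is exactly representable as a double, but hi rounds to 1.0 and lo rounds to 0x1p-25, so the last bit (0x1p-52) is left over. A value such as 1.5 that already fits in a float splits without loss.
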
