John Carmack's Unusual Fast Inverse Square Root (Quake III)

Published 2024-08-03 20:17:49

John Carmack has a special function in the Quake III source code which calculates the inverse square root of a float, 4x faster than regular (float)(1.0/sqrt(x)), including a strange 0x5f3759df constant. See the code below. Can someone explain line by line what exactly is going on here and why this works so much faster than the regular implementation?

float Q_rsqrt( float number )
{
  long i;
  float x2, y;
  const float threehalfs = 1.5F;

  x2 = number * 0.5F;
  y  = number;
  i  = * ( long * ) &y;                       // read the float's bit pattern as an integer
  i  = 0x5f3759df - ( i >> 1 );               // magic constant minus half the bits: initial guess
  y  = * ( float * ) &i;                      // reinterpret the adjusted bits as a float
  y  = y * ( threehalfs - ( x2 * y * y ) );   // one iteration of Newton's method

  #ifndef Q3_VM
  #ifdef __linux__
    assert( !isnan(y) );
  #endif
  #endif
  return y;
}
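
For context, here is a minimal harness (my own sketch, not part of the question or the Quake III source) that calls Q_rsqrt as defined above and compares it against the regular (float)(1.0/sqrt(x)):

#include <stdio.h>
#include <math.h>

/* Q_rsqrt as defined above is assumed to be pasted here. */

int main(void)
{
    float xs[] = { 0.25f, 1.0f, 2.0f, 100.0f, 12345.678f };
    for (int k = 0; k < 5; k++) {
        float fast  = Q_rsqrt(xs[k]);
        float exact = (float)(1.0 / sqrt(xs[k]));
        printf("x = %12.3f  fast = %.6f  exact = %.6f  rel. err = %.2e\n",
               xs[k], fast, exact, fabs(fast - exact) / exact);
    }
    return 0;
}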

Comments (6)

甜嗑 2024-08-10 20:17:49

FYI. Carmack didn't write it. Terje Mathisen and Gary Tarolli both take partial (and very modest) credit for it, as well as crediting some other sources.

How the mythical constant was derived is something of a mystery.

To quote Gary Tarolli:

Which actually is doing a floating
point computation in integer - it took
a long time to figure out how and why
this works, and I can't remember the
details anymore.

A slightly better constant, developed by an expert mathematician (Chris Lomont) while trying to work out how the original algorithm worked, is:

float InvSqrt(float x)
{
    float xhalf = 0.5f * x;
    int i = *(int*)&x;              // get bits for floating value
    i = 0x5f375a86 - (i >> 1);      // gives initial guess y0
    x = *(float*)&i;                // convert bits back to float
    x = x * (1.5f - xhalf * x * x); // Newton step, repeating increases accuracy
    return x;
}

In spite of this, his initial attempt at a mathematically 'superior' version of id's sqrt (which arrived at almost the same constant) proved inferior to the one initially developed by Gary, despite being mathematically much 'purer'. He couldn't explain why id's was so excellent, IIRC.
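
One way to see the difference empirically is to run both constants through the same one-Newton-step routine and compare worst-case relative error. The sketch below is mine (rsqrt_magic is a hypothetical helper, not from either author); it uses memcpy for the bit reinterpretation to avoid the aliasing problems of the original pointer cast:

#include <stdio.h>
#include <string.h>
#include <math.h>
#include <stdint.h>

/* One-Newton-step inverse sqrt, parameterized by the magic constant. */
static float rsqrt_magic(float x, uint32_t magic)
{
    float xhalf = 0.5f * x;
    uint32_t i;
    memcpy(&i, &x, sizeof i);        /* bits of x, without aliasing UB */
    i = magic - (i >> 1);
    memcpy(&x, &i, sizeof x);
    return x * (1.5f - xhalf * x * x);
}

int main(void)
{
    double worst_df = 0.0, worst_a86 = 0.0;
    for (float x = 0.01f; x < 100.0f; x += 0.01f) {
        double exact = 1.0 / sqrt((double)x);
        double e1 = fabs(rsqrt_magic(x, 0x5f3759df) - exact) / exact;
        double e2 = fabs(rsqrt_magic(x, 0x5f375a86) - exact) / exact;
        if (e1 > worst_df)  worst_df  = e1;
        if (e2 > worst_a86) worst_a86 = e2;
    }
    printf("worst rel. error 0x5f3759df: %.3e\n", worst_df);
    printf("worst rel. error 0x5f375a86: %.3e\n", worst_a86);
    return 0;
}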

情栀口红 2024-08-10 20:17:49

Of course these days, it turns out to be much slower than just using an FPU's sqrt (especially on 360/PS3), because swapping between float and int registers induces a load-hit-store, while the floating point unit can do reciprocal square root in hardware.

It just shows how optimizations have to evolve as the nature of underlying hardware changes.
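
For reference, the hardware path on modern x86 is the SSE rsqrtss instruction, exposed through compiler intrinsics. A minimal sketch (mine, assuming an SSE-capable x86 compiler such as gcc or clang):

#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics */

/* Approximate reciprocal square root computed in hardware by rsqrtss. */
static float rsqrt_hw(float x)
{
    return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
}

int main(void)
{
    /* rsqrtss is accurate to roughly 12 bits; a Newton step can refine it. */
    printf("rsqrt_hw(2.0f) = %f\n", rsqrt_hw(2.0f));
    return 0;
}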

听,心雨的声音 2024-08-10 20:17:49

Greg Hewgill and IllidanS4 gave a link with an excellent mathematical explanation.
I'll try to sum it up here for those who don't want to go too much into the details.

Any mathematical function, with some exceptions, can be represented by a polynomial sum:

y = f(x)

can be exactly transformed into:

y = a0 + a1*x + a2*(x^2) + a3*(x^3) + a4*(x^4) + ...

where a0, a1, a2, ... are constants. The problem is that for many functions, like the square root, the exact value requires an infinite number of terms; the sum does not end at some x^n. But if we stop at some x^n, we still get a result up to some precision.

So, if we have:

y = 1/sqrt(x)

In this particular case they decided to discard all polynomial terms above the second, probably for the sake of calculation speed:

y = a0 + a1*x + [...discarded...]

The task now comes down to calculating a0 and a1 so that y has the least difference from the exact value. They calculated that the most appropriate values are:

a0 = 0x5f375a86
a1 = -0.5

So when you put this into equation you get:

y = 0x5f375a86 - 0.5*x

Which is the same as the line you see in the code:

i = 0x5f375a86 - (i >> 1);

Edit: actually, y = 0x5f375a86 - 0.5*x here is not the same as i = 0x5f375a86 - (i >> 1), since shifting the float's bits as an integer not only divides by two but also halves the exponent and causes some other artifacts; still, it comes down to calculating some coefficients a0, a1, a2, ...
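
To make the exponent-halving effect concrete, here is a small sketch of mine that prints a float's biased exponent field before and after the integer shift:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void)
{
    float y = 16.0f;                 /* 2^4, so the biased exponent field is 4 + 127 = 131 */
    uint32_t i;
    memcpy(&i, &y, sizeof i);        /* read the bit pattern */
    printf("exponent field of y:      %u\n", (i >> 23) & 0xFF);         /* prints 131 */
    printf("exponent field of i >> 1: %u\n", ((i >> 1) >> 23) & 0xFF);  /* prints 65, roughly half */
    return 0;
}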

At this point they found that this result's precision was not enough for the purpose, so they additionally did one step of Newton's iteration to improve the accuracy:

x = x * (1.5f - xhalf * x * x)

They could have done more iterations in a loop, each one improving the result, until the required accuracy was met. This is exactly how it works in the CPU/FPU! But it seems that only one iteration was enough, which was also a blessing for speed. The CPU/FPU does as many iterations as needed to reach the accuracy of the floating point type in which the result is stored, and it has a more general algorithm which works for all cases.


So in short, what they did is:

Use (almost) the same algorithm as the CPU/FPU, exploit a better initial guess for the special case of 1/sqrt(x), and stop before reaching the precision the CPU/FPU would go to, thus gaining calculation speed.
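
As a sketch of the loop idea (my own variant; rsqrt_n is a hypothetical name, not from the Quake source), the Newton step can simply be repeated. In 32-bit floats the improvement saturates after a couple of steps:

#include <stdio.h>
#include <string.h>
#include <math.h>
#include <stdint.h>

/* Bit-hack initial guess plus a configurable number of Newton steps. */
static float rsqrt_n(float number, int steps)
{
    float x2 = number * 0.5f, y = number;
    uint32_t i;
    memcpy(&i, &y, sizeof i);
    i = 0x5f3759df - (i >> 1);
    memcpy(&y, &i, sizeof y);
    for (int n = 0; n < steps; n++)
        y = y * (1.5f - x2 * y * y);   /* each step improves the approximation */
    return y;
}

int main(void)
{
    for (int s = 0; s <= 3; s++)
        printf("%d step(s): %.9f (exact %.9f)\n", s, rsqrt_n(2.0f, s), 1.0 / sqrt(2.0));
    return 0;
}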

不可一世的女人 2024-08-10 20:17:49

I was curious to see what the constant was as a float so I simply wrote this bit of code and googled the integer that popped out.

#include <stdio.h>
#include <stdint.h>
int main(void)
{
    int32_t i = 0x5F3759DF;   /* int32_t rather than long: long can be 8 bytes */
    float* fp = (float*)&i;   /* type-punning, as in the original hack */
    printf("(2^127)^(1/2) = %f\n", *fp);
}
//Output
//(2^127)^(1/2) = 13211836172961054720.000000

It looks like the constant is "An integer approximation to the square root of 2^127 better known by the hexadecimal form of its floating-point representation, 0x5f3759df" https://mrob.com/pub/math/numbers-18.html

On the same site it explains the whole thing. https://mrob.com/pub/math/numbers-16.html#le009_16

凯凯我们等你回来 2024-08-10 20:17:49

According to this nice article written a while back...

The magic of the code, even if you can't follow it, stands out as the i = 0x5f3759df - (i>>1); line. Simplified, Newton-Raphson is an approximation that starts off with a guess and refines it with iteration. Taking advantage of the nature of 32-bit x86 processors, i, an integer, is initially set to the value of the floating point number you want to take the inverse square of, using an integer cast. i is then set to 0x5f3759df, minus itself shifted one bit to the right. The right shift drops the least significant bit of i, essentially halving it.

It's a really good read. This is only a tiny piece of it.

小嗷兮 2024-08-10 20:17:49

The code consists of two major parts. Part one calculates an approximation for 1/sqrt(y), and part two takes that number and runs one iteration of Newton's method to get a better approximation.

Calculating an approximation for 1/sqrt(y)

i  = * ( long * ) &y;
i  = 0x5f3759df - ( i >> 1 );
y  = * ( float * ) &i;

Line 1 takes the floating point representation of y and treats it as an integer i. Line 2 shifts i over one bit and subtracts it from a mysterious constant. Line 3 takes the resulting number and converts it back to a standard float32. Now why does this work?

Let g be a function that maps a floating point number to its floating point representation, read as an integer. Line 1 above is setting i = g(y).

The following good approximation of g exists (*):
g(y) ≈ C*log_2(y) + D for some constants C and D. An intuition for why such a good approximation exists is that the floating point representation of y is roughly linear in the exponent.

The purpose of line 2 is to map from g(y) to g(1/sqrt(y)), after which line 3 can use g^-1 to map that number to 1/sqrt(y). Using the approximation above, we have g(1/sqrt(y)) ≈ C*log_2(1/sqrt(y)) + D = -C/2 * log_2(y) + D. We can use these formulas to calculate the map from g(y) to g(1/sqrt(y)), which is g(1/sqrt(y)) ≈ 3D/2 - 1/2 * g(y). In line 2, we have 0x5f3759df ≈ 3D/2, and i >> 1 ≈ 1/2 * g(y).

The constant 0x5f3759df is slightly smaller than the constant that gives the best possible approximation for g(1/sqrt(y)). That is because this step is not done in isolation. Due to the direction that Newton's method tends to miss in, using a slightly smaller constant tends to yield better results. The exact optimal constant to use in this setting depends on your input distribution of y, but 0x5f3759df is one such constant that gives good results over a fairly broad range.

A more detailed description of this process can be found on Wikipedia: https://en.wikipedia.org/wiki/Fast_inverse_square_root#Algorithm

(*) More explicitly, let y = 2^e*(1+f). Taking the log of both sides, we get log_2 y = e + log_2(1+f), which can be approximated as log_2 y ≈ e + f + σ for a small constant sigma. Separately, the float32 encoding of y expressed as an integer is g(y) ≈ 2^23 * (e+127) + f * 2^23. Combining the two equations, we get g(y) ≈ 2^23 * log_2 y + 2^23 * (127 - σ).
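
That approximation is easy to check numerically. A sketch of mine, using σ ≈ 0.0430 (a commonly quoted choice for the tuning constant):

#include <stdio.h>
#include <string.h>
#include <math.h>
#include <stdint.h>

int main(void)
{
    const double sigma = 0.0430;   /* commonly quoted value of the tuning constant */
    float ys[] = { 0.5f, 1.0f, 3.0f, 10.0f, 256.0f };
    for (int k = 0; k < 5; k++) {
        uint32_t g;
        memcpy(&g, &ys[k], sizeof g);   /* g(y): the float's bits read as an integer */
        double approx = 8388608.0 * (log2((double)ys[k]) + 127.0 - sigma);  /* 2^23 * (log_2 y + 127 - sigma) */
        printf("y = %8.2f  g(y) = %10u  approx = %14.1f\n", ys[k], g, approx);
    }
    return 0;
}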

Using Newton's method

y  = y * ( threehalfs - ( x2 * y * y ) );

Consider the function f(y) = 1/y^2 - num. The positive zero of f is y = 1/sqrt(num), which is what we are interested in calculating.

Newton's method is an iterative algorithm for taking an approximation y_n for the zero of a function f, and calculating a better approximation y_n+1, using the following equation: y_n+1 = y_n - f(y_n)/f'(y_n).

Calculating what that looks like for our function f gives the following equation: y_n+1 = y_n - (-y_n+y_n^3*num)/2 = y_n * (3/2 - num/2 * y_n * y_n). This is exactly what the line of code above is doing.
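
That algebra can be checked in a few lines (a sketch of mine): the general Newton form and the simplified form in the code produce the same update.

#include <stdio.h>

int main(void)
{
    double num = 7.0, y = 0.3;   /* an arbitrary guess for 1/sqrt(7) */
    /* General Newton form, y - f(y)/f'(y), with f(y) = 1/y^2 - num and f'(y) = -2/y^3 */
    double general = y - (1.0 / (y * y) - num) / (-2.0 / (y * y * y));
    /* Simplified form used in the code */
    double simplified = y * (1.5 - (num / 2.0) * y * y);
    printf("general:    %.12f\nsimplified: %.12f\n", general, simplified);
    return 0;
}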

You can learn more about the details of Newton's method here: https://en.wikipedia.org/wiki/Newton%27s_method
