SSE normalization slower than a simple approximation?

Posted 2024-10-15 11:55:22

I am trying to normalize a 4d vector.

My first approach was to use SSE intrinsics, which had given my vector arithmetic a 2x speed boost.
Here is the basic code (v.v4 is the input, compiled with GCC, and all of it is inlined):

//find squares
v4sf s = __builtin_ia32_mulps(v.v4, v.v4);
//set t to square
v4sf t = s;
//add the 4 squares together
s   = __builtin_ia32_shufps(s, s, 0x1B);
t      = __builtin_ia32_addps(t, s);
s   = __builtin_ia32_shufps(s, s, 0x4e);
t      = __builtin_ia32_addps(t, s);
s   = __builtin_ia32_shufps(s, s, 0x1B);
t      = __builtin_ia32_addps(t, s);
//find 1/sqrt of t
t      = __builtin_ia32_rsqrtps(t);
//multiply to get normal
return Vec4(__builtin_ia32_mulps(v.v4, t));
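
For comparison, here is a minimal sketch of the same normalization written with the `_mm_*` intrinsics from `<xmmintrin.h>` instead of the GCC-specific builtins. It sums the four squares with two shuffle/add pairs rather than three. The names `normalize4_sse` and `normalize4_sse_demo` are made up for this example, and it works on a raw `__m128` because the `Vec4` class is not shown in the post.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE: __m128, _mm_mul_ps, _mm_shuffle_ps, _mm_rsqrt_ps

// Hypothetical helper (not from the post): normalize { x, y, z, w }.
// _mm_rsqrt_ps is only an approximation (about 12 bits of precision).
inline __m128 normalize4_sse(__m128 v) {
    // Square each component: { x*x, y*y, z*z, w*w }.
    __m128 sq = _mm_mul_ps(v, v);
    // Swap neighbours within each pair and add:
    // { x*x+y*y, x*x+y*y, z*z+w*w, z*z+w*w }.
    __m128 sum = _mm_add_ps(sq, _mm_shuffle_ps(sq, sq, _MM_SHUFFLE(2, 3, 0, 1)));
    // Swap the two halves and add: every lane now holds the full sum.
    sum = _mm_add_ps(sum, _mm_shuffle_ps(sum, sum, _MM_SHUFFLE(1, 0, 3, 2)));
    // Approximate 1/sqrt(sum) in all four lanes, then scale the input.
    return _mm_mul_ps(v, _mm_rsqrt_ps(sum));
}

// Usage sketch: (1, 2, 2, 4) has length exactly 5.
inline void normalize4_sse_demo() {
    const float in[4] = {1.0f, 2.0f, 2.0f, 4.0f};
    float out[4];
    _mm_storeu_ps(out, normalize4_sse(_mm_loadu_ps(in)));
    // Compare with a tolerance because rsqrt is approximate.
    for (int i = 0; i < 4; ++i) {
        const float expected = in[i] / 5.0f;
        assert(out[i] > expected - 1e-3f && out[i] < expected + 1e-3f);
    }
}
```

Whether this is actually faster than the three-shuffle version depends on the CPU and on what the compiler does with it, so it is worth checking the disassembly and timing both.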

I checked the disassembly and it looks the way I would expect. I don't see any big problems there.

Anyway, I then tried it using an approximation (I got this from Google):

float x = (v.w*v.w) + (v.x*v.x) + (v.y*v.y) + (v.z*v.z);
float xhalf = 0.5f*x;
int i = *(int*)&x; // get bits for floating value
i = 0x5f3759df - (i>>1); // give initial guess y0
x = *(float*)&i; // convert bits back to float
x *= 1.5f - xhalf*x*x; // newton step, repeating this step
// increases accuracy
//x *= 1.5f - xhalf*x*x;
return Vec4(v.w*x, v.x*x, v.y*x, v.z*x);

It runs slightly faster than the SSE version (about 5-10% faster)! Its results are also very accurate; I would say to within 0.001 when finding the length!
But GCC gives me that lame strict-aliasing warning because of the type punning.

So I modified it:

union {
    float fa;
    int ia;
};
fa = (v.w*v.w) + (v.x*v.x) + (v.y*v.y) + (v.z*v.z);
float faHalf = 0.5f*fa;
ia = 0x5f3759df - (ia>>1);
fa *= 1.5f - faHalf*fa*fa;
//fa *= 1.5f - faHalf*fa*fa;
return Vec4(v.w*fa, v.x*fa, v.y*fa, v.z*fa);

And now the modified version (with no warnings) runs slower!! It runs at almost 60% of the speed of the SSE version (but gives the same result)! Why is this?

So here are my questions:

Is my SSE implementation correct?
Is SSE really slower than normal FPU operations?
Why the hell is the 3rd version so much slower?
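
Besides the pointer cast and the union, a third way to reinterpret the bits is `std::memcpy`, which does not trip the strict-aliasing warning and which optimizing compilers typically turn into a plain register move. The sketch below assumes a 32-bit IEEE-754 `float` and a positive, finite input; the name `fast_rsqrt` is made up for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helper: approximate 1/sqrt(x) for positive, finite x.
inline float fast_rsqrt(float x) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this trick assumes a 32-bit float");
    assert(x > 0.0f && std::isfinite(x));
    const float xhalf = 0.5f * x;
    std::uint32_t i;
    std::memcpy(&i, &x, sizeof i);   // read the float's bits, no aliasing issue
    i = 0x5f3759dfu - (i >> 1);      // initial guess y0
    float y;
    std::memcpy(&y, &i, sizeof y);   // turn the bits back into a float
    y *= 1.5f - xhalf * y * y;       // one Newton step; repeat for more accuracy
    return y;
}
```

With a single Newton step the relative error stays at roughly 0.2%, so `fast_rsqrt(4.0f)` comes out close to, but not exactly, 0.5. I have not measured how this compares in speed with the other two variants.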

情绪失控 2024-10-22 11:55:22

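I'm an idiot - I realized I had SETI@Home running while I was benchmarking. I guess it was killing my SSE performance. I turned it off and the SSE version ran twice as fast.

I also tested it on an AMD Athlon and got the same result: SSE is faster.

At least I fixed the shuf bug!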
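
Since a background process skewed the numbers here, one common way to make such timings less sensitive to other load is to repeat the measurement and keep the fastest run. This is only a sketch; `best_trial_ns` is a made-up name, and the function being timed needs some visible side effect (for example a write to a `volatile` variable) so the compiler cannot optimize it away.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <limits>

// Hypothetical helper: call fn() `iters` times per trial, run `trials`
// trials, and return the duration of the fastest trial in nanoseconds.
template <class Fn>
double best_trial_ns(Fn&& fn, int trials, int iters) {
    assert(trials > 0 && iters > 0);
    using clock = std::chrono::steady_clock;
    double best = std::numeric_limits<double>::max();
    for (int t = 0; t < trials; ++t) {
        const auto start = clock::now();
        for (int i = 0; i < iters; ++i) {
            fn();
        }
        const std::chrono::duration<double, std::nano> elapsed =
            clock::now() - start;
        // Keep the minimum: slower trials are more likely to have been
        // disturbed by other processes.
        best = std::min(best, elapsed.count());
    }
    return best;
}
```

It does not replace shutting down heavy background jobs before benchmarking, but it makes a single disturbed run less likely to decide the result.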

绿光 2024-10-22 11:55:22

Here is the most efficient assembly code I can think of. You can compare this to what your compiler generates. Assume the input and output are in XMM0.

       ; start with xmm0 = { v.x v.y v.z v.w }
       movaps  %xmm0, %xmm4        ; copy v; xmm0 keeps the original till the end
       mulps   %xmm4, %xmm4        ; xmm4 = { x*x y*y z*z w*w }
       pshufd  $1, %xmm4, %xmm1    ; xmm1 = { y*y x*x x*x x*x }
       addss   %xmm4, %xmm1        ; xmm1 = { y*y+x*x x*x x*x x*x }
       pshufd  $3, %xmm4, %xmm2    ; xmm2 = { w*w x*x x*x x*x }
       movhlps %xmm4, %xmm3        ; xmm3 = { z*z w*w ? ? }
       addss   %xmm1, %xmm3        ; xmm3 = { x*x+y*y+z*z w*w ? ? }
       addss   %xmm3, %xmm2        ; xmm2 = { x*x+y*y+z*z+w*w x*x x*x x*x }
       rsqrtps %xmm2, %xmm1        ; xmm1 = { rsqrt(x*x+y*y+z*z+w*w) ... }
       pshufd  $0, %xmm1, %xmm1    ; xmm1 = { rsqrt(...) in all four lanes }
       mulps   %xmm1, %xmm0
       ; end with xmm0 = { v.x*rsqrt(...) v.y*rsqrt(...) v.z*rsqrt(...) v.w*rsqrt(...) }
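
If you would rather let the compiler handle register allocation, the same idea (scalar adds for the sum, one reciprocal square root, then a broadcast) can be sketched with intrinsics. This is my reading of the approach above, not a literal translation: `normalize4_scalar_sum` and its demo function are made-up names, and it uses `_mm_shuffle_ps` and `_mm_rsqrt_ss` where the assembly uses `pshufd` and `rsqrtps`.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE: _mm_add_ss, _mm_movehl_ps, _mm_rsqrt_ss, ...

// Hypothetical helper: normalize { x, y, z, w } using scalar adds for the sum.
inline __m128 normalize4_scalar_sum(__m128 v) {
    const __m128 sq = _mm_mul_ps(v, v);  // { x*x, y*y, z*z, w*w }
    // Bring y*y, w*w and z*z into lane 0 of separate registers.
    const __m128 y2 = _mm_shuffle_ps(sq, sq, _MM_SHUFFLE(0, 0, 0, 1));
    const __m128 w2 = _mm_shuffle_ps(sq, sq, _MM_SHUFFLE(0, 0, 0, 3));
    const __m128 z2 = _mm_movehl_ps(sq, sq);
    // Scalar adds only touch lane 0: lane 0 = x*x + y*y + z*z + w*w.
    __m128 sum = _mm_add_ss(y2, sq);
    sum = _mm_add_ss(sum, z2);
    sum = _mm_add_ss(sum, w2);
    // Approximate 1/sqrt in lane 0, then broadcast it to all four lanes.
    __m128 r = _mm_rsqrt_ss(sum);
    r = _mm_shuffle_ps(r, r, _MM_SHUFFLE(0, 0, 0, 0));
    return _mm_mul_ps(v, r);
}

// Usage sketch: (1, 2, 2, 4) has length exactly 5.
inline void normalize4_scalar_sum_demo() {
    const float in[4] = {1.0f, 2.0f, 2.0f, 4.0f};
    float out[4];
    _mm_storeu_ps(out, normalize4_scalar_sum(_mm_loadu_ps(in)));
    for (int i = 0; i < 4; ++i) {
        const float expected = in[i] / 5.0f;
        assert(out[i] > expected - 1e-3f && out[i] < expected + 1e-3f);
    }
}
```

Compiling this with optimization and looking at the output is a quick way to see how close the compiler gets to the hand-written sequence.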
