SSE normalization slower than a simple approximation?
I am trying to normalize a 4D vector.
My first approach was to use SSE intrinsics - something that gave my vector arithmetic roughly a 2x speed boost.
Here is the basic code (v.v4 is the input, compiled with GCC, and all of this is inlined):
//find squares
v4sf s = __builtin_ia32_mulps(v.v4, v.v4);
//set t to the squares
v4sf t = s;
//add the 4 squares together
s = __builtin_ia32_shufps(s, s, 0x1B); //0x1B reverses the four lanes
t = __builtin_ia32_addps(t, s);
s = __builtin_ia32_shufps(s, s, 0x4e); //0x4E swaps the two 64-bit halves
t = __builtin_ia32_addps(t, s);
s = __builtin_ia32_shufps(s, s, 0x1B);
t = __builtin_ia32_addps(t, s);
//find approximate 1/sqrt of t (in every lane)
t = __builtin_ia32_rsqrtps(t);
//multiply to get normal
return Vec4(__builtin_ia32_mulps(v.v4, t));
I checked the disassembly and it looks like what I would expect; I don't see any big problems there.
Anyway, I then tried it using an approximation (I got this from Google):
float x = (v.w*v.w) + (v.x*v.x) + (v.y*v.y) + (v.z*v.z);
float xhalf = 0.5f*x;
int i = *(int*)&x; // get bits for floating value
i = 0x5f3759df - (i>>1); // give initial guess y0
x = *(float*)&i; // convert bits back to float
x *= 1.5f - xhalf*x*x; // newton step, repeating this step
// increases accuracy
//x *= 1.5f - xhalf*x*x;
return Vec4(v.w*x, v.x*x, v.y*x, v.z*x);
It runs slightly faster than the SSE version (about 5-10% faster), and its results are also very accurate - I would say to within 0.001 when finding the length!
But... GCC is giving me that lame strict-aliasing warning because of the type punning.
So I modified it:
union {
    float fa;
    int ia;
};
fa = (v.w*v.w) + (v.x*v.x) + (v.y*v.y) + (v.z*v.z);
float faHalf = 0.5f*fa;
ia = 0x5f3759df - (ia>>1);
fa *= 1.5f - faHalf*fa*fa;
//fa *= 1.5f - faHalf*fa*fa;
return Vec4(v.w*fa, v.x*fa, v.y*fa, v.z*fa);
And now the modified version (with no warnings) is running slower! It's running at almost 60% of the speed of the SSE version (but with the same results)! Why is this?
So here are my questions:
- Is my SSE implementation correct?
- Is SSE really slower than normal FPU operations?
- Why the hell is the third version so much slower?
3 Answers
I am a dope - I realized I had SETI@Home running while benchmarking. I'm guessing it was killing my SSE performance. Turned it off and got it running twice as fast.
I also tested it on an AMD Athlon and got the same results - SSE was faster.
At least I fixed the shuf bug!
Here is the most efficient assembly code I can think of. You can compare this to what your compiler generates. Assume the input and output are in XMM0.
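A sketch of one such tight sequence, written here with <xmmintrin.h> intrinsics rather than raw assembly (each call maps to a single SSE instruction, and the compiler keeps the value in an XMM register throughout); the function name normalize4 is just a placeholder, and this is not necessarily the exact listing the answer had in mind:
#include <xmmintrin.h>
// Normalize the four floats in v; every intrinsic below compiles to one SSE instruction.
__m128 normalize4(__m128 v)
{
    __m128 sq  = _mm_mul_ps(v, v);                              // mulps: square each component
    __m128 sum = _mm_add_ps(sq, _mm_shuffle_ps(sq, sq, 0x4E));  // shufps+addps: add the swapped 64-bit halves
    sum = _mm_add_ps(sum, _mm_shuffle_ps(sum, sum, 0xB1));      // shufps+addps: full sum in all four lanes
    __m128 inv = _mm_rsqrt_ps(sum);                             // rsqrtps: approximate 1/sqrt
    return _mm_mul_ps(v, inv);                                  // mulps: scale by 1/length
}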
My guess is that the third version is slower because the compiler decides to put the union in a memory variable. With the cast version, it can copy the values from register to register. You can just look at the generated machine code.
As to why SSE is inaccurate, I don't have an answer. It would help if you could give real numbers. If the difference is 0.3 on a vector of length 1, that would be outrageous.
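One way to avoid both the strict-aliasing warning and a forced round trip through memory is to do the bit reinterpretation with memcpy, which GCC and Clang typically compile down to plain register moves at -O2. A minimal sketch (the function name rsqrt_approx is just a placeholder, not code from the question):
#include <cstring>
inline float rsqrt_approx(float x)
{
    float xhalf = 0.5f * x;
    int i;
    std::memcpy(&i, &x, sizeof i);   // reinterpret the float's bits as an int (no aliasing violation)
    i = 0x5f3759df - (i >> 1);       // initial guess y0
    std::memcpy(&x, &i, sizeof x);   // bits back to float
    x *= 1.5f - xhalf * x * x;       // one Newton step
    return x;
}
With optimization on, this usually produces the same instruction sequence as the pointer-cast version, without the warning; checking the generated machine code, as suggested above, will show whether the union version really spills to the stack.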