Complex Mul and Div using SSE instructions
Is performing complex multiplication and division through SSE instructions beneficial?
I know that addition and subtraction perform better with SSE. Can someone tell me how to use SSE to perform complex multiplication with better performance?
Just for completeness: the Intel® 64 and IA-32 Architectures Optimization Reference Manual, which is available for download, contains assembly for complex multiply (Example 6-9) and complex divide (Example 6-10).
Here, for example, is the multiply code:
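As a rough illustration of the idea, here is a minimal sketch in C with SSE3 intrinsics. It is not the manual's Example 6-9 verbatim: the function name cmul2_sse3 and the interleaved real/imaginary layout are assumptions made for this sketch, and the target attribute is only a GCC/Clang convenience so the file builds without -msse3.

```c
#include <pmmintrin.h>  /* SSE3: _mm_moveldup_ps, _mm_movehdup_ps, _mm_addsub_ps */

/* Hypothetical helper (name and layout are assumptions of this sketch).
   x = { a0, b0, a1, b1 } holds a0 + b0*i and a1 + b1*i
   y = { c0, d0, c1, d1 } holds c0 + d0*i and c1 + d1*i
   Returns the two products x0*y0 and x1*y1 in the same layout. */
#if defined(__GNUC__)
__attribute__((target("sse3")))
#endif
static __m128 cmul2_sse3(__m128 x, __m128 y)
{
    __m128 re  = _mm_moveldup_ps(x);   /* a0 a0 a1 a1 */
    __m128 im  = _mm_movehdup_ps(x);   /* b0 b0 b1 b1 */
    __m128 ysw = _mm_shuffle_ps(y, y, _MM_SHUFFLE(2, 3, 0, 1)); /* d0 c0 d1 c1 */
    __m128 t0  = _mm_mul_ps(re, y);    /* a0c0 a0d0 a1c1 a1d1 */
    __m128 t1  = _mm_mul_ps(im, ysw);  /* b0d0 b0c0 b1d1 b1c1 */
    /* subtract in even lanes, add in odd lanes:
       a0c0-b0d0, a0d0+b0c0, a1c1-b1d1, a1d1+b1c1 */
    return _mm_addsub_ps(t0, t1);
}
```

Like the routine discussed here, this sketch does nothing special for Inf or NaN inputs.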
The assembly maps directly to GCC's x86 intrinsics (just prefix each instruction with __builtin_ia32_).
Well, complex multiplication is defined on the real and imaginary parts of the two operands, so the product has 2 components, a real one and an imaginary one:
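Written out, with a + bi and c + di as assumed names for the two operands (the symbols are chosen for illustration):

```latex
(a + bi)(c + di) = (ac - bd) + (ad + bc)\,i
\qquad\Longrightarrow\qquad
\operatorname{Re} = ac - bd, \quad \operatorname{Im} = ad + bc
```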
So, assuming you are using 8 floats to represent 4 complex numbers, defined as follows:
and you want to do (c1 * c3) and (c2 * c4) simultaneously, your SSE code would look something like the following:
(Note: I used MSVC under Windows, but the principle will be the same.)
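A minimal sketch of that approach in portable C with SSE intrinsics rather than MSVC inline assembly. The helper name cmul_pairs_sse and the interleaved layout (c1.re, c1.im, c2.re, c2.im in one array, c3 and c4 likewise in the other) are assumptions made for this example.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_shuffle_ps, _mm_mul_ps, _mm_add_ps */

/* Hypothetical helper. Assumed layout (interleaved real/imag):
   c1c2 = { c1.re, c1.im, c2.re, c2.im }
   c3c4 = { c3.re, c3.im, c4.re, c4.im }
   Writes { (c1*c3).re, (c1*c3).im, (c2*c4).re, (c2*c4).im } to res. */
static void cmul_pairs_sse(const float *c1c2, const float *c3c4, float *res)
{
    /* _mm_set_ps lists lanes from high to low, so lanes 0..3 are -1 +1 -1 +1 */
    const __m128 sign = _mm_set_ps(1.0f, -1.0f, 1.0f, -1.0f);

    __m128 x  = _mm_loadu_ps(c1c2);
    __m128 v2 = _mm_loadu_ps(c3c4);                              /* c3.re c3.im c4.re c4.im */
    __m128 v0 = _mm_shuffle_ps(x, x, _MM_SHUFFLE(2, 2, 0, 0));   /* c1.re c1.re c2.re c2.re */
    __m128 v1 = _mm_shuffle_ps(x, x, _MM_SHUFFLE(3, 3, 1, 1));   /* c1.im c1.im c2.im c2.im */
    __m128 v3 = _mm_shuffle_ps(v2, v2, _MM_SHUFFLE(2, 3, 0, 1)); /* c3.im c3.re c4.im c4.re */

    v0 = _mm_mul_ps(v0, v2);    /* re*re, re*im for each pair */
    v3 = _mm_mul_ps(v3, v1);    /* im*im, im*re for each pair */
    v3 = _mm_mul_ps(v3, sign);  /* negate the im*im terms */
    _mm_storeu_ps(res, _mm_add_ps(v0, v3));
}
```

Unaligned loads and stores are used so the caller does not need 16-byte aligned arrays; with aligned data, _mm_load_ps and _mm_store_ps would work as well.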
What I've done above is simplify the maths a bit. Assuming the following:
By rearranging I end up with the following vectors
I then multiply 0 and 2 together to get:
Next I multiply 3 and 1 together to get:
Finally I flip the signs of a couple of the floats in 3
So I can add them together and get
Which is what we were after :)
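The steps above can be written out as follows. The letters a to h and the vector names v0 to v3 are assumed here for illustration, with the vector indices matching the numbers used in the description.

```latex
\begin{aligned}
c_1 &= a + bi, \quad c_2 = c + di, \quad c_3 = e + fi, \quad c_4 = g + hi \\
v_0 &= (a,\ a,\ c,\ c), \quad v_1 = (b,\ b,\ d,\ d), \quad
v_2 = (e,\ f,\ g,\ h), \quad v_3 = (f,\ e,\ h,\ g) \\
v_0 \odot v_2 &= (ae,\ af,\ cg,\ ch) \\
v_3 \odot v_1 &= (bf,\ be,\ dh,\ dg) \\
\text{after the sign flip} &: (-bf,\ be,\ -dh,\ dg) \\
\text{sum} &= (ae - bf,\ af + be,\ cg - dh,\ ch + dg)
\end{aligned}
```

The first two lanes are the real and imaginary parts of c1 * c3, and the last two are those of c2 * c4.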
The algorithm in the Intel optimization reference does not handle overflows and NaNs in the input properly.
A single NaN in the real or imaginary part of a number will incorrectly spread to the other part. As several operations with infinity (e.g. infinity * 0) end in NaN, overflows can cause NaNs to appear in your otherwise well-behaved data.
If overflows and NaNs are rare, a simple way to avoid this is to just check for NaN in the result and recompute it with the compiler's IEEE-compliant implementation:
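A minimal sketch of that check, assuming a C99 compiler with complex.h (GCC or Clang, for instance) and no flags such as -ffast-math that relax complex arithmetic. The names fast_cmul2 and cmul2_checked are hypothetical, and fast_cmul2 is only a stand-in for whatever fast SIMD routine is in use.

```c
#include <complex.h>
#include <xmmintrin.h>

/* Stand-in for a fast SIMD multiply with no Inf/NaN handling
   (hypothetical name). Layout: { re0, im0, re1, im1 }. */
static __m128 fast_cmul2(__m128 x, __m128 y)
{
    const __m128 sign = _mm_set_ps(1.0f, -1.0f, 1.0f, -1.0f); /* lanes: -1 +1 -1 +1 */
    __m128 re  = _mm_shuffle_ps(x, x, _MM_SHUFFLE(2, 2, 0, 0));
    __m128 im  = _mm_shuffle_ps(x, x, _MM_SHUFFLE(3, 3, 1, 1));
    __m128 ysw = _mm_shuffle_ps(y, y, _MM_SHUFFLE(2, 3, 0, 1));
    return _mm_add_ps(_mm_mul_ps(re, y),
                      _mm_mul_ps(_mm_mul_ps(im, ysw), sign));
}

/* dest[i] = a[i] * b[i] for i = 0, 1. dest must not overlap a or b. */
static void cmul2_checked(float complex dest[2],
                          const float complex a[2],
                          const float complex b[2])
{
    /* A float complex is laid out as two consecutive floats (real, imag). */
    __m128 res = fast_cmul2(_mm_loadu_ps((const float *)a),
                            _mm_loadu_ps((const float *)b));

    /* Store unconditionally; this is the common, NaN-free case. */
    _mm_storeu_ps((float *)dest, res);

    /* An unordered compare of a value with itself is true only for NaN,
       so any set bit in the mask means at least one lane is NaN. */
    if (_mm_movemask_ps(_mm_cmpunord_ps(res, res)) != 0) {
        /* Rare case: redo both products with the compiler's complex multiply. */
        dest[0] = a[0] * b[0];
        dest[1] = a[1] * b[1];
    }
}
```

The fallback recomputes both products even if only one lane contained a NaN, which keeps the check down to a single compare and branch.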