使用 sse 指令进行复杂的 Mul 和 Div

发布于 2024-09-08 18:32:01 字数 86 浏览 15 评论 0原文

通过 SSE 指令执行复杂的乘法和除法是否有益? 我知道使用 SSE 时加法和减法表现更好。有人可以告诉我如何使用 SSE 执行复杂的乘法以获得更好的性能吗?

Is performing complex multiplication and division beneficial through SSE instructions?
I know that addition and subtraction perform better when using SSE. Can someone tell me how I can use SSE to perform complex multiplication to get better performance?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

风月客 2024-09-15 18:32:01

为了完整起见,可以下载《英特尔® 64 和 IA-32 架构优化参考手册》此处包含复数乘法(示例 6-9)和复数除法(示例 6-10)的汇编。

例如,乘法代码如下:

// Multiplication of (ak + i bk ) * (ck + i dk )
// a + i b can be stored as a data structure
movsldup xmm0, src1; load real parts into the destination, a1, a1, a0, a0
movaps xmm1, src2; load the 2nd pair of complex values, i.e. d1, c1, d0, c0
mulps xmm0, xmm1; temporary results, a1d1, a1c1, a0d0, a0c0
shufps xmm1, xmm1, b1; reorder the real and imaginary parts, c1, d1, c0, d0
movshdup xmm2, src1; load imaginary parts into the destination, b1, b1, b0, b0
mulps xmm2, xmm1; temporary results, b1c1, b1d1, b0c0, b0d0
addsubps xmm0, xmm2; b1c1+a1d1, a1c1 -b1d1, b0c0+a0d0, ; a0c0-b0d0

程序集直接映射到 gccs X86 内在函数(只需用 __builtin_ia32_ 谓词每个指令)。

Just for completeness, the Intel® 64 and IA-32 Architectures Optimization Reference Manual that can be downloaded here contains assembly for complex multiply (Example 6-9) and complex divide (Example 6-10).

Here's for example the multiply code:

// Multiplication of (ak + i bk ) * (ck + i dk )
// a + i b can be stored as a data structure
movsldup xmm0, src1; load real parts into the destination, a1, a1, a0, a0
movaps xmm1, src2; load the 2nd pair of complex values, i.e. d1, c1, d0, c0
mulps xmm0, xmm1; temporary results, a1d1, a1c1, a0d0, a0c0
shufps xmm1, xmm1, b1; reorder the real and imaginary parts, c1, d1, c0, d0
movshdup xmm2, src1; load imaginary parts into the destination, b1, b1, b0, b0
mulps xmm2, xmm1; temporary results, b1c1, b1d1, b0c0, b0d0
addsubps xmm0, xmm2; b1c1+a1d1, a1c1 -b1d1, b0c0+a0d0, ; a0c0-b0d0

The assembly maps directly to gccs X86 intrinsics (just predicate each instruction with __builtin_ia32_).

青瓷清茶倾城歌 2024-09-15 18:32:01

那么复数乘法定义为:

((c1a * c2a) - (c1b * c2b)) + ((c1b * c2a) + (c1a * c2b))i

因此,复数中的 2 个分量将是

((c1a * c2a) - (c1b * c2b)) and ((c1b * c2a) + (c1a * c2b))i

因此,假设您使用 8 个浮点数来表示 4 个复数,定义如下:

c1a, c1b, c2a, c2b
c3a, c3b, c4a, c4b

并且您想同时执行 (c1 * c3) 和 (c2 * c4)你的SSE代码看起来像下面这样:(

注意我在Windows下使用了MSVC,但原理是一样的)。

__declspec( align( 16 ) ) float c1c2[]        = { 1.0f, 2.0f, 3.0f, 4.0f };
__declspec( align( 16 ) ) float c3c4[]          = { 4.0f, 3.0f, 2.0f, 1.0f };
__declspec( align( 16 ) ) float mulfactors[]    = { -1.0f, 1.0f, -1.0f, 1.0f };
__declspec( align( 16 ) ) float res[]           = { 0.0f, 0.0f, 0.0f, 0.0f };

__asm 
{
    movaps xmm0, xmmword ptr [c1c2]         // Load c1 and c2 into xmm0.
    movaps xmm1, xmmword ptr [c3c4]         // Load c3 and c4 into xmm1.
    movaps xmm4, xmmword ptr [mulfactors]   // load multiplication factors into xmm4

    movaps xmm2, xmm1                       
    movaps xmm3, xmm0                       
    shufps xmm2, xmm1, 0xA0                 // Change order to c3a c3a c4a c4a and store in xmm2
    shufps xmm1, xmm1, 0xF5                 // Change order to c3b c3b c4b c4b and store in xmm1
    shufps xmm3, xmm0, 0xB1                 // change order to c1b c1a c2b c2a abd store in xmm3

    mulps xmm0, xmm2                        
    mulps xmm3, xmm1                    
    mulps xmm3, xmm4                        // Flip the signs of the 'a's so the add works correctly.

    addps xmm0, xmm3                        // Add together

    movaps xmmword ptr [res], xmm0          // Store back out
};

float res1a = (c1c2[0] * c3c4[0]) - (c1c2[1] * c3c4[1]);
float res1b = (c1c2[1] * c3c4[0]) + (c1c2[0] * c3c4[1]);

float res2a = (c1c2[2] * c3c4[2]) - (c1c2[3] * c3c4[3]);
float res2b = (c1c2[3] * c3c4[2]) + (c1c2[2] * c3c4[3]);

if ( res1a != res[0] || 
     res1b != res[1] || 
     res2a != res[2] || 
     res2b != res[3] )
{
    _exit( 1 );
}

我上面所做的是稍微简化了数学。假设如下:

c1a c1b c2a c2b
c3a c3b c4a c4b

通过重新排列,我最终得到以下向量

0 => c1a c1b c2a c2b
1 => c3b c3b c4b c4b
2 => c3a c3a c4a c4a
3 => c1b c1a c2b c2a

,然后将 0 和 2 相乘得到:

0 => c1a * c3a, c1b * c3a, c2a * c4a, c2b * c4a

接下来我将 3 和 1 相乘得到:

3 => c1b * c3b, c1a * c3b, c2b * c4b, c2a * c4b

最后我翻转 3 中几个浮点的符号

3 => -(c1b * c3b), c1a * c3b, -(c2b * c4b), c2a * c4b

所以我可以添加他们在一起并得到

(c1a * c3a) - (c1b * c3b), (c1b * c3a ) + (c1a * c3b), (c2a * c4a) - (c2b * c4b), (c2b * c4a) + (c2a * c4b)

这就是我们所追求的:)

Well complex multiplication is defined as:

((c1a * c2a) - (c1b * c2b)) + ((c1b * c2a) + (c1a * c2b))i

So your 2 components in a complex number would be

((c1a * c2a) - (c1b * c2b)) and ((c1b * c2a) + (c1a * c2b))i

So assuming you are using 8 floats to represent 4 complex numbers defined as follows:

c1a, c1b, c2a, c2b
c3a, c3b, c4a, c4b

And you want to simultaneously do (c1 * c3) and (c2 * c4) your SSE code would look "something" like the following:

(Note I used MSVC under windows but the principle WILL be the same).

__declspec( align( 16 ) ) float c1c2[]        = { 1.0f, 2.0f, 3.0f, 4.0f };
__declspec( align( 16 ) ) float c3c4[]          = { 4.0f, 3.0f, 2.0f, 1.0f };
__declspec( align( 16 ) ) float mulfactors[]    = { -1.0f, 1.0f, -1.0f, 1.0f };
__declspec( align( 16 ) ) float res[]           = { 0.0f, 0.0f, 0.0f, 0.0f };

__asm 
{
    movaps xmm0, xmmword ptr [c1c2]         // Load c1 and c2 into xmm0.
    movaps xmm1, xmmword ptr [c3c4]         // Load c3 and c4 into xmm1.
    movaps xmm4, xmmword ptr [mulfactors]   // load multiplication factors into xmm4

    movaps xmm2, xmm1                       
    movaps xmm3, xmm0                       
    shufps xmm2, xmm1, 0xA0                 // Change order to c3a c3a c4a c4a and store in xmm2
    shufps xmm1, xmm1, 0xF5                 // Change order to c3b c3b c4b c4b and store in xmm1
    shufps xmm3, xmm0, 0xB1                 // change order to c1b c1a c2b c2a abd store in xmm3

    mulps xmm0, xmm2                        
    mulps xmm3, xmm1                    
    mulps xmm3, xmm4                        // Flip the signs of the 'a's so the add works correctly.

    addps xmm0, xmm3                        // Add together

    movaps xmmword ptr [res], xmm0          // Store back out
};

float res1a = (c1c2[0] * c3c4[0]) - (c1c2[1] * c3c4[1]);
float res1b = (c1c2[1] * c3c4[0]) + (c1c2[0] * c3c4[1]);

float res2a = (c1c2[2] * c3c4[2]) - (c1c2[3] * c3c4[3]);
float res2b = (c1c2[3] * c3c4[2]) + (c1c2[2] * c3c4[3]);

if ( res1a != res[0] || 
     res1b != res[1] || 
     res2a != res[2] || 
     res2b != res[3] )
{
    _exit( 1 );
}

What I've done above is I've simplified the maths out a bit. Assuming the following:

c1a c1b c2a c2b
c3a c3b c4a c4b

By rearranging I end up with the following vectors

0 => c1a c1b c2a c2b
1 => c3b c3b c4b c4b
2 => c3a c3a c4a c4a
3 => c1b c1a c2b c2a

I then multiply 0 and 2 together to get:

0 => c1a * c3a, c1b * c3a, c2a * c4a, c2b * c4a

Next I multiply 3 and 1 together to get:

3 => c1b * c3b, c1a * c3b, c2b * c4b, c2a * c4b

Finally I flip the signs of a couple of the floats in 3

3 => -(c1b * c3b), c1a * c3b, -(c2b * c4b), c2a * c4b

So I can add them together and get

(c1a * c3a) - (c1b * c3b), (c1b * c3a ) + (c1a * c3b), (c2a * c4a) - (c2b * c4b), (c2b * c4a) + (c2a * c4b)

Which is what we were after :)

浅黛梨妆こ 2024-09-15 18:32:01

英特尔优化参考中的算法无法正确处理输入中的溢出和 NaN。

数字的实部或虚部中的单个 NaN 将错误地传播到其他部分。

由于多个无穷大运算(例如无穷大 * 0)以 NaN 结尾,溢出可能会导致 NaN 出现在原本表现良好的数据中。

如果溢出和 NaN 很少见,避免这种情况的一个简单方法是仅检查结果中的 NaN 并使用编译器 IEEE 兼容实现重新计算它:

float complex a[2], b[2];
__m128 res = simd_fast_multiply(a, b);

/* store unconditionally, can be executed in parallel with the check
 * making it almost free if there is no NaN in data */
_mm_store_ps(dest, res);

/* check for NaN */
__m128 n = _mm_cmpneq_ps(res, res);
int have_nan = _mm_movemask_ps(n);
if (have_nan != 0) {
    /* do it again unvectorized */
    dest[0] = a[0] * b[0];
    dest[1] = a[1] * b[1];
}

The algorithm in the intel optimization reference does not handle overflows and NaNs in the input properly.

A single NaN in the real or imaginary part of the number will incorrectly spread to the other part.

As several operations with infinity (e.g. infinity * 0) end in NaN, overflows can cause NaNs to appear in your otherwise well-behaved data.

If overflows and NaNs are rare, a simple way to avoid this is to just check for NaN in the result and recompute it with the compilers IEEE compliant implementation:

float complex a[2], b[2];
__m128 res = simd_fast_multiply(a, b);

/* store unconditionally, can be executed in parallel with the check
 * making it almost free if there is no NaN in data */
_mm_store_ps(dest, res);

/* check for NaN */
__m128 n = _mm_cmpneq_ps(res, res);
int have_nan = _mm_movemask_ps(n);
if (have_nan != 0) {
    /* do it again unvectorized */
    dest[0] = a[0] * b[0];
    dest[1] = a[1] * b[1];
}
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文