SIMD 代码与标量代码

发布于 2024-10-07 03:19:24 字数 1614 浏览 9 评论 0原文

以下循环执行数百次。
<代码> elma 和 elmc 都是无符号长(64 位)数组,res1 和 res2 也是如此。

unsigned long simdstore[2];  
__m128i *p, simda, simdb, simdc;  
p = (__m128i *) simdstore;  

for (i = 0; i < _polylen; i++)  
{  

    u1 = (elma[i] >> l) & 15;  
    u2 = (elmc[i] >> l) & 15;  
    for (k = 0; k < 20; k++)  
    {     

    1.  //res1[i + k] ^= _mulpre1[u1][k];  
    2.  //res2[i + k] ^= _mulpre2[u2][k];               
    3.        _mm_prefetch ((const void *) &_mulpre2[u2][k], _MM_HINT_T0);
    4.        _mm_prefetch ((const void *) &_mulpre1[u1][k], _MM_HINT_T0);
    5.        simda = _mm_set_epi64x (_mulpre2[u2][k], _mulpre1[u1][k]);
    6.        _mm_prefetch ((const void *) &res2[i + k], _MM_HINT_T0); 
    7.        _mm_prefetch ((const void *) &res1[i + k], _MM_HINT_T0); 
    8.        simdb = _mm_set_epi64x (res2[i + k], res1[i + k]);  
    9.        simdc = _mm_xor_si128 (simda, simdb);  
    10.        _mm_store_si128 (p, simdc);  
    11.        res1[i + k] = simdstore[0];  
    12.        res2[i + k] = simdstore[1];                      
    }     
}  

在 for 循环中,标量版本的代码(带注释)的运行速度是 simd 代码的两倍。下面提到了上面几行的cachegrind输出(指令读取)。

1号线:668,460,000 2 2
2号线:668,460,000 1 1
第3行:89,985,000 1 1
4号线:89,985,000 1 1
5号线:617,040,000 2 2
6号线:44,992,500 0 0
7号线:44,992,500 0 0
8号线:539,910,000 1 1
第9行:128,550,000 0 0
第 10 行: . 。 .
第11行:205,680,000 0 0
第 12 行:205,680,000 0 0

从上图中可以看出,注释的(标量代码)所需的指令数量明显少于 simd 代码。

如何使这段代码变得更快?

The following loop is executed hundreds of times.

elma and elmc are both unsigned long (64-bit) arrays, so is res1 and res2.

unsigned long simdstore[2];  
__m128i *p, simda, simdb, simdc;  
p = (__m128i *) simdstore;  

for (i = 0; i < _polylen; i++)  
{  

    u1 = (elma[i] >> l) & 15;  
    u2 = (elmc[i] >> l) & 15;  
    for (k = 0; k < 20; k++)  
    {     

    1.  //res1[i + k] ^= _mulpre1[u1][k];  
    2.  //res2[i + k] ^= _mulpre2[u2][k];               
    3.        _mm_prefetch ((const void *) &_mulpre2[u2][k], _MM_HINT_T0);
    4.        _mm_prefetch ((const void *) &_mulpre1[u1][k], _MM_HINT_T0);
    5.        simda = _mm_set_epi64x (_mulpre2[u2][k], _mulpre1[u1][k]);
    6.        _mm_prefetch ((const void *) &res2[i + k], _MM_HINT_T0); 
    7.        _mm_prefetch ((const void *) &res1[i + k], _MM_HINT_T0); 
    8.        simdb = _mm_set_epi64x (res2[i + k], res1[i + k]);  
    9.        simdc = _mm_xor_si128 (simda, simdb);  
    10.        _mm_store_si128 (p, simdc);  
    11.        res1[i + k] = simdstore[0];  
    12.        res2[i + k] = simdstore[1];                      
    }     
}  

Within the for loop, scalar version of code (commented) runs twice as faster than simd code. With cachegrind output (Instruction reads) of the above lines is mentioned below.

Line 1: 668,460,000 2 2
Line 2: 668,460,000 1 1
Line 3: 89,985,000 1 1
Line 4: 89,985,000 1 1
Line 5: 617,040,000 2 2
Line 6: 44,992,500 0 0
Line 7: 44,992,500 0 0
Line 8: 539,910,000 1 1
Line 9: 128,550,000 0 0
Line 10: . . .
Line 11: 205,680,000 0 0
Line 12: 205,680,000 0 0

From the above figure, it appears that the commented (scalar code) requires significantly less number of instructions than simd code.

How can this code be made faster?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

南街女流氓 2024-10-14 03:19:24

去掉 _mm_prefetch 内在函数 - 它们在这种情况下没有取得任何成果,甚至可能会损害性能。仅当 (a) 您有空闲带宽并且 (b) 您可以在实际需要数据之前几百个时钟周期发出预取提示时,预取才有用。我认为(a)和(b)在你的情况下都不成立。

Take out the _mm_prefetch intrinsics - they are achieving nothing in this context and may even be hurting performance. Prefetch is only of benefit if (a) you have bandwidth to spare and (b) you can issue the prefetch hint several hundred clock cycles ahead of when the data is actually needed. I think neither (a) nor (b) are true in your case.

韶华倾负 2024-10-14 03:19:24

您的性能问题是这样的:

_mm_set_epi64x (_mulpre2[u2][k], _mulpre1[u1][k]);

mm_set(a,b,c,d) 类内部函数非常慢。
只有单个参数集内在函数(也称为广播)是快速的。

我查看了它们在汇编代码中的作用。

他们基本上在堆栈上创建一个数组,使用正常的内存移动(mov DWORD)将两个整数从它们当前所在的多维数组移动到堆栈数组。然后使用 XMM 内存从堆栈数组中移动 (mov XMWORD)。

标量版本直接从内存到寄存器。快点!

您会看到,开销来自于 XMM 寄存器一次只能与 128 位进行通信,因此您的程序在加载之前首先对内存的另一个区域中的 128 位进行排序。

如果有一种方法可以将 64 位值直接移动到普通寄存器或从普通寄存器移动到 XMM 寄存器,我仍在寻找它。

为了通过使用 SSE/XMM 寄存器来提高速度,您的数据可能需要在内存中按顺序排列。仅当您可以对每个乱序加载执行多个 XMM 操作时,将乱序数据加载到 XMM 寄存器才值得。这里你做了一个异或运算。

Your peformance problem is this:

_mm_set_epi64x (_mulpre2[u2][k], _mulpre1[u1][k]);

The mm_set(a,b,c,d) class of intrinsics are very slow.
Only single parameter set intrinsics (aka broadcast) are fast.

I looked at what they do in assembly code.

They basically create an array on the stack, move your two integers from the multidimensional arrays they currently reside in, to a stack array, using normal memory moves (mov DWORD). Then from the stack array using an XMM memory move (mov XMWORD).

The scalar version goes directly from memory to registers. FASTER!

You see the overhead comes from the fact that that the XMM register can only be communicated with 128 bits at a time, so your program is first ordering the 128 bits in another area of memory before loading them.

If there's a way to move 64 bit values directly to or from a normal register to an XMM register i'm still looking for it.

To get a speed boost from using SSE/XMM registers your data will probably need to be already in order in memory. Loading out of order data into an XMM register is only worth it if you can do several XMM operations per out of order load. Here your doing a single XOR operation.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文