SSE SIMD optimization of a for loop
I have some code in a loop
for(int i = 0; i < n; i++)
{
u[i] = c * u[i] + s * b[i];
}
So u and b are vectors of the same length, and c and s are scalars. Is this code a good candidate for vectorization with SSE to get a speedup?
Update
I learned about vectorization (it turns out it's not hard if you use intrinsics) and implemented my loop in SSE. However, when I set the SSE2 flag in the VC++ compiler, I get about the same performance as with my own SSE code. The Intel compiler, on the other hand, was much faster than either my SSE code or the VC++ compiler.
Here is the code I wrote, for reference:
double *u = (double*) _aligned_malloc(n * sizeof(double), 16); // 16-byte aligned for SSE
for (int i = 0; i < n; i++)
{
    u[i] = 0;
}

__m128d *uSSE = (__m128d*) u;      // valid because u is 16-byte aligned
__m128d cStore = _mm_set1_pd(c);   // broadcast c into both lanes
__m128d sStore = _mm_set1_pd(s);   // broadcast s into both lanes
int j = 0;
for (j = 0; j <= n - 2; j += 2)
{
    __m128d uStore     = _mm_set_pd(u[j + 1], u[j]);
    __m128d omegaStore = _mm_set_pd(omegaCache[j + 1], omegaCache[j]);
    __m128d cu = _mm_mul_pd(cStore, uStore);
    __m128d so = _mm_mul_pd(sStore, omegaStore);
    uSSE[j / 2] = _mm_add_pd(cu, so);
}
for (; j < n; ++j)                 // scalar tail for odd n
{
    u[j] = c * u[j] + s * omegaCache[j];
}
5 Answers
Yes, this is an excellent candidate for vectorization. But, before you do so, make sure you've profiled your code to be sure that this is actually worth optimizing. That said, the vectorization would go something like this:
For even more performance, you can consider prefetching further array elements, and/or unrolling the loop and using software pipelining to interleave the computation in one loop with the memory accesses from a different iteration.
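The code sample this answer referred to did not survive extraction; the following is a minimal sketch of the vectorized loop it describes, assuming u and b are 16-byte aligned and do not overlap (the function name `axpy_sse2` is illustrative):

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// u[i] = c*u[i] + s*b[i] over n doubles;
// u and b must be 16-byte aligned and must not overlap.
void axpy_sse2(double* u, const double* b, double c, double s, int n)
{
    __m128d cv = _mm_set1_pd(c);   // broadcast c into both lanes
    __m128d sv = _mm_set1_pd(s);   // broadcast s into both lanes
    int i = 0;
    for (; i <= n - 2; i += 2) {
        __m128d uv = _mm_load_pd(&u[i]);   // aligned load of u[i], u[i+1]
        __m128d bv = _mm_load_pd(&b[i]);   // aligned load of b[i], b[i+1]
        _mm_store_pd(&u[i],
                     _mm_add_pd(_mm_mul_pd(cv, uv), _mm_mul_pd(sv, bv)));
    }
    for (; i < n; ++i)                     // scalar tail for odd n
        u[i] = c * u[i] + s * b[i];
}
```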
_mm_set_pd is not vectorized. If taken literally, it reads the two doubles using scalar operations, then combines the two scalar doubles and copies them into the SSE register. Use _mm_load_pd instead.
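As a sketch of the substitution this answer suggests (function names are illustrative):

```cpp
#include <emmintrin.h>

// Two ways to get a pair of adjacent doubles into an SSE register.
// _mm_set_pd builds the register from two scalar accesses;
// _mm_load_pd is a single 16-byte load (p must be 16-byte aligned).
__m128d load_pair_set(const double* p)  { return _mm_set_pd(p[1], p[0]); }
__m128d load_pair_load(const double* p) { return _mm_load_pd(p); }
```

Both produce the same register contents; the difference is only in how many memory operations the compiler has to emit.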
Probably yes, but you have to help the compiler with some hints.
__restrict__ placed on pointers tells the compiler that there is no aliasing between the two pointers.
If you know the alignment of your vectors, communicate that to the compiler (Visual C++ may have some facility).
I am not familiar with Visual C++ myself, but I have heard it is no good for vectorization. Consider using the Intel compiler instead.
Intel allows pretty fine-grained control over the generated assembly: http://www.intel.com/software/products/compilers/docs/clin/main_cls/cref_cls/common/cppref_pragma_vector.htm
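A minimal sketch of the aliasing hint, assuming a GCC/Clang-style compiler (`__restrict__` is the GCC/Clang spelling; MSVC uses `__restrict`):

```cpp
// __restrict__ promises the compiler that u and b never overlap, which
// frees the auto-vectorizer from defensively re-loading u[i] after every
// store through b (or vice versa).
void scale_add(double* __restrict__ u, const double* __restrict__ b,
               double c, double s, int n)
{
    for (int i = 0; i < n; ++i)
        u[i] = c * u[i] + s * b[i];
}
```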
Yes, this is a great candidate for vectorization, assuming there is no overlap between the U and B arrays. But the code is bound by memory access (load/store). Vectorization helps reduce cycles per loop, but the instructions will stall due to cache misses on the U and B arrays. The Intel C/C++ Compiler, with default flags for a Xeon X5500 processor, unrolls the loop by 8 and employs SIMD ADD (addpd) and MULTIPLY (mulpd) instructions using the xmm[0-15] SIMD registers. In each cycle, the processor can issue 2 SIMD instructions, yielding 4-way scalar ILP, assuming the data is ready in the registers.
Here U, B, C and S are double precision (8 bytes).
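The compiler output this answer describes was lost in extraction; a hand-unrolled sketch of the same idea (unrolled by 4 doubles, i.e. two `__m128d` operations per iteration, with names chosen for illustration) looks like this:

```cpp
#include <emmintrin.h>

// The two mul/add chains inside the loop body are independent, so the
// processor can issue them in parallel, exposing instruction-level
// parallelism. u and b must be 16-byte aligned and must not overlap.
void axpy_sse2_unrolled(double* u, const double* b, double c, double s, int n)
{
    __m128d cv = _mm_set1_pd(c);
    __m128d sv = _mm_set1_pd(s);
    int i = 0;
    for (; i <= n - 4; i += 4) {
        __m128d u0 = _mm_load_pd(&u[i]);
        __m128d u1 = _mm_load_pd(&u[i + 2]);
        __m128d b0 = _mm_load_pd(&b[i]);
        __m128d b1 = _mm_load_pd(&b[i + 2]);
        _mm_store_pd(&u[i],     _mm_add_pd(_mm_mul_pd(cv, u0), _mm_mul_pd(sv, b0)));
        _mm_store_pd(&u[i + 2], _mm_add_pd(_mm_mul_pd(cv, u1), _mm_mul_pd(sv, b1)));
    }
    for (; i < n; ++i)   // scalar tail
        u[i] = c * u[i] + s * b[i];
}
```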
It depends on how you placed u and b in memory.
If the two memory blocks are far from each other, SSE wouldn't boost much in this scenario.
It is suggested that the arrays u and b be AoS (array of structures) instead of SoA (structure of arrays), because you can load both of them into a register in a single instruction.
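A sketch of the layout this answer suggests (the struct name is illustrative). Note the trade-off: interleaving puts each u[i]/b[i] pair in one 16-byte slot, but SSE then needs shuffles to separate the lanes, so profile before committing to this layout:

```cpp
// AoS layout: each element pairs the u value with its matching b value,
// so the data consumed together is adjacent in memory.
struct Pair { double u, b; };

void axpy_aos(Pair* p, double c, double s, int n)
{
    for (int i = 0; i < n; ++i)
        p[i].u = c * p[i].u + s * p[i].b;
}
```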