Performance differences of SIMD operations across different CPU architectures
I see an important performance difference between a SIMD-based sum reduction and its scalar counterpart across different CPU architectures.
The function in question is simple: it receives a 16-byte-aligned vector B of uint8_t elements and a range B[l,r], where l and r are multiples of 16, and it returns the sum of the elements within B[l,r].
This is my code:
// SIMD version of the reduction
inline int simd_sum_red(size_t l, size_t r, const uint8_t* B) {
    __m128i zero = _mm_setzero_si128();
    // _mm_sad_epu8 against zero sums the 16 bytes into two partial sums,
    // one in each 64-bit half of the register (32-bit lanes 0 and 2)
    __m128i sum0 = _mm_sad_epu8(zero, _mm_load_si128(reinterpret_cast<const __m128i*>(B + l)));
    l += 16;
    while (l <= r) {
        __m128i sum1 = _mm_sad_epu8(zero, _mm_load_si128(reinterpret_cast<const __m128i*>(B + l)));
        sum0 = _mm_add_epi32(sum0, sum1);
        l += 16;
    }
    // bring the partial sum in lane 2 down to lane 0 and add the two halves
    __m128i totalsum = _mm_add_epi32(sum0, _mm_shuffle_epi32(sum0, 2));
    return _mm_cvtsi128_si32(totalsum);
}
// Regular (scalar) reduction
inline size_t reg_sum_red(size_t l, size_t r, const uint8_t* B) {
    size_t acc = 0;
    for (size_t i = l; i <= r; i++) {
        acc += B[i];
    }
    return acc;
}
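For completeness, a minimal calling sketch is shown below. It is only illustrative (the buffer contents and the range are arbitrary); the actual requirements are that B is 16-byte aligned, since _mm_load_si128 performs an aligned load, and that l and r are multiples of 16.

#include <cstdint>
#include <cstdio>
#include <emmintrin.h>   // SSE2 intrinsics; on the Apple M1 this is replaced by sse2neon.h

// simd_sum_red / reg_sum_red as defined above

int main() {
    alignas(16) uint8_t B[256];                     // 16-byte-aligned buffer
    for (size_t i = 0; i < 256; i++) B[i] = 1;      // arbitrary contents

    size_t l = 16, r = 112;                         // both multiples of 16
    printf("sum = %d\n", simd_sum_red(l, r, B));    // reg_sum_red(l, r, B) is called the same way
    return 0;
}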
It is worth mentioning that I built my SIMD function using the answers to another question I asked a couple of days ago:
Accessing the fields of a __m128i variable in a portable way
For the experiments, I took random ranges of B of at most 256 elements (16 SIMD registers) and measured the average number of nanoseconds each function spends per symbol of B[l,r]. I compared two CPU architectures: Apple M1 and Intel(R) Xeon(R) Silver 4110. I used the same source code in both cases, and the same compiler (g++) with the flags -std=c++17 -msse4.2 -O3 -funroll-loops -fomit-frame-pointer -ffast-math. The only difference is that for the Apple M1 I had to include an extra header called sse2neon.h, which translates Intel intrinsics to NEON intrinsics (the SIMD instruction set for ARM-based architectures), and I omitted the -msse4.2 flag in that case.
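My exact benchmark is not listed here; the sketch below only illustrates the methodology (random 16-aligned ranges, average nanoseconds per symbol) using std::chrono, and the constants and variable names are purely illustrative.

#include <algorithm>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <random>

// assumes simd_sum_red / reg_sum_red from above are visible here

int main() {
    constexpr size_t N = 256;                        // at most 256 elements (16 SSE registers)
    alignas(16) uint8_t B[N];
    std::mt19937 gen(42);
    std::uniform_int_distribution<int> bytes(0, 255);
    for (size_t i = 0; i < N; i++) B[i] = static_cast<uint8_t>(bytes(gen));

    std::uniform_int_distribution<size_t> block(0, N / 16 - 1); // block index * 16 gives a multiple of 16
    constexpr size_t iters = 1000000;
    size_t total_syms = 0;
    volatile size_t sink = 0;                        // keeps the calls from being optimized away

    auto start = std::chrono::steady_clock::now();
    for (size_t it = 0; it < iters; it++) {
        size_t a = block(gen) * 16, b = block(gen) * 16;
        size_t l = std::min(a, b), r = std::max(a, b);
        sink = sink + reg_sum_red(l, r, B);          // swap in simd_sum_red(l, r, B) for the SIMD run
        total_syms += r - l + 1;
    }
    auto stop = std::chrono::steady_clock::now();
    double ns = std::chrono::duration<double, std::nano>(stop - start).count();
    printf("nanosecs/sym : %f\n", ns / total_syms);
    return 0;
}

Generating the ranges inside the timed loop adds the same overhead to both measurements; a more careful harness would precompute them outside the timed region.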
These are the results I obtained with the Apple M1 processor:
nanosecs/sym for reg_sum_red : 1.16952
nanosecs/sym for simd_sum_red : 0.383278
As you can see, there is a significant difference between using SIMD instructions and not using them.
These are the results with the Intel(R) Xeon(R) Silver 4110 processor:
nanosecs/sym for reg_sum_red : 6.01793
nanosecs/sym for simd_sum_red : 5.94958
In this case, there is not a big difference.
I suppose the reason is the compilers I am using: GNU GCC on the Intel machine versus Apple's g++. So which compiler flags should I pass to GNU GCC (Intel) to see a performance gap between the SIMD reduction and the regular reduction as large as the one I see on the Apple M1?
Update:
I realized that g++ on macOS is an alias for Clang (as also pointed out by @CodyGray), so I was using different compilers in my previous experiments. I have now tried Clang on the Intel architecture, and indeed I obtained reduction timings similar to those on the Apple M1. However, the question remains: is there any modification I can make, either in the source code or in the compiler flags, to make my GCC-compiled code as efficient as Clang's?