Performance differences of SIMD operations across different CPU architectures
I see an important performance difference between a SIMD-based sum reduction and its scalar counterpart across different CPU architectures.
The function in question is simple: it receives a 16-byte-aligned vector B of uint8_t elements and a range B[l,r], where l and r are multiples of 16, and it returns the sum of the elements within B[l,r].
This is my code:
// SIMD version of the reduction
inline int simd_sum_red(size_t l, size_t r, const uint8_t* B) {
    __m128i zero = _mm_setzero_si128();
    // _mm_sad_epu8 against zero sums the 16 bytes into two partial sums,
    // one in each 64-bit half of the register (32-bit lanes 0 and 2)
    __m128i sum0 = _mm_sad_epu8(zero, _mm_load_si128(reinterpret_cast<const __m128i*>(B + l)));
    l += 16;
    while (l <= r) {
        __m128i sum1 = _mm_sad_epu8(zero, _mm_load_si128(reinterpret_cast<const __m128i*>(B + l)));
        sum0 = _mm_add_epi32(sum0, sum1);
        l += 16;
    }
    // bring the partial sum in lane 2 down to lane 0 and add the two halves
    __m128i totalsum = _mm_add_epi32(sum0, _mm_shuffle_epi32(sum0, 2));
    return _mm_cvtsi128_si32(totalsum);
}
// Regular (scalar) reduction
inline size_t reg_sum_red(size_t l, size_t r, const uint8_t* B) {
    size_t acc = 0;
    for (size_t i = l; i <= r; i++) {
        acc += B[i];
    }
    return acc;
}
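For completeness, a minimal calling sketch is shown below. It is only illustrative (the buffer contents and the range are arbitrary); the actual requirements are that B is 16-byte aligned, since _mm_load_si128 performs an aligned load, and that l and r are multiples of 16.

#include <cstdint>
#include <cstdio>
#include <emmintrin.h>   // SSE2 intrinsics; on the Apple M1 this is replaced by sse2neon.h

// simd_sum_red / reg_sum_red as defined above

int main() {
    alignas(16) uint8_t B[256];                     // 16-byte-aligned buffer
    for (size_t i = 0; i < 256; i++) B[i] = 1;      // arbitrary contents

    size_t l = 16, r = 112;                         // both multiples of 16
    printf("sum = %d\n", simd_sum_red(l, r, B));    // reg_sum_red(l, r, B) is called the same way
    return 0;
}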
It is worth mentioning that I built my SIMD function using the answers to another question I asked a couple of days ago:
Accessing the fields of a __m128i variable in a portable way
For the experiments, I took random ranges of B of at most 256 elements (16 SIMD registers) and measured the average number of nanoseconds each function spends per symbol of B[l,r]. I compared two CPU architectures: Apple M1 and Intel(R) Xeon(R) Silver 4110. I used the same source code in both cases, and the same compiler (g++) with the flags -std=c++17 -msse4.2 -O3 -funroll-loops -fomit-frame-pointer -ffast-math. The only difference is that for the Apple M1 I had to include an extra header called sse2neon.h, which translates Intel intrinsics to NEON intrinsics (the SIMD instruction set for ARM-based architectures), and I omitted the -msse4.2 flag in that case.
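My exact benchmark is not listed here; the sketch below only illustrates the methodology (random 16-aligned ranges, average nanoseconds per symbol) using std::chrono, and the constants and variable names are purely illustrative.

#include <algorithm>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <random>

// assumes simd_sum_red / reg_sum_red from above are visible here

int main() {
    constexpr size_t N = 256;                        // at most 256 elements (16 SSE registers)
    alignas(16) uint8_t B[N];
    std::mt19937 gen(42);
    std::uniform_int_distribution<int> bytes(0, 255);
    for (size_t i = 0; i < N; i++) B[i] = static_cast<uint8_t>(bytes(gen));

    std::uniform_int_distribution<size_t> block(0, N / 16 - 1); // block index * 16 gives a multiple of 16
    constexpr size_t iters = 1000000;
    size_t total_syms = 0;
    volatile size_t sink = 0;                        // keeps the calls from being optimized away

    auto start = std::chrono::steady_clock::now();
    for (size_t it = 0; it < iters; it++) {
        size_t a = block(gen) * 16, b = block(gen) * 16;
        size_t l = std::min(a, b), r = std::max(a, b);
        sink = sink + reg_sum_red(l, r, B);          // swap in simd_sum_red(l, r, B) for the SIMD run
        total_syms += r - l + 1;
    }
    auto stop = std::chrono::steady_clock::now();
    double ns = std::chrono::duration<double, std::nano>(stop - start).count();
    printf("nanosecs/sym : %f\n", ns / total_syms);
    return 0;
}

Generating the ranges inside the timed loop adds the same overhead to both measurements; a more careful harness would precompute them outside the timed region.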
These are the results I obtained with the Apple M1 processor:
nanosecs/sym for reg_sum_red : 1.16952
nanosecs/sym for simd_sum_red : 0.383278
As you can see, there is a significant difference between using SIMD instructions and not using them.
These are the results with the Intel(R) Xeon(R) Silver 4110 processor:
nanosecs/sym for reg_sum_red : 6.01793
nanosecs/sym for simd_sum_red : 5.94958
In this case, there is not a big difference.
I suppose the reason is the compilers I am using: GNU GCC on the Intel machine versus Apple's g++. So which compiler flags should I pass to GNU GCC (Intel) to see a performance gap between the SIMD reduction and the regular reduction as large as the one I see on the Apple M1?
Update:
I realized that g++ on macOS is an alias for Clang (as also pointed out by @CodyGray), so I was using different compilers in my previous experiments. I have now tried Clang on the Intel architecture, and indeed I obtained reduction timings similar to those on the Apple M1. However, the question remains: is there any modification I can make, either in the source code or in the compiler flags, to make my GCC-compiled code as efficient as Clang's?