Performance differences of SIMD operations across different CPU architectures

I see an important difference in the performance of a SIMD-based sum reduction relative to its scalar counterpart across different CPU architectures.

The function in question is simple: it receives a 16-byte-aligned array B of uint8_t elements and a range B[l, r], where l and r are both multiples of 16 and r marks the start of the last 16-byte block. The function returns the sum of the elements in that range (i.e., the bytes B[l] through B[r+15]).

This is my code:

#include <immintrin.h> // Intel intrinsics (replaced by sse2neon.h on ARM)
#include <cstddef>
#include <cstdint>

// SIMD version of the reduction: sums the 16-byte blocks starting at
// l, l+16, ..., r, i.e. the bytes B[l..r+15]
inline int simd_sum_red(size_t l, size_t r, const uint8_t* B) {

    __m128i zero = _mm_setzero_si128();
    // _mm_sad_epu8 against zero leaves two 64-bit partial sums of the 16
    // bytes in the low and high halves of the register (32-bit lanes 0 and 2)
    __m128i sum0 = _mm_sad_epu8(zero, _mm_load_si128(reinterpret_cast<const __m128i*>(B+l)));
    l += 16;
    while (l <= r) {
        __m128i sum1 = _mm_sad_epu8(zero, _mm_load_si128(reinterpret_cast<const __m128i*>(B+l)));
        sum0 = _mm_add_epi32(sum0, sum1);
        l += 16;
    }

    // bring lane 2 down to lane 0 and add; the total ends up in lane 0
    __m128i totalsum = _mm_add_epi32(sum0, _mm_shuffle_epi32(sum0, 2));
    return _mm_cvtsi128_si32(totalsum);
}

// Regular (scalar) reduction over the same byte range B[l..r+15]
inline size_t reg_sum_red(size_t l, size_t r, const uint8_t* B) {
    size_t acc = 0;
    // r marks the start of the last 16-byte block, so sum through B[r+15]
    // to match what simd_sum_red computes
    for (size_t i = l; i < r + 16; i++) {
        acc += B[i];
    }
    return acc;
}
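
Before benchmarking, a quick sanity check that the two functions agree on random block-aligned ranges can look like this (a minimal sketch, assuming the two functions above are in scope):

#include <cstdio>
#include <cstdlib>

int main() {
    // 256 bytes = 16 SIMD registers, matching the experiment below
    alignas(16) uint8_t B[256];
    for (auto& b : B) b = static_cast<uint8_t>(std::rand() & 0xFF);

    for (int t = 0; t < 1000; ++t) {
        size_t l = 16 * (std::rand() % 8);                 // block-aligned start
        size_t r = l + 16 * (std::rand() % (16 - l / 16)); // start of the last block
        if (static_cast<size_t>(simd_sum_red(l, r, B)) != reg_sum_red(l, r, B)) {
            std::printf("mismatch on [%zu, %zu]\n", l, r);
            return 1;
        }
    }
    std::printf("all ranges agree\n");
    return 0;
}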

It is worth mentioning that I built my SIMD function from the answers to another question I asked a couple of days ago:

Accessing the fields of a __m128i variable in a portable way

For the experiments, I took random ranges of B spanning at most 256 elements (16 SIMD registers) and measured the average number of nanoseconds each function spent per symbol of B[l, r]. I compared two CPU architectures: Apple M1 and Intel(R) Xeon(R) Silver 4110. I used the same source code in both cases, and the same compiler (g++) with the flags -std=c++17 -msse4.2 -O3 -funroll-loops -fomit-frame-pointer -ffast-math. The only difference is that for the Apple M1 I had to include an extra header called sse2neon.h, which maps the Intel intrinsics to Neon intrinsics (the SIMD instruction set of ARM-based architectures), and I omitted the -msse4.2 flag in that case.
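
The timing driver itself is not shown above; a minimal sketch along these lines, with the ranges pre-generated so that rand() stays out of the timed loop, reproduces this kind of per-symbol measurement (the iteration count is illustrative):

#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <utility>
#include <vector>

int main() {
    alignas(16) uint8_t B[256];
    for (auto& b : B) b = static_cast<uint8_t>(std::rand() & 0xFF);

    // pre-generate random block-aligned ranges and count the symbols they cover
    const int iters = 1'000'000;
    std::vector<std::pair<size_t, size_t>> ranges(iters);
    size_t total_syms = 0;
    for (auto& [l, r] : ranges) {
        l = 16 * (std::rand() % 8);
        r = l + 16 * (std::rand() % (16 - l / 16));
        total_syms += r + 16 - l;
    }

    volatile size_t sink = 0; // keep the sums from being optimized away
    auto t0 = std::chrono::steady_clock::now();
    for (auto [l, r] : ranges)
        sink += reg_sum_red(l, r, B); // or simd_sum_red(l, r, B)
    auto t1 = std::chrono::steady_clock::now();

    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    std::printf("nanosecs/sym : %g\n", ns / total_syms);
    return 0;
}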

These are the results I obtained with the Apple M1 processor:

nanosecs/sym for reg_sum_red : 1.16952
nanosecs/sym for simd_sum_red : 0.383278

As you can see, there is a substantial difference between using SIMD instructions and not using them.

These are the results with the Intel(R) Xeon(R) Silver 4110 processor:

nanosecs/sym for reg_sum_red : 6.01793
nanosecs/sym for simd_sum_red : 5.94958

In this case, there is not a big difference.

I suppose the reason is the compilers I am using: GNU GCC on the Intel machine versus Apple's g++. So which compiler flags should I pass to GNU GCC (Intel) to see a performance difference between the SIMD reduction and the regular reduction as large as the one I see on the Apple M1?

Update:

I realized that g++ on macOS is an alias for Clang (as was also pointed out by @CodyGray), so I was using different compilers in my previous experiments. I have now tried Clang on the Intel architecture, and indeed I obtained reduction timings similar to Apple's. However, the question remains: is there any modification I can make, in either the source code or the compiler flags, to make my GCC-compiled code as efficient as Clang's?
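
One source-level change worth measuring, regardless of compiler, is to break the loop-carried dependency chain in simd_sum_red by keeping two independent accumulators, so the additions of consecutive blocks can overlap in the pipeline. Whether this closes the GCC/Clang gap here is an assumption to verify, not a given; a hypothetical sketch:

// variant with two independent accumulators: processes two 16-byte blocks
// per iteration, plus at most one leftover block
inline int simd_sum_red_2acc(size_t l, size_t r, const uint8_t* B) {
    __m128i zero = _mm_setzero_si128();
    __m128i acc0 = zero, acc1 = zero;
    while (l + 16 <= r) { // at least two whole blocks remain
        acc0 = _mm_add_epi32(acc0, _mm_sad_epu8(zero,
                   _mm_load_si128(reinterpret_cast<const __m128i*>(B + l))));
        acc1 = _mm_add_epi32(acc1, _mm_sad_epu8(zero,
                   _mm_load_si128(reinterpret_cast<const __m128i*>(B + l + 16))));
        l += 32;
    }
    if (l <= r) { // one block left
        acc0 = _mm_add_epi32(acc0, _mm_sad_epu8(zero,
                   _mm_load_si128(reinterpret_cast<const __m128i*>(B + l))));
    }
    // combine the two accumulators, then the two 64-bit partial sums
    __m128i sum = _mm_add_epi32(acc0, acc1);
    sum = _mm_add_epi32(sum, _mm_shuffle_epi32(sum, 2));
    return _mm_cvtsi128_si32(sum);
}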
