SSE multiplication 16 x uint8_t

Posted 2024-12-16 23:02:32

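I want to use SSE4 to multiply two __m128i objects, each holding 16 unsigned 8-bit integers, but I could only find intrinsics for multiplying 16-bit integers. Is there nothing like _mm_mult_epi8?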

一页 2024-12-23 23:02:32

A (potentially) faster way than Marat's solution based on Agner Fog's solution:

Instead of splitting high/low, split odd/even. This has the added benefit that it works with pure SSE2 instead of requiring SSE4.1 (of no use to the OP, but a nice bonus for some). I also added an optimization for when you have AVX2. Technically, the AVX2-optimized path uses only SSE2 intrinsics, but without AVX2 it is slower than the shift-left-then-shift-right version.

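#include <emmintrin.h> // SSE2

__m128i mullo_epi8(__m128i a, __m128i b)
{
    // unpack and multiply
    // Even bytes: the low byte of each 16-bit product depends only on the low
    // bytes of the operands, so the odd bytes left in the upper halves are harmless.
    __m128i dst_even = _mm_mullo_epi16(a, b);
    // Odd bytes: shift them down into the low half of each 16-bit lane first.
    __m128i dst_odd = _mm_mullo_epi16(_mm_srli_epi16(a, 8), _mm_srli_epi16(b, 8));
    // repack: odd products go back to the high bytes, even products keep only their low byte
#ifdef __AVX2__
    // only faster if have access to VPBROADCASTW
    return _mm_or_si128(_mm_slli_epi16(dst_odd, 8), _mm_and_si128(dst_even, _mm_set1_epi16(0xFF)));
#else
    return _mm_or_si128(_mm_slli_epi16(dst_odd, 8), _mm_srli_epi16(_mm_slli_epi16(dst_even, 8), 8));
#endif
}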
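
One way to gain confidence in the odd/even approach is to compare it against a plain scalar loop over every pair of byte values. The following is only a minimal sketch: the names mullo_epi8_oddeven and count_mismatches are made up for this illustration, and the function body repeats the mask-based repack from the answer so that the block compiles on its own with SSE2.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Illustrative copy of the odd/even multiply (mask-based repack). */
static __m128i mullo_epi8_oddeven(__m128i a, __m128i b)
{
    __m128i even = _mm_mullo_epi16(a, b);
    __m128i odd  = _mm_mullo_epi16(_mm_srli_epi16(a, 8), _mm_srli_epi16(b, 8));
    return _mm_or_si128(_mm_slli_epi16(odd, 8),
                        _mm_and_si128(even, _mm_set1_epi16(0x00FF)));
}

/* Counts lanes where the SIMD result differs from the wrapped scalar product.
   The per-lane offsets make neighbouring bytes differ, so any leakage between
   the odd and even halves of a 16-bit lane would show up as a mismatch. */
static int count_mismatches(void)
{
    int mismatches = 0;
    for (int x = 0; x < 256; ++x) {
        for (int y = 0; y < 256; ++y) {
            uint8_t a[16], b[16], c[16];
            for (int i = 0; i < 16; ++i) {
                a[i] = (uint8_t)(x + i);
                b[i] = (uint8_t)(y + 3 * i);
            }
            __m128i va = _mm_loadu_si128((const __m128i *)a);
            __m128i vb = _mm_loadu_si128((const __m128i *)b);
            _mm_storeu_si128((__m128i *)c, mullo_epi8_oddeven(va, vb));
            for (int i = 0; i < 16; ++i) {
                if (c[i] != (uint8_t)(a[i] * b[i]))
                    ++mismatches;
            }
        }
    }
    return mismatches;
}
```

Calling count_mismatches() from a test program should return 0 if the repack is right; every lane sees all 256 x 256 operand pairs because x and y each sweep the full byte range.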

Agner uses the _mm_blendv_epi8 intrinsic, which requires SSE4.1.

Edit:

Interestingly, after doing more disassembly work (with optimized builds), at least my two implementations get compiled to exactly the same thing. Example disassembly targeting "ivy-bridge" (AVX).

vpmullw xmm2,xmm0,xmm1
vpsrlw xmm0,xmm0,0x8
vpsrlw xmm1,xmm1,0x8
vpmullw xmm0,xmm0,xmm1
vpsllw xmm0,xmm0,0x8
vpand xmm1,xmm2,XMMWORD PTR [rip+0x281]
vpor xmm0,xmm0,xmm1

It uses the "AVX2-optimized" version with a precomputed 128-bit xmm constant loaded from memory. Compiling with only SSE2 support produces similar results (though using SSE2 instructions). I suspect Agner Fog's original solution might get optimized to the same thing (it would be crazy if it didn't). I have no idea how Marat's original solution compares in an optimized build, though for me, having a single method for every x86 SIMD extension from SSE2 onward is quite nice.

玩心态 2024-12-23 23:02:32

There is no 8-bit multiplication in MMX/SSE/AVX. However, you can emulate an 8-bit multiplication intrinsic using 16-bit multiplication as follows:

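#include <smmintrin.h> // SSE4.1 (_mm_cvtepu8_epi16); also pulls in SSSE3 (_mm_shuffle_epi8)

inline __m128i _mm_mullo_epi8(__m128i a, __m128i b)
{
    // Zero-extend the low and high 8 bytes of each input to 16-bit lanes.
    __m128i zero = _mm_setzero_si128();
    __m128i Alo = _mm_cvtepu8_epi16(a);
    __m128i Ahi = _mm_unpackhi_epi8(a, zero);
    __m128i Blo = _mm_cvtepu8_epi16(b);
    __m128i Bhi = _mm_unpackhi_epi8(b, zero);
    // 16-bit products; only the low byte of each lane is kept below.
    __m128i Clo = _mm_mullo_epi16(Alo, Blo);
    __m128i Chi = _mm_mullo_epi16(Ahi, Bhi);
    // Shuffle controls: pick the low byte of every 16-bit lane; 0x80 zeroes the output byte.
    __m128i maskLo = _mm_set_epi8(0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 14, 12, 10, 8, 6, 4, 2, 0);
    __m128i maskHi = _mm_set_epi8(14, 12, 10, 8, 6, 4, 2, 0, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80);
    // Combine the two half-results into one vector of 16 bytes.
    __m128i C = _mm_or_si128(_mm_shuffle_epi8(Clo, maskLo), _mm_shuffle_epi8(Chi, maskHi));

    return C;
}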
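
If SSSE3 and SSE4.1 are not available, the same high/low split can probably be written with SSE2 alone. The sketch below is a variation, not the code from this answer: it zero-extends with the unpack intrinsics, masks each product to its low byte, and narrows with _mm_packus_epi16 instead of the shuffle. The names mullo_epi8_sse2_hilo and mul_bytes_sse2_hilo are hypothetical.

```c
#include <emmintrin.h>  /* SSE2 only */
#include <stdint.h>

/* Hypothetical SSE2-only variant of the high/low split. */
static __m128i mullo_epi8_sse2_hilo(__m128i a, __m128i b)
{
    const __m128i zero     = _mm_setzero_si128();
    const __m128i low_byte = _mm_set1_epi16(0x00FF);

    /* Zero-extend bytes 0..7 and 8..15 to 16-bit lanes. */
    __m128i a_lo = _mm_unpacklo_epi8(a, zero);
    __m128i a_hi = _mm_unpackhi_epi8(a, zero);
    __m128i b_lo = _mm_unpacklo_epi8(b, zero);
    __m128i b_hi = _mm_unpackhi_epi8(b, zero);

    /* Keep only the low byte of each product so that the saturating
       pack below sees values in 0..255 and passes them through unchanged. */
    __m128i c_lo = _mm_and_si128(_mm_mullo_epi16(a_lo, b_lo), low_byte);
    __m128i c_hi = _mm_and_si128(_mm_mullo_epi16(a_hi, b_hi), low_byte);

    /* Narrow both halves back to 16 bytes: c_lo fills bytes 0..7, c_hi bytes 8..15. */
    return _mm_packus_epi16(c_lo, c_hi);
}

/* Convenience wrapper over plain arrays (illustrative helper). */
static void mul_bytes_sse2_hilo(const uint8_t a[16], const uint8_t b[16], uint8_t out[16])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, mullo_epi8_sse2_hilo(va, vb));
}
```

The masking step matters because _mm_packus_epi16 saturates: without it, any product above 255 would clamp to 255 instead of wrapping modulo 256.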

戏蝶舞 2024-12-23 23:02:32

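The only 8-bit SSE multiply instruction is PMADDUBSW (SSSE3 and later; C/C++ intrinsic: _mm_maddubs_epi16, https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#ig_expand=4653,4653,4653&text=_mm_maddubs_epi16). It multiplies 16 x 8-bit unsigned values by 16 x 8-bit signed values and then sums adjacent pairs to give 8 x 16-bit signed results. If you can't use this rather specialised instruction, you will need to unpack to pairs of 16-bit vectors and use the regular 16-bit multiply instructions. Obviously this implies at least a 2x throughput hit, so use the 8-bit multiply if you possibly can.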
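
To make the "rather specialised" behaviour concrete, here is a scalar model of what a single _mm_maddubs_epi16 call computes, as I understand the instruction: the first operand is read as unsigned bytes, the second as signed bytes, and each pair of adjacent products is added with signed 16-bit saturation. The function name maddubs_model is hypothetical, and this is a reference sketch rather than a replacement for the intrinsic.

```c
#include <stdint.h>

/* Scalar model of PMADDUBSW: a holds 16 unsigned bytes, b holds 16 signed
   bytes, out receives 8 signed 16-bit sums of adjacent products. */
static void maddubs_model(const uint8_t a[16], const int8_t b[16], int16_t out[8])
{
    for (int i = 0; i < 8; ++i) {
        int32_t sum = (int32_t)a[2 * i]     * (int32_t)b[2 * i]
                    + (int32_t)a[2 * i + 1] * (int32_t)b[2 * i + 1];
        /* The pair sum is clamped to the int16_t range (signed saturation). */
        if (sum > INT16_MAX) sum = INT16_MAX;
        if (sum < INT16_MIN) sum = INT16_MIN;
        out[i] = (int16_t)sum;
    }
}
```

The model shows why the instruction is not a drop-in lane-wise 8-bit multiply: the output has half as many lanes, neighbouring products are merged, and one operand is treated as signed.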
