Shuffling nibbles with x64 SIMD

Posted 2025-01-22 13:50:22

I'm aware of byte shuffling instructions, but I'd like to do the same with nibbles (4-bit values), concretely I'd like to shuffle 16 nibbles in a 64-bit word. My shuffling indices are also stored as 16 nibbles. What's the most efficient implementation of this?

2 Answers

爱格式化 2025-01-29 13:50:22

Arbitrary shuffles with a control vector that has to be stored this way? Ugh, hard to work with. I guess you'd have to unpack both to feed SSSE3 pshufb and then re-pack that result.

Probably just punpcklbw against a right-shifted copy, then AND mask to keep only the low 4 bits in each byte. Then pshufb.

Sometimes an odd/even split is easier than widening each element (so bits just stay within their original byte or word). In this case, if we could change your nibble index numbering, punpcklqdq could put the odd or even nibbles in the high half, ready to bring them back down and OR.
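
To illustrate that split layout, here's a rough, untested sketch. The pshufb that would sit between the split and the re-pack is elided (so this just round-trips the data), and the shuffle indices would have to be renumbered to match this layout; the point is that each nibble never leaves its original byte.

// UNTESTED sketch of the odd/even split and its cheap re-pack (same headers as the function below)
static inline uint64_t split_and_repack(uint64_t data)
{
  __m128i v     = _mm_cvtsi64_si128(data);
  __m128i evens = _mm_and_si128(v, _mm_set1_epi8(0x0f));                    // even nibbles, one per byte
  __m128i odds  = _mm_and_si128(_mm_srli_epi32(v, 4), _mm_set1_epi8(0x0f)); // odd nibbles, one per byte
  __m128i split = _mm_unpacklo_epi64(evens, odds);  // punpcklqdq: evens in bytes 0-7, odds in bytes 8-15

  // (a pshufb with indices renumbered for this layout would go here)

  __m128i odds_back = _mm_unpackhi_epi64(split, split);                  // bring the odd nibbles back down
  __m128i repacked  = _mm_or_si128(split, _mm_slli_epi16(odds_back, 4)); // each nibble < 16, so the shift can't spill into the next byte
  return _mm_cvtsi128_si64(repacked);                                    // low 8 bytes hold the 16 repacked nibbles
}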

But without doing that, re-packing is a separate problem. I guess combine adjacent pairs of bytes into a word in the low byte, perhaps with pmaddubsw if throughput is more important than latency. Then you can packuswd (against zero or itself) or pshufb (with a constant control vector).

If you were doing multiple such shuffles, you could pack two vectors down to one, to store with movhps / movq. Using AVX2, it might be possible to have all the other instructions working on two independent shuffles in the two 128-bit lanes.

// UNTESTED, requires only SSSE3
#include <stdint.h>
#include <immintrin.h>

uint64_t shuffle_nibbles(uint64_t data, uint64_t control)
{
  __m128i vd = _mm_cvtsi64_si128(data);    // movq
  __m128i vd_hi = _mm_srli_epi32(vd, 4);   // x86 doesn't have a SIMD byte shift
  vd = _mm_unpacklo_epi8(vd, vd_hi);       // every nibble at the bottom of a byte, with high garbage
  vd = _mm_and_si128(vd, _mm_set1_epi8(0x0f));  // clear high garbage for later merging

  __m128i vc = _mm_cvtsi64_si128(control);
  __m128i vc_hi = _mm_srli_epi32(vc, 4);
  vc = _mm_unpacklo_epi8(vc, vc_hi);

  vc = _mm_and_si128(vc, _mm_set1_epi8(0x0f));  // make sure high bit is clear, else pshufb zeros that element.
       //  AVX-512VBMI  vpermb doesn't have that problem, if you have it available
  vd = _mm_shuffle_epi8(vd, vc);

       // left-hand input is the unsigned one, right hand is treated as signed bytes.
  vd = _mm_maddubs_epi16(vd, _mm_set1_epi16(0x1001));  // hi nibbles << 4 (*= 0x10), lo nibbles *= 1.

  // vd has nibbles merged into bytes, but interleaved with zero bytes
  vd = _mm_packus_epi16(vd, vd);  // duplicate vd into low & high halves.
  //  Pack against _mm_setzero_si128() if you're not just going to movq into memory or a GPR and you want the high half of the vector to be zero.
  return _mm_cvtsi128_si64(vd);
}

Masking the data with 0x0f ahead of the shuffle (instead of after) allows more ILP on CPUs with two shuffle units. At least if they already had the uint64_t values in vector registers, or if the data and control values are coming from memory so both can be loaded in the same cycle. If coming from GPRs, 1/clock throughput for vmovq xmm, reg means there's a resource conflict between the dep chains so they can't both start in the same cycle. But since the data might be ready before the control, masking early keeps it off the critical path for control->output latency.

If latency is a bottleneck instead of the usual throughput, consider replacing pmaddubsw with right-shift by 4, por, and AND/pack. Or pshufb to pack while ignoring garbage in odd bytes. Since you'd need another constant anyway, might as well make it a pshufb constant instead of and.
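
A rough, untested sketch of that lower-latency re-pack, picking up vd right after the _mm_shuffle_epi8 in the function above (each 16-bit word holds the low output nibble in its low byte and the high output nibble in its high byte):

// UNTESTED: shift/OR merge, then pshufb to pack, ignoring garbage in odd bytes
__m128i hi = _mm_srli_epi16(vd, 4);   // high nibble moves down to bits [7:4] of the low byte
vd = _mm_or_si128(vd, hi);            // low byte of each word = merged output byte; odd bytes = garbage
vd = _mm_shuffle_epi8(vd, _mm_setr_epi8(0, 2, 4, 6, 8, 10, 12, 14,
                                        -1, -1, -1, -1, -1, -1, -1, -1)); // pack even bytes; upper half zeroed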

If you had AVX-512, a shift and bit-blend with vpternlogd could avoid needing to mask the data before shuffling, and vpermb instead of vpshufb would avoid needing to mask the control, so you'd avoid the set1_epi8(0x0f) constant entirely.
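
One possible reading of that, as a rough, untested sketch (requires AVX512VBMI + AVX512VL): vpermb only uses the low 4 bits of each index for a 16-byte table, so the control needs no masking, and a shift plus vpternlogd bit-blend replaces the data mask and the pmaddubsw merge. Note that this particular formulation still uses a 0x0f vector as the vpternlogd selector, so it saves the two AND instructions rather than the constant itself:

// UNTESTED, requires AVX512VBMI + AVX512VL
uint64_t shuffle_nibbles_avx512(uint64_t data, uint64_t control)
{
  __m128i vd = _mm_cvtsi64_si128(data);
  vd = _mm_unpacklo_epi8(vd, _mm_srli_epi32(vd, 4));  // wanted nibble in the low 4 bits of each byte, garbage above
  __m128i vc = _mm_cvtsi64_si128(control);
  vc = _mm_unpacklo_epi8(vc, _mm_srli_epi32(vc, 4));  // high garbage is fine: vpermb ignores it for a 16-byte table

  vd = _mm_permutexvar_epi8(vc, vd);                  // vpermb

  // Bit-blend: low output nibble from vd, high output nibble from (vd >> 4), selected by 0x0f
  __m128i hi = _mm_srli_epi16(vd, 4);
  vd = _mm_ternarylogic_epi32(_mm_set1_epi8(0x0f), vd, hi, 0xCA);  // msk ? vd : hi
  // Low byte of each word now holds a merged output byte; pack the even bytes down
  vd = _mm_shuffle_epi8(vd, _mm_setr_epi8(0, 2, 4, 6, 8, 10, 12, 14,
                                          -1, -1, -1, -1, -1, -1, -1, -1));
  return _mm_cvtsi128_si64(vd);
}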

clang's shuffle optimizer didn't spot anything, just compiling it as-written like GCC does (https://godbolt.org/z/xz7TTbM1d), even with -march=sapphirerapids. Not spotting that it could use vpermb instead of vpand / vpshufb.

shuffle_nibbles(unsigned long, unsigned long):
        vmovq   xmm0, rdi
        vpsrld  xmm1, xmm0, 4
        vpunpcklbw      xmm0, xmm0, xmm1        # xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3],xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]
        vmovq   xmm1, rsi
        vpsrld  xmm2, xmm1, 4
        vpunpcklbw      xmm1, xmm1, xmm2        # xmm1 = xmm1[0],xmm2[0],xmm1[1],xmm2[1],xmm1[2],xmm2[2],xmm1[3],xmm2[3],xmm1[4],xmm2[4],xmm1[5],xmm2[5],xmm1[6],xmm2[6],xmm1[7],xmm2[7]
        vmovdqa xmm2, xmmword ptr [rip + .LCPI0_0] # xmm2 = [15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15]
        vpand   xmm0, xmm0, xmm2
        vpand   xmm1, xmm1, xmm2
        vpshufb xmm0, xmm0, xmm1
        vpmaddubsw      xmm0, xmm0, xmmword ptr [rip + .LCPI0_1]
        vpackuswb       xmm0, xmm0, xmm0
        vmovq   rax, xmm0
        ret

(Without AVX, it requires 2 extra movdqa register-copy instructions.)
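
Following up the AVX2 two-lane idea above, a rough, untested sketch that runs two independent 64-bit shuffles at once, one per 128-bit lane (the array-of-two interface is just for illustration; every per-lane instruction is the same as in the SSSE3 version):

// UNTESTED, requires AVX2
#include <stdint.h>
#include <immintrin.h>

void shuffle_nibbles_x2(const uint64_t data[2], const uint64_t control[2], uint64_t out[2])
{
  // One 64-bit input in the low half of each 128-bit lane
  __m256i vd = _mm256_setr_m128i(_mm_cvtsi64_si128(data[0]), _mm_cvtsi64_si128(data[1]));
  __m256i vc = _mm256_setr_m128i(_mm_cvtsi64_si128(control[0]), _mm_cvtsi64_si128(control[1]));

  vd = _mm256_unpacklo_epi8(vd, _mm256_srli_epi32(vd, 4));   // unpack works within each lane
  vd = _mm256_and_si256(vd, _mm256_set1_epi8(0x0f));
  vc = _mm256_unpacklo_epi8(vc, _mm256_srli_epi32(vc, 4));
  vc = _mm256_and_si256(vc, _mm256_set1_epi8(0x0f));

  vd = _mm256_shuffle_epi8(vd, vc);                          // vpshufb is per-lane, which is what we want here
  vd = _mm256_maddubs_epi16(vd, _mm256_set1_epi16(0x1001));
  vd = _mm256_packus_epi16(vd, vd);                          // also per-lane

  out[0] = _mm_cvtsi128_si64(_mm256_castsi256_si128(vd));
  out[1] = _mm_cvtsi128_si64(_mm256_extracti128_si256(vd, 1));
}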

何止钟意 2025-01-29 13:50:22

I came across this problem today. In AVX-512 you can use vpmultishiftqb (1), an amusing instruction available in Ice Lake and after (and apparently in Zen 4, according to Wikipedia), to shuffle nibbles much more quickly. Its power lies in its ability to permute bytes in an unaligned fashion: It takes the eight 8-bit chunks in each 64-bit element and selects unaligned 8-bit chunks from the corresponding element. Below is an implementation.

#include <immintrin.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

// Convention: (a & (0xf << (4 * i))) >> (4 * i) is the ith nibble of a
// (i.e., lowest-significant is 0)
uint64_t shuffle_nibbles(uint64_t data, uint64_t indices) {
#if defined(__AVX512VBMI__) && defined(__AVX512VL__)
    // If your data is already in vectors, then this method also works in parallel
    const __m128i lo_nibble_msk = _mm_set1_epi8(0x0f);
    __m128i v_data = _mm_cvtsi64_si128(data);
    __m128i v_indices = _mm_cvtsi64_si128(indices);

    __m128i indices_lo = _mm_and_si128(lo_nibble_msk, v_indices);
    __m128i indices_hi = _mm_andnot_si128(lo_nibble_msk, v_indices);
    indices_lo = _mm_slli_epi32(indices_lo, 2);
    indices_hi = _mm_srli_epi32(indices_hi, 2);

    // Look up unaligned bytes
    __m128i shuffled_hi = _mm_multishift_epi64_epi8(indices_hi, v_data);
    __m128i shuffled_lo = _mm_multishift_epi64_epi8(indices_lo, v_data);

    shuffled_hi = _mm_slli_epi32(shuffled_hi, 4);
    // msk ? lo : hi
    __m128i shuffled = _mm_ternarylogic_epi32(lo_nibble_msk, shuffled_lo, shuffled_hi, 202);

    return _mm_cvtsi128_si64(shuffled);
#else
    // Fallback scalar implementation (preferably Peter Cordes's SSE solution--this is as an example)
    uint64_t result = 0;
    for (int i = 0; i < 16; ++i) {
        indices = (indices >> 60) + (indices << 4);

        int idx = indices & 0xf;
        result <<= 4;
        result |= (data >> (4 * idx)) & 0xf;
    }

    return result;
#endif
}

int main() {
        // 0xaa025411fe034102
        uint64_t r1 = shuffle_nibbles(0xfedcba9876543210, 0xaa025411fe034102);
        // 0x55fdabee01fcbefd
        uint64_t r2 = shuffle_nibbles(0x0123456789abcdef, 0xaa025411fe034102);
        // 0xaaaa00002222aaaa
        uint64_t r3 = shuffle_nibbles(0xaa025411fe034102, 0xeeee11110000ffff);

        printf("0x%" PRIx64 "\n", r1);
        printf("0x%" PRIx64 "\n", r2);
        printf("0x%" PRIx64 "\n", r3);
}

Clang yields (2):

.LCPI0_0:
        .zero   16,60
shuffle_nibbles(unsigned long, unsigned long):
        vmovq   xmm0, rdi
        vmovq   xmm1, rsi
        vpslld  xmm2, xmm1, 2
        vpsrld  xmm1, xmm1, 2
        vmovdqa xmm3, xmmword ptr [rip + .LCPI0_0] # xmm3 = [60,60,60,60,60,60,60,60,60,60,60,60,60,60,60,60]
        vpand   xmm1, xmm1, xmm3
        vpmultishiftqb  xmm1, xmm1, xmm0
        vpand   xmm2, xmm2, xmm3
        vpmultishiftqb  xmm0, xmm2, xmm0
        vpslld  xmm1, xmm1, 4
        vpternlogd      xmm1, xmm0, dword ptr [rip + .LCPI0_1]{1to4}, 216
        vmovq   rax, xmm1

In my case, I am shuffling nibbles in 64-bit-element vectors; this method also avoids the need for widening. If your shuffle(s) is/are constant and you stay in vectors, this method reduces to a measly four instructions: 2x vpmultishiftqb, 1x vpslld, and 1x vpternlogd. Counting µops suggests a latency of 5 and a throughput of one every 2 cycles, bottlenecked on shuffle µops, for 128- and 256-bit vectors; and one every 3 cycles for 512-bit vectors, due to reduced execution units for the latter two instructions at that width.
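
For that constant-shuffle, stay-in-vectors case, a rough, untested sketch of the per-shuffle work (the two shift-count vectors would be precomputed from the constant indices; the names here are just illustrative):

// UNTESTED: per-shuffle work when the nibble indices are constant and data stays in vectors
static inline __m128i shuffle_nibbles_const(__m128i v_data,
                                            __m128i idx_lo_x4,   // precomputed: (indices & 0x0f0f...) << 2
                                            __m128i idx_hi_x4) { // precomputed: (indices & 0xf0f0...) >> 2
    const __m128i lo_nibble_msk = _mm_set1_epi8(0x0f);  // folds into the vpternlogd as a broadcast load
    __m128i shuffled_hi = _mm_multishift_epi64_epi8(idx_hi_x4, v_data);
    __m128i shuffled_lo = _mm_multishift_epi64_epi8(idx_lo_x4, v_data);
    shuffled_hi = _mm_slli_epi32(shuffled_hi, 4);
    // msk ? lo : hi
    return _mm_ternarylogic_epi32(lo_nibble_msk, shuffled_lo, shuffled_hi, 202);
}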
