_mm_alignr_epi8 (PALIGNR) equivalent in AVX2

Published 2024-12-21 14:12:44


In SSSE3, the PALIGNR instruction performs the following:

PALIGNR concatenates the destination operand (the first operand) and the source operand (the second operand) into an intermediate composite, shifts the composite at byte granularity to the right by a constant immediate, and extracts the right-aligned result into the destination.

I'm currently in the midst of porting my SSE4 code to use AVX2 instructions, working on 256-bit registers instead of 128-bit.
Naively, I believed that the intrinsic _mm256_alignr_epi8 (VPALIGNR) performs the same operation as _mm_alignr_epi8, only on 256-bit registers. Sadly, however, that is not the case. In fact, _mm256_alignr_epi8 treats the 256-bit register as two 128-bit lanes and performs two "align" operations on the two neighboring 128-bit lanes, effectively performing the same operation as _mm_alignr_epi8 but on two registers at once. It's most clearly illustrated here: _mm256_alignr_epi8

Currently my solution is to keep using _mm_alignr_epi8 by splitting the ymm (256-bit) registers into two xmm (128-bit) registers (high and low), like so:

__m128i xmm_ymm1_lo = _mm256_extractf128_si256(ymm1, 0);  // index 0 = low 128 bits
__m128i xmm_ymm1_hi = _mm256_extractf128_si256(ymm1, 1);  // index 1 = high 128 bits
__m128i xmm_ymm2_lo = _mm256_extractf128_si256(ymm2, 0);
__m128i xmm_ymm_aligned_lo = _mm_alignr_epi8(xmm_ymm1_hi, xmm_ymm1_lo, 1);
__m128i xmm_ymm_aligned_hi = _mm_alignr_epi8(xmm_ymm2_lo, xmm_ymm1_hi, 1);
// _mm256_set_m128i takes the high half first.
__m256i xmm_ymm_aligned = _mm256_set_m128i(xmm_ymm_aligned_hi, xmm_ymm_aligned_lo);

This works, but there has to be a better way, right?
Is there perhaps a more "general" AVX2 instruction that I should be using to get the same result?


Answers (3)

想你的星星会说话 2024-12-28 14:12:44


What are you using palignr for? If it's only to handle data misalignment, simply use misaligned loads instead; they are generally "fast enough" on modern Intel µ-architectures (and will save you a lot of code size).

If you need palignr-like behavior for some other reason, you can simply take advantage of the unaligned load support to do it in a branch-free manner. Unless you're totally load-store bound, this is probably the preferred idiom.

#include <immintrin.h>
#include <stdalign.h>

// Renamed so it can't collide with the real _mm256_alignr_epi8 intrinsic,
// which some compilers define as a macro.
static inline __m256i avx2_alignr_epi8(const __m256i v0, const __m256i v1, const int n)
{
    // The aligned stores below require 32-byte alignment; 64-byte alignment
    // also avoids any possibility of a page-boundary-crossing load.
    alignas(64) char buffer[64];

    // Two aligned stores to fill the buffer.
    _mm256_store_si256((__m256i *)&buffer[0], v0);
    _mm256_store_si256((__m256i *)&buffer[32], v1);

    // Misaligned load to get the data we want.
    return _mm256_loadu_si256((__m256i *)&buffer[n]);
}

If you can provide more information about how exactly you're using palignr, I can probably be more helpful.

め七分饶幸 2024-12-28 14:12:44


We need two instructions, "vperm2i128" and "vpalignr", to extend "palignr" to 256 bits.

See: https://software.intel.com/en-us/blogs/2015/01/13/programming-using-avx2-permutations

浅唱々樱花落 2024-12-28 14:12:44


The only solution I was able to come up with for this is:

// Note: n must be a compile-time constant after inlining, because
// _mm_alignr_epi8 takes an immediate operand.  Renamed so it doesn't
// collide with the real _mm256_alignr_epi8 intrinsic (which some
// compilers define as a macro).
static inline __m256i avx2_alignr_epi8(const __m256i v0, const __m256i v1, const int n)
{
  if (n < 16)
  {
    __m128i v0_lo = _mm256_extractf128_si256(v0, 0);  // index 0 = low lane
    __m128i v0_hi = _mm256_extractf128_si256(v0, 1);  // index 1 = high lane
    __m128i v1_lo = _mm256_extractf128_si256(v1, 0);
    __m128i out_lo = _mm_alignr_epi8(v0_hi, v0_lo, n);
    __m128i out_hi = _mm_alignr_epi8(v1_lo, v0_hi, n);
    return _mm256_set_m128i(out_hi, out_lo);
  }
  else
  {
    __m128i v0_hi = _mm256_extractf128_si256(v0, 1);
    __m128i v1_lo = _mm256_extractf128_si256(v1, 0);
    __m128i v1_hi = _mm256_extractf128_si256(v1, 1);
    __m128i out_lo = _mm_alignr_epi8(v1_lo, v0_hi, n - 16);
    __m128i out_hi = _mm_alignr_epi8(v1_hi, v1_lo, n - 16);
    return _mm256_set_m128i(out_hi, out_lo);
  }
}

which I think is pretty much identical to your solution, except that it also handles shifts of >= 16 bytes.
