SSE interleave/merge of two vectors using a mask, per-element conditional move?

Posted on 2025-02-10 13:24:52


Essentially I am trying to implement a ternary-like operation on 2 SSE (__m128) vectors.
The mask is another __m128 vector obtained from _mm_cmplt_ps.

What I want to achieve is to select the element of vector a when the corresponding element of the mask is 0xffff'ffff, and the element of b when the mask's element is 0.

Example of the desired operation (in semi-pseudocode):

const __m128i mask = {0xffffffff, 0, 0xffffffff, 0};  // e.g. a compare result
const __m128 a = {1.0, 1.1, 1.2, 1.3};
const __m128 b = {2.0, 2.1, 2.2, 2.3};
const __m128 c = interleave(a, b, mask); // c contains {1.0, 2.1, 1.2, 2.3}

I am having trouble implementing this operation in SIMD (SSE) intrinsics.
My original idea was to mix a and b using moves and then shuffle the elements using the mask, however _mm_shuffle_ps takes an int mask consisting of four 2-bit indices, not an __m128 mask.

Another idea was to use something akin to a conditional move, but there does not seem to be a conditional move in SSE (or at least I did not manage to find it in Intel's guide).

How is this normally done in SSE?


Comments (2)

和我恋爱吧 2025-02-17 13:24:52


That's called a "blend".
Intel's intrinsics guide groups blend instructions under the "swizzle" category, along with shuffles.

You're looking for SSE4.1 blendvps (intrinsic _mm_blendv_ps). The other element sizes are _mm_blendv_pd and _mm_blendv_epi8. These use the high bit of the corresponding element as the control, so you can use a float directly (without _mm_cmp_ps) if its sign bit is interesting.

__m128 mask = _mm_cmplt_ps(x, y);      // per-element 0 / -1 (all-ones) bit patterns
__m128 c = _mm_blendv_ps(b, a, mask);  // copy element from 2nd op where the mask is set

Note that I reversed a, b to b, a because SSE blends take the element from the 2nd operand in positions where the mask was set. Like a conditional-move which copies when the condition is true. If you name your constants / variables accordingly, you can write blend(a,b, mask) instead of having them backwards. Or give them meaningful names like ones and twos.


In other cases where your control operand is a constant, there's also _mm_blend_ps / pd / _mm_blend_epi16 (an 8-bit immediate operand can only control 8 separate elements, so 8x 2-byte.)
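
As a minimal sketch of the immediate form (assumes SSE4.1; the helper name is mine, and the constant 0xA = 0b1010 selects elements 1 and 3 from the second operand, reproducing the example result from the question):

#include <smmintrin.h>  // SSE4.1

// Immediate-control blend: bit i of the constant selects element i from b,
// a clear bit keeps element i from a.  0xA (0b1010) gives {a0, b1, a2, b3}.
static inline __m128 pick_even_a_odd_b(__m128 a, __m128 b) {
    return _mm_blend_ps(a, b, 0xA);   // {1.0, 2.1, 1.2, 2.3} for the question's vectors
}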

Performance

blendps xmm, xmm, imm8 is a single-uop instruction for any vector ALU port on Intel CPUs, as cheap as andps. (https://uops.info/). pblendw is also single-uop, but only runs on port 5 on Intel, competing with shuffles. AVX2 vpblendd blends with dword granularity, an integer version of vblendps, and with the same very good efficiency. (It's an integer-SIMD instruction; unlike shuffles, blends have extra bypass latency on Intel CPUs if you mix integer and FP SIMD.)

But variable blendvps is 2 uops on Intel before Skylake (and only for port 5). And the AVX version (vblendvps) is unfortunately still 2 uops on Intel (3 on Alder Lake-P, 4 on Alder Lake-E). Although the uops can at least run on any of 3 vector ALU ports.

The vblendvps version is funky in asm because it has 4 operands, not overwriting any of the input registers. (The non-AVX version overwrites one input, and uses XMM0 implicitly as the mask input.) Intel uops apparently can't handle 4 separate registers, only 3 for stuff like FMA, adc, and cmov. (And AVX-512 vpternlogd which can do a bitwise blend as a single uop)

AMD has fully efficient handling of vblendvps, single uop (except for YMM on Zen1) with 2/clock throughput.


Without SSE4.1, you can emulate with ANDN/AND/OR

(x&~mask) | (y&mask) is equivalent to _mm_blendv_ps(x,y,mask), except it's pure bitwise so all the bits of each mask element should match the top bit. (e.g. a compare result, or broadcast the top bit with _mm_srai_epi32(mask, 31).)

Compilers know this trick and will use it when auto-vectorizing scalar code if you compile without any arch options like -march=haswell or whatever. (SSE4.1 was new in 2nd-gen Core 2, so it's increasingly widespread but not universal.)
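
For reference, a minimal SSE2-only helper implementing that expression (the function name is mine; the mask must be all-ones or all-zeros per element, e.g. a compare result):

#include <xmmintrin.h>  // SSE

// (x & ~mask) | (y & mask): picks y where the mask element is all-ones, x where it is zero.
static inline __m128 blendv_ps_sse2(__m128 x, __m128 y, __m128 mask) {
    return _mm_or_ps(_mm_andnot_ps(mask, x),   // x & ~mask
                     _mm_and_ps(mask, y));     // y & mask
}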

For constant / loop-invariant a^b without SSE4.1

x ^ ((x ^ y) & mask) saves one operation if you can reuse x ^ y. (Suggested in comments by Aki.) Otherwise this is worse, with longer critical-path latency and no instruction-level parallelism.
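
A sketch of that form (the helper name is mine); the x ^ y term can be hoisted out of a loop when x and y are loop-invariant:

#include <xmmintrin.h>  // SSE

// x ^ ((x ^ y) & mask): equals y where the mask element is all-ones, x where it is zero.
static inline __m128 blendv_ps_xor(__m128 x, __m128 y, __m128 mask) {
    __m128 diff = _mm_xor_ps(x, y);               // precompute this if x ^ y is loop-invariant
    return _mm_xor_ps(x, _mm_and_ps(diff, mask));
}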

Without AVX non-destructive 3-operand instructions, this way would need a movaps xmm,xmm register-copy to save b, but it can choose to destroy the mask instead of a. The AND/ANDN/OR way would normally destroy its 2nd operand, the one you use with y&mask, and destroy the mask with ANDN (~mask & x).

With AVX, vblendvps is guaranteed available. Although if you're targeting Intel (especially Haswell) and don't care about AMD, you might still choose an AND/XOR if a^b can be pre-computed.

Blending with 0: just AND[N]

(Applies to integer and FP; the bit-pattern for 0.0f and 0.0 is all-zeros, same as integer 0.)

You don't need to copy a zero from anywhere, just x & mask, or x & ~mask.

(The (x & ~mask) | (y & mask) expression reduces to this for x=0 or y=0; that term becomes zero, and z|=0 is a no-op.)

For example, to implement x = mask ? x+y : x, which would put the latency of an add and a blend on the critical path, you can simplify to x += (y or zero, selected by the mask), i.e. x += y & mask. Or to do the opposite, x += ~mask & y using _mm_andnot_ps(mask, vy).

This has an ADD and an AND operation (so already cheaper than blend on some CPUs, and you don't need a 0.0 source operand in another register). Also, the dependency chain through x now only includes the += operation, if you were doing this in a loop with loop-carried x but independent y & mask. e.g. summing only matching elements of an array, sum += A[i]>=thresh ? A[i] : 0.0f;
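
As an illustration of that last point, a sketch of the conditional sum (array and function names are mine; assumes n is a multiple of 4, with no remainder handling):

#include <xmmintrin.h>  // SSE

// sum += A[i] >= thresh ? A[i] : 0.0f;  vectorized with cmpps + andps + addps.
float sum_matching(const float *A, int n, float thresh) {
    __m128 vthresh = _mm_set1_ps(thresh);
    __m128 vsum    = _mm_setzero_ps();
    for (int i = 0; i < n; i += 4) {
        __m128 v    = _mm_loadu_ps(&A[i]);
        __m128 keep = _mm_cmpge_ps(v, vthresh);       // all-ones where A[i] >= thresh
        vsum = _mm_add_ps(vsum, _mm_and_ps(v, keep)); // add A[i] or 0.0f
    }
    // horizontal sum of the 4 lanes (simple, not necessarily the fastest way)
    __m128 hi = _mm_movehl_ps(vsum, vsum);            // {lane2, lane3, lane2, lane3}
    vsum = _mm_add_ps(vsum, hi);                      // {0+2, 1+3, ...}
    vsum = _mm_add_ss(vsum, _mm_shuffle_ps(vsum, vsum, 1));
    return _mm_cvtss_f32(vsum);
}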

For an example of an extra slowdown due to lengthening the critical path unnecessarily, see gcc optimization flag -O3 makes code slower than -O2 where GCC's scalar asm using cmov has that flaw, doing cmov as part of the loop-carried dependency chain instead of to prepare a 0 or arr[i] input for it.

Clamping to a MIN or MAX

If you want something like a < upper ? a : upper, you can do that clamping in one instruction with _mm_min_ps instead of cmpps / blendvps. (Similarly _mm_max_ps, and _mm_min_pd / _mm_max_pd.)
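
For instance, a minimal clamp-to-range helper (names are mine):

#include <xmmintrin.h>  // SSE

// Clamp each element of v into [lo, hi] with one max and one min, no compare/blend needed.
static inline __m128 clamp_ps(__m128 v, __m128 lo, __m128 hi) {
    return _mm_min_ps(_mm_max_ps(v, lo), hi);
}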

See What is the instruction that gives branchless FP min and max on x86? for details on their exact semantics, including a longstanding (but recently fixed) GCC bug where the FP intrinsics didn't provide the expected strict-FP semantics of which operand would be the one to keep if one was NaN.

Or for integer, SSE2 is highly non-orthogonal (signed min/max for int16_t, unsigned min/max for uint8_t). Similar for saturating pack instructions. SSE4.1 fills in the missing operand-size and signedness combinations.

  • Signed: SSE2 _mm_max_epi16 (and corresponding mins for all of these)
    • SSE4.1 _mm_max_epi32 / _mm_max_epi8; AVX-512 _mm_max_epi64
  • Unsigned: SSE2 _mm_max_epu8
    • SSE4.1 _mm_max_epu16 / _mm_max_epu32; AVX-512 _mm_max_epu64

AVX-512 makes masking/blending a first-class operation

AVX-512 compares into a mask register, k0..k7 (intrinsic types __mmask16 and so on). Merge-masking or zero-masking can be part of most ALU instructions. There is also a dedicated blend instruction that blends according to a mask.
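
A minimal sketch with 128-bit vectors, assuming AVX-512F + AVX-512VL are available (function name is mine; compile with the matching -m options):

#include <immintrin.h>  // AVX-512F + AVX-512VL

// Compare into a k-register, then blend: c gets a where x < y, else b.
__m128 select_lt(__m128 x, __m128 y, __m128 a, __m128 b) {
    __mmask8 k = _mm_cmp_ps_mask(x, y, _CMP_LT_OQ); // bit i set where x[i] < y[i]
    return _mm_mask_blend_ps(k, b, a);              // pick a where the mask bit is set
    // Merge-masking can instead fold the select into another op, e.g.:
    // __m128 sum = _mm_mask_add_ps(x, k, x, y);    // x+y where set, else x
}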

I won't go into the details here, suffice it to say if you have a lot of conditional stuff to do, AVX-512 is great (even if you only use 256-bit vectors to avoid the turbo clock speed penalties and so on.) And you'll want to read up on the details for AVX-512 specifically.

嗼ふ静 2025-02-17 13:24:52


As suggested by @Peter Cordes in the comments to the question, the blendvps instruction (_mm_blendv_* intrinsics) is used to perform the interleave/conditional move operation.

It should be noted that the _mm_blendv_* family selects the left-hand elements where the mask contains 0 instead of 0xffffffff, so a and b should be passed in reverse order.

The implementation would then look like this:

const __m128i mask = _mm_setr_epi32(-1, 0, -1, 0);       // e.g. a compare result (all-ones / 0 per element)
const __m128 m_ps = _mm_castsi128_ps(mask);
const __m128 a = _mm_setr_ps(1.0f, 1.1f, 1.2f, 1.3f);
const __m128 b = _mm_setr_ps(2.0f, 2.1f, 2.2f, 2.3f);

#ifdef __SSE4_1__ // _mm_blendv_ps requires SSE4.1 
const __m128 c = _mm_blendv_ps(b, a, m_ps);
#else
const __m128 c = _mm_or_ps(_mm_and_ps(m_ps, a), _mm_andnot_ps(m_ps, b));
#endif
// c contains {1.0, 2.1, 1.2, 2.3}