Essentially I am trying to implement a ternary-like operation on 2 SSE (__m128) vectors. The mask is another __m128 vector obtained from _mm_cmplt_ps. What I want to achieve is to select the element of vector a when the corresponding element of the mask is 0xffff'ffff, and the element of b when the mask's element is 0.
Example of the desired operation (in semi-pseudocode):
const __m128i mask = {0xffffffff, 0, 0xffffffff, 0}; // e.g. a compare result
const __m128 a = {1.0, 1.1, 1.2, 1.3};
const __m128 b = {2.0, 2.1, 2.2, 2.3};
const __m128 c = interleave(a, b, mask); // c contains {1.0, 2.1, 1.2, 2.3}
I am having trouble implementing this operation in SIMD (SSE) intrinsics.
My original idea was to mix a and b using moves and then shuffle the elements using the mask, however _mm_shuffle_ps takes an int mask consisting of 2-bit indices, not an __m128 mask.
Another idea was to use something akin to a conditional move, but there does not seem to be a conditional move in SSE (or at least I did not manage to find it in Intel's guide).
How is this normally done in SSE?
That's called a "blend".
Intel's intrinsics guide groups blend instructions under the "swizzle" category, along with shuffles.
You're looking for SSE4.1 blendvps (intrinsic _mm_blendv_ps). The other element sizes are _mm_blendv_pd and _mm_blendv_epi8. These use the high bit of the corresponding element as the control, so you can use a float directly (without _mm_cmp_ps) if its sign bit is interesting.
Note that I reversed a, b to b, a because SSE blends take the element from the 2nd operand in positions where the mask was set. Like a conditional-move which copies when the condition is true. If you name your constants / variables accordingly, you can write blend(a, b, mask) instead of having them backwards. Or give them meaningful names like ones and twos.
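For instance, a minimal sketch applied to the vectors from the question (assuming the SSE4.1 header; the cast is only needed because the question's example declares the mask as __m128i):
#include <smmintrin.h>  // SSE4.1
// blendv takes from the 2nd data operand where the mask's sign bit is set,
// so pass b first and a second to get "mask ? a : b".
const __m128 c = _mm_blendv_ps(b, a, _mm_castsi128_ps(mask));  // {1.0, 2.1, 1.2, 2.3}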
In other cases where your control operand is a constant, there's also _mm_blend_ps / pd / _mm_blend_epi16 (an 8-bit immediate operand can only control 8 separate elements, so 8x 2-byte).
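For instance, the pattern from the question (a in lanes 0 and 2, b in lanes 1 and 3) is a compile-time constant, so a sketch with the immediate form could be:
// Immediate bit j set selects b's element j, clear selects a's element.
const __m128 c = _mm_blend_ps(a, b, 0xA);  // 0b1010 -> {a0, b1, a2, b3} = {1.0, 2.1, 1.2, 2.3}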
Performance
blendps xmm, xmm, imm8 is a single-uop instruction for any vector ALU port on Intel CPUs, as cheap as andps (https://uops.info/). pblendw is also single-uop, but only runs on port 5 on Intel, competing with shuffles. AVX2 vpblendd blends with dword granularity, an integer version of vblendps, and with the same very good efficiency. (It's an integer-SIMD instruction; unlike shuffles, blends have extra bypass latency on Intel CPUs if you mix integer and FP SIMD.)
But variable blendvps is 2 uops on Intel before Skylake (and only for port 5). And the AVX version (vblendvps) is unfortunately still 2 uops on Intel (3 on Alder Lake-P, 4 on Alder Lake-E), although the uops can at least run on any of 3 vector ALU ports.
The vblendvps version is funky in asm because it has 4 operands, not overwriting any of the input registers. (The non-AVX version overwrites one input, and uses XMM0 implicitly as the mask input.) Intel uops apparently can't handle 4 separate registers, only 3 for stuff like FMA, adc, and cmov. (AVX-512 vpternlogd can do a bitwise blend as a single uop.)
AMD has fully efficient handling of vblendvps: a single uop (except for YMM on Zen 1) with 2/clock throughput.
Without SSE4.1, you can emulate with ANDN/AND/OR
(x & ~mask) | (y & mask) is equivalent to _mm_blendv_ps(x, y, mask), except it's pure bitwise so all the bits of each mask element should match the top bit (e.g. a compare result, or broadcast the top bit with _mm_srai_epi32(mask, 31)). Compilers know this trick and will use it when auto-vectorizing scalar code if you compile without any arch options like -march=haswell or whatever. (SSE4.1 was new in 2nd-gen Core 2, so it's increasingly widespread but not universal.)
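A sketch of that SSE2-compatible fallback, assuming each mask element is all-ones or all-zero (e.g. straight from a compare):
#include <xmmintrin.h>  // SSE
// mask ? y : x using only bitwise ops: (x & ~mask) | (y & mask)
static inline __m128 blendv_emulated(__m128 x, __m128 y, __m128 mask)
{
    return _mm_or_ps(_mm_andnot_ps(mask, x),   // x & ~mask
                     _mm_and_ps(mask, y));     // y & mask
}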
For constant / loop-invariant a^b without SSE4.1, x ^ ((x ^ y) & mask) saves one operation if you can reuse x ^ y. (Suggested in comments by Aki.) Otherwise this is worse: longer critical-path latency and no instruction-level parallelism.
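As a sketch, with x ^ y hoisted out of a loop (names here are illustrative):
// Computed once, outside the loop:
const __m128 x_xor_y = _mm_xor_ps(x, y);
// Per iteration: x ^ ((x ^ y) & mask) == (mask ? y : x)
__m128 blended = _mm_xor_ps(x, _mm_and_ps(x_xor_y, mask));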
Without AVX non-destructive 3-operand instructions, this way would need a movaps xmm,xmm register-copy to save b, but it can choose to destroy the mask instead of a. The AND/ANDN/OR way would normally destroy its 2nd operand, the one you use with y & mask, and destroy the mask with ANDN (~mask & x).
With AVX, vblendvps is guaranteed available. Although if you're targeting Intel (especially Haswell) and don't care about AMD, you might still choose an AND/XOR if a^b can be pre-computed.
Blending with 0: just AND[N]
(Applies to integer and FP; the bit-pattern for 0.0f and 0.0 is all-zeros, same as integer 0.)
You don't need to copy a zero from anywhere, just x & mask, or x & ~mask.
(The (x & ~mask) | (y & mask) expression reduces to this for x=0 or y=0; that term becomes zero, and z |= 0 is a no-op.)
For example, to implement x = mask ? x+y : x, which would put the latency of an add and blend on the critical path, you simplify to x += (y or zero according to mask), i.e. x += y & mask; Or to do the opposite, x += ~mask & y using _mm_andn_ps(mask, vy).
This has an ADD and an AND operation (so already cheaper than blend on some CPUs, and you don't need a 0.0 source operand in another register). Also, the dependency chain through x now only includes the += operation, if you were doing this in a loop with loop-carried x but independent y & mask. E.g. summing only matching elements of an array: sum += A[i] >= thresh ? A[i] : 0.0f;
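A sketch of that loop vectorized with SSE (assumes n is a multiple of 4; names are illustrative):
#include <stddef.h>
#include <xmmintrin.h>
static float sum_if_ge(const float *A, size_t n, float thresh)
{
    __m128 vsum = _mm_setzero_ps();
    __m128 vthresh = _mm_set1_ps(thresh);
    for (size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_loadu_ps(&A[i]);
        __m128 mask = _mm_cmpge_ps(v, vthresh);        // all-ones where A[i] >= thresh
        vsum = _mm_add_ps(vsum, _mm_and_ps(v, mask));  // add A[i] or 0.0f
    }
    float lane[4];
    _mm_storeu_ps(lane, vsum);
    return lane[0] + lane[1] + lane[2] + lane[3];      // horizontal sum at the end
}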
For an example of an extra slowdown due to lengthening the critical path unnecessarily, see gcc optimization flag -O3 makes code slower than -O2, where GCC's scalar asm using cmov has that flaw, doing cmov as part of the loop-carried dependency chain instead of to prepare a 0 or arr[i] input for it.
Clamping to a MIN or MAX
If you want something like a < upper ? a : upper, you can do that clamping in one instruction with _mm_min_ps instead of cmpps/blendvps. (Similarly _mm_max_ps, and _mm_min_pd/_mm_max_pd.) See What is the instruction that gives branchless FP min and max on x86? for details on their exact semantics, including a longstanding (but recently fixed) GCC bug where the FP intrinsics didn't provide the expected strict-FP semantics of which operand would be the one to keep if one was NaN.
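E.g. a branchless clamp of a into [lower, upper] (names illustrative; see the linked Q&A for the NaN caveats):
__m128 clamped = _mm_min_ps(_mm_max_ps(a, lower), upper);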
Or for integer, SSE2 is highly non-orthogonal (signed min/max for int16_t, unsigned min/max for uint8_t). Similar for saturating pack instructions. SSE4.1 fills in the missing operand-size and signedness combinations:
_mm_max_epi16 (and corresponding mins for all of these)
_mm_max_epi32 / _mm_max_epi8; AVX-512 _mm_max_epi64
_mm_max_epu8
_mm_max_epu16 / _mm_max_epu32; AVX-512 _mm_max_epu64
AVX-512 makes masking/blending a first-class operation
AVX-512 compares into a mask register, k0..k7 (intrinsic types __mmask16 and so on). Merge-masking or zero-masking can be part of most ALU instructions. There is also a dedicated blend instruction that blends according to a mask.
I won't go into the details here, suffice it to say if you have a lot of conditional stuff to do, AVX-512 is great (even if you only use 256-bit vectors to avoid the turbo clock speed penalties and so on). And you'll want to read up on the details for AVX-512 specifically.
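For illustration only, a rough sketch of the same select with AVX-512VL intrinsics on 128-bit vectors (not something you need for this problem):
#include <immintrin.h>
// Compare into a k register, then blend: take a where the bit is set, b elsewhere.
static inline __m128 select_avx512(__m128 a, __m128 b, __m128 x, __m128 y)
{
    __mmask8 k = _mm_cmp_ps_mask(x, y, _CMP_LT_OQ);  // bit set where x < y
    return _mm_mask_blend_ps(k, b, a);
    // Merge-masking folds the select into another op, e.g. _mm_mask_add_ps(x, k, x, y)
    // computes k ? x+y : x in one instruction.
}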
As suggested by @Peter Cordes in the comments to the question, the blendvps instruction (_mm_blendv_* intrinsics) is used to perform the interleave/conditional-move operation.
It should be noted that the _mm_blendv_* family selects the left-hand elements if the mask contains 0 instead of 0xffffffff, thus a and b should be passed in reverse order.
The implementation then would look like this:
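A minimal sketch (the wrapper name follows the question's pseudocode):
#include <smmintrin.h>  // SSE4.1
// Select a where mask is all-ones, b where mask is 0 -- note the reversed order.
static inline __m128 interleave(__m128 a, __m128 b, __m128 mask)
{
    return _mm_blendv_ps(b, a, mask);
}
// Usage, with the mask coming from a compare as in the question:
// __m128 mask = _mm_cmplt_ps(x, y);
// __m128 c = interleave(a, b, mask);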