Essentially I am trying to implement a ternary-like operation on 2 SSE (__m128) vectors. The mask is another __m128 vector obtained from _mm_cmplt_ps. What I want to achieve is to select the element of vector a when the corresponding element of the mask is 0xffff'ffff, and the element of b when the mask's element is 0.
Example of the desired operation (in semi-pseudocode):
const __m128i mask = {0xffffffff, 0, 0xffffffff, 0}; // e.g. a compare result
const __m128 a = {1.0, 1.1, 1.2, 1.3};
const __m128 b = {2.0, 2.1, 2.2, 2.3};
const __m128 c = interleave(a, b, mask); // c contains {1.0, 2.1, 1.2, 2.3}
I am having trouble implementing this operation in SIMD (SSE) intrinsics.
My original idea was to mix a and b using moves and then shuffle the elements using the mask, however _mm_shuffle_ps takes an int mask consisting of 2-bit indices, not an __m128 mask.
Another idea was to use something akin to a conditional move, but there does not seem to be a conditional move in SSE (or at least I did not manage to find it in Intel's guide).
How is this normally done in SSE?
That's called a "blend".
Intel's intrinsics guide groups blend instructions under the "swizzle" category, along with shuffles.
You're looking for SSE4.1 blendvps (intrinsic _mm_blendv_ps). The other element sizes are _mm_blendv_pd and _mm_blendv_epi8. These use the high bit of the corresponding element as the control, so you can use a float directly (without _mm_cmp_ps) if its sign bit is interesting.
Note that I reversed a, b to b, a because SSE blends take the element from the 2nd operand in positions where the mask was set. Like a conditional-move which copies when the condition is true. If you name your constants / variables accordingly, you can write blend(a, b, mask) instead of having them backwards. Or give them meaningful names like ones and twos.
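For instance, a minimal sketch applied to the vectors from the question (assuming the SSE4.1 header; the cast is only needed because the question's example declares the mask as __m128i):
#include <smmintrin.h>  // SSE4.1
// blendv takes from the 2nd data operand where the mask's sign bit is set,
// so pass b first and a second to get "mask ? a : b".
const __m128 c = _mm_blendv_ps(b, a, _mm_castsi128_ps(mask));  // {1.0, 2.1, 1.2, 2.3}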
In other cases where your control operand is a constant, there's also _mm_blend_ps / pd / _mm_blend_epi16 (an 8-bit immediate operand can only control 8 separate elements, so 8x 2-byte).
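For instance, the pattern from the question (a in lanes 0 and 2, b in lanes 1 and 3) is a compile-time constant, so a sketch with the immediate form could be:
// Immediate bit j set selects b's element j, clear selects a's element.
const __m128 c = _mm_blend_ps(a, b, 0xA);  // 0b1010 -> {a0, b1, a2, b3} = {1.0, 2.1, 1.2, 2.3}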
Performance
blendps xmm, xmm, imm8 is a single-uop instruction for any vector ALU port on Intel CPUs, as cheap as andps (https://uops.info/). pblendw is also single-uop, but only runs on port 5 on Intel, competing with shuffles. AVX2 vpblendd blends with dword granularity, an integer version of vblendps, and with the same very good efficiency. (It's an integer-SIMD instruction; unlike shuffles, blends have extra bypass latency on Intel CPUs if you mix integer and FP SIMD.)
But variable blendvps is 2 uops on Intel before Skylake (and only for port 5). And the AVX version (vblendvps) is unfortunately still 2 uops on Intel (3 on Alder Lake-P, 4 on Alder Lake-E), although the uops can at least run on any of 3 vector ALU ports.
The vblendvps version is funky in asm because it has 4 operands, not overwriting any of the input registers. (The non-AVX version overwrites one input, and uses XMM0 implicitly as the mask input.) Intel uops apparently can't handle 4 separate registers, only 3 for stuff like FMA, adc, and cmov. (AVX-512 vpternlogd can do a bitwise blend as a single uop.)
AMD has fully efficient handling of vblendvps: a single uop (except for YMM on Zen 1) with 2/clock throughput.
Without SSE4.1, you can emulate with ANDN/AND/OR
(x & ~mask) | (y & mask) is equivalent to _mm_blendv_ps(x, y, mask), except it's pure bitwise so all the bits of each mask element should match the top bit (e.g. a compare result, or broadcast the top bit with _mm_srai_epi32(mask, 31)). Compilers know this trick and will use it when auto-vectorizing scalar code if you compile without any arch options like -march=haswell or whatever. (SSE4.1 was new in 2nd-gen Core 2, so it's increasingly widespread but not universal.)
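A sketch of that SSE2-compatible fallback, assuming each mask element is all-ones or all-zero (e.g. straight from a compare):
#include <xmmintrin.h>  // SSE
// mask ? y : x using only bitwise ops: (x & ~mask) | (y & mask)
static inline __m128 blendv_emulated(__m128 x, __m128 y, __m128 mask)
{
    return _mm_or_ps(_mm_andnot_ps(mask, x),   // x & ~mask
                     _mm_and_ps(mask, y));     // y & mask
}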
For constant / loop-invariant a^b without SSE4.1, x ^ ((x ^ y) & mask) saves one operation if you can reuse x ^ y. (Suggested in comments by Aki.) Otherwise this is worse: longer critical-path latency and no instruction-level parallelism.
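As a sketch, with x ^ y hoisted out of a loop (names here are illustrative):
// Computed once, outside the loop:
const __m128 x_xor_y = _mm_xor_ps(x, y);
// Per iteration: x ^ ((x ^ y) & mask) == (mask ? y : x)
__m128 blended = _mm_xor_ps(x, _mm_and_ps(x_xor_y, mask));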
Without AVX non-destructive 3-operand instructions, this way would need a movaps xmm,xmm register-copy to save b, but it can choose to destroy the mask instead of a. The AND/ANDN/OR way would normally destroy its 2nd operand, the one you use with y & mask, and destroy the mask with ANDN (~mask & x).
With AVX, vblendvps is guaranteed available. Although if you're targeting Intel (especially Haswell) and don't care about AMD, you might still choose an AND/XOR if a^b can be pre-computed.
Blending with 0: just AND[N]
(Applies to integer and FP; the bit-pattern for 0.0f and 0.0 is all-zeros, same as integer 0.)
You don't need to copy a zero from anywhere, just x & mask, or x & ~mask.
(The (x & ~mask) | (y & mask) expression reduces to this for x=0 or y=0; that term becomes zero, and z |= 0 is a no-op.)
For example, to implement x = mask ? x+y : x, which would put the latency of an add and blend on the critical path, you simplify to x += (y or zero according to mask), i.e. x += y & mask; Or to do the opposite, x += ~mask & y using _mm_andn_ps(mask, vy).
This has an ADD and an AND operation (so already cheaper than blend on some CPUs, and you don't need a 0.0 source operand in another register). Also, the dependency chain through x now only includes the += operation, if you were doing this in a loop with loop-carried x but independent y & mask. E.g. summing only matching elements of an array: sum += A[i] >= thresh ? A[i] : 0.0f;
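A sketch of that loop vectorized with SSE (assumes n is a multiple of 4; names are illustrative):
#include <stddef.h>
#include <xmmintrin.h>
static float sum_if_ge(const float *A, size_t n, float thresh)
{
    __m128 vsum = _mm_setzero_ps();
    __m128 vthresh = _mm_set1_ps(thresh);
    for (size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_loadu_ps(&A[i]);
        __m128 mask = _mm_cmpge_ps(v, vthresh);        // all-ones where A[i] >= thresh
        vsum = _mm_add_ps(vsum, _mm_and_ps(v, mask));  // add A[i] or 0.0f
    }
    float lane[4];
    _mm_storeu_ps(lane, vsum);
    return lane[0] + lane[1] + lane[2] + lane[3];      // horizontal sum at the end
}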
For an example of an extra slowdown due to lengthening the critical path unnecessarily, see gcc optimization flag -O3 makes code slower than -O2, where GCC's scalar asm using cmov has that flaw, doing cmov as part of the loop-carried dependency chain instead of to prepare a 0 or arr[i] input for it.
Clamping to a MIN or MAX
If you want something like a < upper ? a : upper, you can do that clamping in one instruction with _mm_min_ps instead of cmpps/blendvps. (Similarly _mm_max_ps, and _mm_min_pd/_mm_max_pd.) See What is the instruction that gives branchless FP min and max on x86? for details on their exact semantics, including a longstanding (but recently fixed) GCC bug where the FP intrinsics didn't provide the expected strict-FP semantics of which operand would be the one to keep if one was NaN.
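E.g. a branchless clamp of a into [lower, upper] (names illustrative; see the linked Q&A for the NaN caveats):
__m128 clamped = _mm_min_ps(_mm_max_ps(a, lower), upper);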
Or for integer, SSE2 is highly non-orthogonal (signed min/max for int16_t, unsigned min/max for uint8_t). Similar for saturating pack instructions. SSE4.1 fills in the missing operand-size and signedness combinations:
_mm_max_epi16 (and corresponding mins for all of these)
_mm_max_epi32 / _mm_max_epi8; AVX-512 _mm_max_epi64
_mm_max_epu8
_mm_max_epu16 / _mm_max_epu32; AVX-512 _mm_max_epu64
AVX-512 makes masking/blending a first-class operation
AVX-512 compares into a mask register, k0..k7 (intrinsic types __mmask16 and so on). Merge-masking or zero-masking can be part of most ALU instructions. There is also a dedicated blend instruction that blends according to a mask.
I won't go into the details here, suffice it to say if you have a lot of conditional stuff to do, AVX-512 is great (even if you only use 256-bit vectors to avoid the turbo clock speed penalties and so on). And you'll want to read up on the details for AVX-512 specifically.
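For illustration only, a rough sketch of the same select with AVX-512VL intrinsics on 128-bit vectors (not something you need for this problem):
#include <immintrin.h>
// Compare into a k register, then blend: take a where the bit is set, b elsewhere.
static inline __m128 select_avx512(__m128 a, __m128 b, __m128 x, __m128 y)
{
    __mmask8 k = _mm_cmp_ps_mask(x, y, _CMP_LT_OQ);  // bit set where x < y
    return _mm_mask_blend_ps(k, b, a);
    // Merge-masking folds the select into another op, e.g. _mm_mask_add_ps(x, k, x, y)
    // computes k ? x+y : x in one instruction.
}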
As suggested by @Peter Cordes in the comments to the question, the blendvps instruction (_mm_blendv_* intrinsics) is used to perform the interleave/conditional-move operation.
It should be noted that the _mm_blendv_* family selects the left-hand elements if the mask contains 0 instead of 0xffffffff, thus a and b should be passed in reverse order.
The implementation then would look like this:
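A minimal sketch (the wrapper name follows the question's pseudocode):
#include <smmintrin.h>  // SSE4.1
// Select a where mask is all-ones, b where mask is 0 -- note the reversed order.
static inline __m128 interleave(__m128 a, __m128 b, __m128 mask)
{
    return _mm_blendv_ps(b, a, mask);
}
// Usage, with the mask coming from a compare as in the question:
// __m128 mask = _mm_cmplt_ps(x, y);
// __m128 c = interleave(a, b, mask);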