I have several functions used to compare floating-point math vectors, filling an array of booleans with the result of each comparison.
Currently, I am comparing them element by element, but I would like to use SIMD operations to optimize this.
The issue, however, is that Intel intrinsics such as `_mm_cmpeq_ps` return a mask where every element is 32-bit. I am a little lost on how to convert the comparison mask to an array of booleans (guaranteed to be 8-bit).
I could shuffle every element of the SIMD vector and then extract the low elements, but I don't think that would provide an efficiency boost over manual element-by-element comparison.
Is there a way to cast the vector compare mask to a boolean array?
A bitmap is a more efficient way to store it, if you can have the rest of your program use that. (E.g. via Fastest way to unpack 32 bits to a 32 byte SIMD vector, or Is there an inverse instruction to the movemask instruction in Intel AVX2?, if you want to use it with other vectors.)

Or if you can cache-block it and use at most a couple KiB of mask vectors, you could just store the compare results directly for reuse without packing them down (in an array of `alignas(16) int32_t masks[]`, in case you want to access it from scalar code). But only if you can do it with a small footprint in L1d. Or much better, use it on the fly as a mask for another vector operation, so you're not storing/reloading mask data.
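If a bitmap works for your program, one `_mm_movemask_ps` per compare is all it takes. A minimal sketch, assuming equality compares (the function name is mine, for illustration):

```c
#include <immintrin.h>
#include <stdint.h>

// 4 float compare results -> 4 bits of a bitmap, via movmskps.
static inline uint8_t cmp4_to_bitmap(__m128 a, __m128 b)
{
    // _mm_movemask_ps grabs the top bit of each 32-bit element,
    // so bit i of the result is set iff a[i] == b[i].
    return (uint8_t)_mm_movemask_ps(_mm_cmpeq_ps(a, b));
}
```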
**packssdw / packsswb dword compare results down to bytes**

You're correct: if you don't want your elements packed down to single bits, don't use
`_mm_movemask_ps` or `epi8`. Instead, use vector pack instructions.

`cmpps` produces elements of all-zero / all-one bits, i.e. integer 0 (false) or -1 (true). Signed integer pack instructions preserve 0 / -1 values, because both are in range for `int8_t`.¹ To keep the compiler happy, you need `_mm_castps_si128` to reinterpret a `__m128` as a `__m128i`.

This works most efficiently when packing 4 vectors of 4 float compare results each down to one vector of 16 separate bytes. (Or with AVX2, 4 vectors of 8 floats down to 1 vector of 32 bytes, requiring an extra permute at the end because `_mm256_packs_epi32` and so on operate in-lane, as two separate 16-byte pack operations. Probably a `_mm256_permutevar8x32_epi32` `vpermd` with a vector constant as the control operand.)
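A minimal sketch of that 16-float case, assuming SSE2 and equality compares (function name and signature are mine); it leaves the bytes as 0 / -1, which the next paragraph addresses:

```c
#include <immintrin.h>
#include <stdint.h>

// Pack 4 x __m128 compare results (16 floats) into 16 bytes of 0 / -1.
void cmp16_floats_to_bytes(const float *a, const float *b, int8_t *out)
{
    __m128i c0 = _mm_castps_si128(_mm_cmpeq_ps(_mm_loadu_ps(a),      _mm_loadu_ps(b)));
    __m128i c1 = _mm_castps_si128(_mm_cmpeq_ps(_mm_loadu_ps(a + 4),  _mm_loadu_ps(b + 4)));
    __m128i c2 = _mm_castps_si128(_mm_cmpeq_ps(_mm_loadu_ps(a + 8),  _mm_loadu_ps(b + 8)));
    __m128i c3 = _mm_castps_si128(_mm_cmpeq_ps(_mm_loadu_ps(a + 12), _mm_loadu_ps(b + 12)));

    __m128i w01   = _mm_packs_epi32(c0, c1);    // packssdw: 0/-1 dwords -> 0/-1 words
    __m128i w23   = _mm_packs_epi32(c2, c3);
    __m128i bytes = _mm_packs_epi16(w01, w23);  // packsswb: 0/-1 words -> 0/-1 bytes

    _mm_storeu_si128((__m128i *)out, bytes);
}
```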
Getting a 0/1 instead of 0/-1 takes a `_mm_and_si128` or SSSE3 `_mm_abs_epi8`, if you truly need `bool` instead of a zero/non-zero `uint8_t[]` or `int8_t[]`.
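The AVX2 version with the lane fix-up permute and an absolute value to flip -1 to +1 might look like this sketch (same assumptions as above; the `vpermd` constant is my derivation of how the two in-lane packs interleave the four inputs):

```c
#include <immintrin.h>
#include <stdint.h>

// Pack 4 x __m256 compare results (32 floats) into 32 bytes of 0 / 1.
void cmp32_floats_to_bool(const float *a, const float *b, uint8_t *out)
{
    __m256i c0 = _mm256_castps_si256(_mm256_cmp_ps(_mm256_loadu_ps(a),      _mm256_loadu_ps(b),      _CMP_EQ_OQ));
    __m256i c1 = _mm256_castps_si256(_mm256_cmp_ps(_mm256_loadu_ps(a + 8),  _mm256_loadu_ps(b + 8),  _CMP_EQ_OQ));
    __m256i c2 = _mm256_castps_si256(_mm256_cmp_ps(_mm256_loadu_ps(a + 16), _mm256_loadu_ps(b + 16), _CMP_EQ_OQ));
    __m256i c3 = _mm256_castps_si256(_mm256_cmp_ps(_mm256_loadu_ps(a + 24), _mm256_loadu_ps(b + 24), _CMP_EQ_OQ));

    __m256i w01   = _mm256_packs_epi32(c0, c1);    // in-lane packssdw
    __m256i w23   = _mm256_packs_epi32(c2, c3);
    __m256i bytes = _mm256_packs_epi16(w01, w23);  // in-lane packsswb

    // The two in-lane packs leave the 4-byte chunks in source-dword order
    // 0,4,1,5,2,6,3,7; one vpermd restores c0..c3 order.
    bytes = _mm256_permutevar8x32_epi32(bytes, _mm256_setr_epi32(0, 4, 1, 5, 2, 6, 3, 7));

    bytes = _mm256_abs_epi8(bytes);                // vpabsb: -1 -> +1 for real bools
    _mm256_storeu_si256((__m256i *)out, bytes);
}
```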
If you only have a single vector of float, you'd want SSSE3 `_mm_shuffle_epi8` (`pshufb`) to grab 1 byte from each dword, for `_mm_storeu_si32` (beware it was broken in early GCC11 versions, and wasn't even supported before then; now it is supported as a strict-aliasing-safe unaligned store. Otherwise use `_mm_cvtsi128_si32` to `int`, and `memcpy` that to an array of `bool`.)
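A sketch of that single-vector case (SSSE3), using the `_mm_cvtsi128_si32` + `memcpy` fallback for portability:

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

// One __m128 compare result -> 4 bool-valued bytes.
void cmp4_floats_to_bool(__m128 a, __m128 b, uint8_t *out)
{
    __m128i m = _mm_castps_si128(_mm_cmpeq_ps(a, b));      // 0 / -1 per dword
    // pshufb control: low byte of each dword; -1 indices zero the rest.
    __m128i ctrl = _mm_set_epi8(-1, -1, -1, -1, -1, -1, -1, -1,
                                -1, -1, -1, -1, 12,  8,  4,  0);
    __m128i bytes = _mm_shuffle_epi8(m, ctrl);
    bytes = _mm_and_si128(bytes, _mm_set1_epi8(1));        // 0/-1 -> 0/1
    int tmp = _mm_cvtsi128_si32(bytes);                    // grab the low 4 bytes
    memcpy(out, &tmp, 4);                                  // strict-aliasing-safe store
}
```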
**Footnote 1: signed pack instructions are the only good choice**

All pack instructions before AVX-512F `vpmovdb` / `vpmovusdb` do saturation (not truncation), and treat their inputs as signed. This makes unsigned-pack instructions useless; we'd need to mask both inputs first, or they'd saturate `-1` to `0`, not `0xffff` to `0xff`.

`punpcklwd` / `punpckhwd` can interleave 16-bit words from two registers, but only from the low or high half of those registers, so not a great building block. Truncation would also work, but there's only SSSE3 `pshufb`; there are no 2-register shuffles as useful as the `pack...` instructions until AVX-512. (`vpblendw` could interleave halves of dwords in two different input registers.)
Even with AVX-512, `vpmovdb` only has one input register, vs. the `vpack...` instructions that produce a full-width output with elements from two full-width inputs (in 16-byte lanes separately, so you'd still need a `vpermd` at the end to put the 4-byte chunks that came from lanes of 4 floats into the right order).

Of course, AVX-512 using 512-bit vector width can only compare into a mask. This is fantastic for storing a bitmap: just
`vcmpps k1, zmm0, [rsi]` / `kmov [rdi], k1`. But for storing a `bool` array, you'd probably want `kunpck` to concatenate compare results: 2x `kunpckwd` to combine 16-bit masks into 32-bit masks, then `kunpckdq` to make a single 64-bit mask from 64 float compare results. Then use that with a zero-masked `vmovdqu8 zmm0{k1}{z}, zmm1` and store that to memory. (A memory destination only allows merge-masking, not zero-masking.)
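In intrinsics, that 64-float pipeline could look like the following sketch (assuming AVX-512F + AVX-512BW; the function name is mine):

```c
#include <immintrin.h>
#include <stdint.h>

// 64 float compares -> a 64-bit mask via kunpck, then 64 bool bytes.
void cmp64_floats_to_bool(const float *a, const float *b, uint8_t *out)
{
    __mmask16 k0 = _mm512_cmp_ps_mask(_mm512_loadu_ps(a),      _mm512_loadu_ps(b),      _CMP_EQ_OQ);
    __mmask16 k1 = _mm512_cmp_ps_mask(_mm512_loadu_ps(a + 16), _mm512_loadu_ps(b + 16), _CMP_EQ_OQ);
    __mmask16 k2 = _mm512_cmp_ps_mask(_mm512_loadu_ps(a + 32), _mm512_loadu_ps(b + 32), _CMP_EQ_OQ);
    __mmask16 k3 = _mm512_cmp_ps_mask(_mm512_loadu_ps(a + 48), _mm512_loadu_ps(b + 48), _CMP_EQ_OQ);

    __mmask32 lo = _mm512_kunpackw(k1, k0);   // kunpckwd: 16-bit -> 32-bit masks
    __mmask32 hi = _mm512_kunpackw(k3, k2);
    __mmask64 k  = _mm512_kunpackd(hi, lo);   // kunpckdq: one 64-bit mask

    // Zero-masked vmovdqu8: byte i = 1 where bit i of k is set, else 0.
    _mm512_storeu_si512(out, _mm512_maskz_mov_epi8(k, _mm512_set1_epi8(1)));
}
```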
AVX-512 could still potentially be useful with only 256-bit registers (to avoid turbo penalties and so on), although `vpermt2w` / `vpermt2b` aren't single-uop even on Ice Lake.

**Compiler auto-vectorization of scalar source is sub-optimal**
Compilers do auto-vectorize (https://godbolt.org/z/3o58W919Y), but do a rather poor job. Still very likely faster than scalar, especially with AVX2 available.
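For reference, the scalar source being auto-vectorized is essentially of this shape (a minimal sketch; the exact code behind the Godbolt link may differ):

```c
#include <stdbool.h>
#include <stddef.h>

// Element-by-element comparison that compilers try to auto-vectorize.
void cmp_arrays(const float *a, const float *b, bool *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (a[i] == b[i]);
}
```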
clang packs each vector down separately, for 4-byte stores. But the individual packing is decently efficient. With AVX2, it jumps through some extra hoops: `vextractf128` and then packing 8 dwords down to 8 bytes, before shuffling that together with another 8 bools and then `vinserti128` with another 16. So it's eventually storing 32 bytes of bools at a time, but it takes a lot of shuffles to get there.
Per YMM store, clang `-march=haswell` does 4 vextract, 8 packs, 2 vinsert, 1 `vpunpcklqdq`, and 1 `vpermq`, for a total of 16 shuffles, one shuffle per 2 bytes of output. My version does 3 shuffles per 16 bytes of output, or with AVX2, 4 per 32 bytes if you widen everything to `__m256i` / `_mm256...` and add a final shuffle to fix up for lane-crossing. (Plus 4x `vcmpps` and 1x `vpabsb` to flip -1 to +1.)
GCC uses unsigned pack instructions like `packusdw` as the first step, doing `pand` instructions on each input to each pack instruction, plus an unnecessary one between the two steps, because I think it's emulating unsigned->unsigned packing in terms of the SSE4.1 `packusdw` / SSE2 `packuswb` signed->unsigned packs. Even if it's stuck on using unsigned packing, it would be a lot less bad to just mask (or `pabsd`) the value down to 0 or 1, so no further masking is needed before or after the 2 packing steps.

(SSE2 `packssdw` preserves -1 or 0 just fine, without even saturating. It seems GCC isn't keeping track of the limited value-range of compare results, so it doesn't realize it can let the pack instructions just work.)
And without SSE4.1, GCC does even worse. With only SSSE3 it uses some `pshufb` and `por` instructions to feed 2x `pand` / SSE2 `packuswb`.

With word->byte pack instructions all treating their inputs as signed, it makes some sense that Intel omitted the `packusdw` 32 -> 16-bit pack until SSE4.1, since the normal first step for packing dwords to bytes is `packssdw`, even if you eventually want to clamp signed integers to a 0..255 range.