Intel intrinsics: casting vector comparison results to an array of bool

Posted 2025-02-10 01:03:02


I have several functions used to compare floating-point math vectors that fill an array of booleans (with result of each comparison).
Currently, i am comparing them element-by-element, however i would like to use SIMD operations to optimize it.

The issue is, however, that intel intrinsics such as _mm_cmpeq_ps return a mask where every element is 32-bit. I am a little lost on how to convert the comparison mask to an array of booleans (guaranteed to be 8-bit).

I could shuffle every element of the SIMD vector, then extract the low elements, but i dont think that would provide an efficiency boost over manual element-by-element comparison.

Is there a way to cast the vector compare mask to a boolean array?


Comments (1)

不喜欢何必死缠烂打 2025-02-17 01:03:02


A bitmap is a more efficient way to store it, if you can have the rest of your program use that. (e.g. via Fastest way to unpack 32 bits to a 32 byte SIMD vector or is there an inverse instruction to the movemask instruction in intel avx2? if you want to use it with other vectors).
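As a minimal sketch of the bitmap approach (the a == b comparison of two hypothetical arrays is a placeholder for whatever predicate you need), each movmskps contributes 4 bits to one 32-bit word:

#include <immintrin.h>
#include <stdint.h>

// Hypothetical sketch: 32 float compares -> one 32-bit bitmap, 1 bit per result.
uint32_t cmp_bitmap32(const float *a, const float *b)
{
  uint32_t bits = 0;
  for (int i = 0; i < 32; i += 4) {
    __m128 cmp = _mm_cmpeq_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i));
    bits |= (uint32_t)_mm_movemask_ps(cmp) << i;   // movmskps: 4 bits per vector
  }
  return bits;
}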

Or if you can cache-block it and use at most a couple KiB of mask vectors, you could just store the compare results directly for reuse without packing them down. (In an array of alignas(16) int32_t masks[], in case you want to access from scalar code). But only if you can do it with a small footprint in L1d. Or much better, use it on the fly as a mask for another vector operation so you're not storing/reloading mask data.
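A sketch of the on-the-fly case (with a hypothetical select-where-equal operation): the 0 / -1 mask feeds a blend directly and never touches memory.

#include <immintrin.h>

// Hypothetical example: use the compare mask directly as a blend control.
// Returns x where a == b, else y. (blendvps looks at the top bit of each element.)
__m128 select_if_equal(__m128 a, __m128 b, __m128 x, __m128 y)
{
  __m128 mask = _mm_cmpeq_ps(a, b);     // all-ones where a == b, else all-zeros
  return _mm_blendv_ps(y, x, mask);     // SSE4.1 blendvps
}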


packssdw/packsswb dword compare results down to bytes

You're correct, if you don't want your elements packed down to single bits, don't use _mm_movemask_ps or epi8. Instead, use vector pack instructions. cmpps produces elements of all-zero / all-one bits, i.e. integer 0 (false) or -1 (true).

Signed integer pack instructions preserve 0 / -1 values, because both are in range for int8_t.¹

To keep the compiler happy, you need _mm_castps_si128 to reinterpret a __m128 as a __m128i.

This works most efficiently packing 4 vectors of 4 float compare results each down to one vector of 16 separate bytes. (Or with AVX2, 4 vecs of 8 floats down to 1 vec of 32 bytes, requiring an extra permute at the end because _mm256_packs_epi32 and so on operate in-lane, two separate 16-byte pack operations. Probably a _mm256_permutevar8x32_epi32 vpermd with a vector constant as the control operand)

// or  bool *result  if you keep the abs value (_mm_abs_epi8) for 0 / 1 output
void cmp(int8_t *result, const float *a)
{
  __m128 cmp0 = _mm_cmp_ps(...);  // produces integer 0 or -1 elements
  __m128 cmp1 = _mm_cmp_ps(...);
  __m128 cmp2 = _mm_cmp_ps(...);
  __m128 cmp3 = _mm_cmp_ps(...);

   // 2x 32-bit dword -> 16-bit word  with signed saturation - packssdw
  __m128i lo_words = _mm_packs_epi32(_mm_castps_si128(cmp0), _mm_castps_si128(cmp1));
  __m128i hi_words  = _mm_packs_epi32(_mm_castps_si128(cmp2), _mm_castps_si128(cmp3));

  __m128i cmp_bytes = _mm_packs_epi16(lo_words, hi_words);  // packsswb: 0 / -1

 // if necessary create 0 / 1 bools.  If not, just store cmp_bytes
  cmp_bytes = _mm_abs_epi8(cmp_bytes);                        // SSSE3
  //cmp_bytes = _mm_and_si128(cmp_bytes, _mm_set1_epi8(1));   // SSE2

  _mm_storeu_si128((__m128i*)result, cmp_bytes); 
}
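
A possible AVX2 widening of the function above (the loads from hypothetical arrays a and b and the _CMP_EQ_OQ predicate are placeholders; the point is the in-lane packs plus the final vpermd lane fix-up):

#include <immintrin.h>
#include <stdint.h>

// Hypothetical AVX2 sketch: 4 vectors of 8 float compares -> 32 bytes of 0 / 1.
void cmp32_avx2(int8_t *result, const float *a, const float *b)
{
  __m256i cmp0 = _mm256_castps_si256(_mm256_cmp_ps(_mm256_loadu_ps(a +  0), _mm256_loadu_ps(b +  0), _CMP_EQ_OQ));
  __m256i cmp1 = _mm256_castps_si256(_mm256_cmp_ps(_mm256_loadu_ps(a +  8), _mm256_loadu_ps(b +  8), _CMP_EQ_OQ));
  __m256i cmp2 = _mm256_castps_si256(_mm256_cmp_ps(_mm256_loadu_ps(a + 16), _mm256_loadu_ps(b + 16), _CMP_EQ_OQ));
  __m256i cmp3 = _mm256_castps_si256(_mm256_cmp_ps(_mm256_loadu_ps(a + 24), _mm256_loadu_ps(b + 24), _CMP_EQ_OQ));

  __m256i words_lo = _mm256_packs_epi32(cmp0, cmp1);          // vpackssdw, in-lane
  __m256i words_hi = _mm256_packs_epi32(cmp2, cmp3);
  __m256i bytes    = _mm256_packs_epi16(words_lo, words_hi);  // vpacksswb, in-lane

  // vpermd gathers source dwords 0,4,1,5,2,6,3,7 to undo the in-lane packing order
  bytes = _mm256_permutevar8x32_epi32(bytes, _mm256_setr_epi32(0, 4, 1, 5, 2, 6, 3, 7));

  bytes = _mm256_abs_epi8(bytes);                             // 0 / -1  ->  0 / 1
  _mm256_storeu_si256((__m256i*)result, bytes);
}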

Getting a 0/1 instead of 0/-1 takes a _mm_and_si128 or SSSE3 _mm_abs_epi8, if you truly need bool instead of a zero/non-zero uint8_t[] or int8_t[].

If you only have a single vector of float, you'd want SSSE3 _mm_shuffle_epi8 (pshufb) to grab 1 byte from each dword, for _mm_storeu_si32 (beware it was broken in early GCC11 versions, and wasn't even supported before then. But now it is supported as a strict-aliasing-safe unaligned store. Otherwise use _mm_cvtsi128_si32 to int, and memcpy that to an array of bool.)
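A possible sketch of that single-vector case (with hypothetical a and b operands), using pshufb to truncate each dword to its low byte and a 4-byte scalar store via memcpy:

#include <immintrin.h>
#include <stdint.h>
#include <string.h>

// Hypothetical sketch: one vector of 4 float compares -> 4 bytes of 0 / 1.
void cmp4(uint8_t *result, const float *a, const float *b)
{
  __m128i cmp  = _mm_castps_si128(_mm_cmpeq_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
  // SSSE3 pshufb: grab the low byte of each dword; -1 indices zero the upper 12 bytes
  __m128i shuf = _mm_setr_epi8(0, 4, 8, 12, -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1);
  __m128i bytes = _mm_shuffle_epi8(cmp, shuf);
  bytes = _mm_and_si128(bytes, _mm_set1_epi8(1));   // 0 / 1 if you need actual bool values
  int packed = _mm_cvtsi128_si32(bytes);            // low 4 bytes of the vector
  memcpy(result, &packed, 4);                       // strict-aliasing-safe 4-byte store
}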

Footnote 1: signed pack instructions are the only good choice

All pack instructions before AVX-512F vpmovdb / vpmovusdb do saturation (not truncation), and treat their inputs as signed. This makes unsigned-pack instructions useless; we'd need to mask both inputs first or they'd saturate -1 to 0, not 0xffff to 0xff.

punpcklwd / punpckhwd can interleave 16-bit words from two registers, but only from the low or high half of those registers. So not a great building-block.

Truncation would also work, but there's only SSSE3 pshufb, no 2-register shuffles as useful as the pack... instructions until AVX-512. (vpblendw could interleave halves of dwords in two different input registers.)

Even with AVX-512, vpmovdb only has one input register, vs. vpack... instructions that produce a full-width output with elements from two full-width inputs. (In 16-byte lanes separately, so you'd still need a vpermd at the end to put your 4-byte chunks that came from lanes of 4 floats into the right order).

Of course, AVX-512 using 512-bit vector width can only compare-into-mask. This is fantastic for storing a bitmap, just vcmpps k1, zmm0, [rsi] / kmov [rdi], k1. But for storing a bool array, probably you'd want to kunpck to concatenate compare results, with 2x kunpckwd to combine 16-bit to 32-bit masks, then kunpckdq to make a single 64-bit mask from 64 float compare results. Then use that with a zero-masked vmovdqu8 zmm0{k1}{z}, zmm1 and store that to memory. (A memory destination only allows merge-masking, not zero-masking.)
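A possible AVX-512 sketch of that bool-array path (assuming AVX512F + AVX512BW, hypothetical a/b operands, and _CMP_EQ_OQ as the predicate):

#include <immintrin.h>
#include <stdint.h>

// Hypothetical AVX-512 sketch: 64 float compares -> a 64-bit mask -> 64 bytes of 0 / 1.
void cmp64_avx512(uint8_t *result, const float *a, const float *b)
{
  __mmask16 k0 = _mm512_cmp_ps_mask(_mm512_loadu_ps(a +  0), _mm512_loadu_ps(b +  0), _CMP_EQ_OQ);
  __mmask16 k1 = _mm512_cmp_ps_mask(_mm512_loadu_ps(a + 16), _mm512_loadu_ps(b + 16), _CMP_EQ_OQ);
  __mmask16 k2 = _mm512_cmp_ps_mask(_mm512_loadu_ps(a + 32), _mm512_loadu_ps(b + 32), _CMP_EQ_OQ);
  __mmask16 k3 = _mm512_cmp_ps_mask(_mm512_loadu_ps(a + 48), _mm512_loadu_ps(b + 48), _CMP_EQ_OQ);

  __mmask32 lo = _mm512_kunpackw(k1, k0);   // kunpckwd: k1 in the high half, k0 in the low
  __mmask32 hi = _mm512_kunpackw(k3, k2);
  __mmask64 k  = _mm512_kunpackd(hi, lo);   // kunpckdq: one bit per float, in order

  // Zero-masked vmovdqu8 of all-ones bytes: 1 where the mask bit is set, 0 elsewhere
  __m512i bools = _mm512_maskz_mov_epi8(k, _mm512_set1_epi8(1));
  _mm512_storeu_si512((__m512i*)result, bools);
}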

AVX-512 could still be potentially useful with only 256-bit registers (to avoid turbo penalties and so on), although vpermt2w / vpermt2b aren't single-uop even on Ice Lake.


Compiler auto-vectorization of scalar source is sub-optimal

Compilers do auto-vectorize (https://godbolt.org/z/3o58W919Y), but do a rather poor job. Still very likely faster than scalar, especially with AVX2 available.
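For reference, the scalar source being auto-vectorized is roughly of this shape (names are hypothetical, not the exact Godbolt source):

#include <stdbool.h>
#include <stddef.h>

// Hypothetical scalar loop: compilers turn this into packed compares plus packing.
void cmp_scalar(bool *result, const float *a, const float *b, size_t n)
{
  for (size_t i = 0; i < n; i++)
    result[i] = (a[i] == b[i]);
}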

clang packs each vector down separately, for 4-byte stores. But the individual packing is decently efficient. With AVX2, it jumps through some extra hoops, vextractf128 and then packing 8 dwords down to 8 bytes, before shuffling that together with another 8 bools and then vinserti128 with another 16. So it's eventually storing 32 bytes at a time of bools, but takes a lot of shuffles to get there.

Per YMM store, clang -march=haswell does 4 vextract, 8 packs, 2 vinsert, 1 vpunpcklqdq, 1 vpermq for a total of 16 shuffles, one shuffle per 2 bytes of output. My version does 3 shuffles per 16 bytes of output, or with AVX2, 4 per 32 bytes if you widen everything to __m256i / _mm256... and add a final shuffle to fix up for lane-crossing. (Plus 4x vcmpps and 1x vpabsb to flip -1 to +1.)

GCC uses unsigned pack instructions like packusdw as the first step, doing pand instructions on each input to each pack instruction, and also an unnecessary one between the two steps, because I think it's emulating unsigned->unsigned packing in terms of the SSE4.1 packusdw / SSE2 packuswb signed->unsigned packs. Even if it's stuck on using unsigned packing, it would be a lot less bad to just mask (or pabsd) the value down to 0 or 1 so no further masking is needed before or after the 2 packing steps.

(SSE2 packssdw preserves -1 or 0 just fine, without even saturating. Seems GCC isn't keeping track of the limited value-range of compare results, so doesn't realize it can let the pack instructions just work.)

And without SSE4.1, GCC does even worse. With only SSSE3 it uses some pshufb and por instructions to feed 2x pand / SSE2 packuswb.

With word->byte pack instructions all treating their inputs as signed, it makes some sense that Intel omitted packusdw 32 -> 16-bit pack until SSE4.1, since the normal first step for packing dwords to bytes is packssdw, even if you eventually want to clamp signed integers to a 0..255 range.
