AVX512BW: executing instructions on the k result of vpcmpgtb
I want to compare zmm vectors and use the result to perform a vpandn.
In AVX2 I do this:
vpcmpgtb ymm0, ymm0, ymm1
vpandn ymm0, ymm0, ymm2
vpxor ymm0, ymm0, ymm3
But in AVX512BW, vpcmpgtb returns its result in a k register.
How should I do the vpandn and then the vpxor in AVX512BW?
vpcmpgtb k0, zmm0, zmm1
vpandn ??
vpxor ??
There are separate instructions for k registers; their mnemonics all start with k so they're easy to find in the table of instructions, like kandnq k0, k0, k1.

As well as kunpck... (concatenate, not interleave), there are kadd/kshift, kor/kand/knot/kxor, and even a kxnor (a handy way to generate all-ones for gather/scatter). Also of course kmov (including to/from memory or GP-integer), and kortest and ktest for branching.

They all come in byte/word/dword/qword sizes for the number of mask bits affected, zero-extending the result. (Without AVX-512BW, on a Xeon Phi, there are only byte and word sizes, since 16 bits covers a ZMM with elements as small as dword. But all mainstream CPUs with AVX-512 have AVX-512BW and thus 64-bit mask registers.)
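For example, a few of these in isolation (a sketch; the registers are illustrative, not from the question):

kxnorq    k1, k1, k1               ; all-ones mask, e.g. handy before a gather/scatter
kandnq    k3, k1, k2               ; k3 = ~k1 & k2
kmovq     rax, k2                  ; copy a mask to a GP-integer register
kortestq  k1, k2                   ; set FLAGS from k1 | k2, for branching (e.g. jz)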
You can sometimes fold that into another operation to avoid needing a separate instruction to combine masks. Either invert the compare so you can use ktest directly to branch, or if you want to mask, use a zero-masked compare-into-mask. (Merge-masked compare/test into a 3rd existing mask is not supported.)

AVX-512 integer compares take a predicate as an immediate, rather than only existing as eq and gt, so you can invert the condition and use an and instead of needing an andn. (They come in signed vpcmpb and unsigned vpcmpub forms, also unlike any previous x86 SIMD extension. So if you'd previously been adding 128 to flip the high bit for pcmpgtb, you don't need that anymore and can just use vpcmpub.) A zero-masked compare is equivalent (except for performance) to a separate compare followed by a kand; both forms are sketched below.
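For example (a sketch; k2 stands in for a second mask you want to combine with, and isn't part of the question's code):

vpcmpb  k1{k2}, zmm0, zmm1, 2      ; zero-masked compare: k1 = k2 & (zmm0 <= zmm1); predicate 2 = LE, the inverse of GT

vpcmpb  k1, zmm0, zmm1, 2          ; the separate-instruction version
kandq   k1, k1, k2                 ; k1 &= k2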
Also equivalent to kandn with a gt compare as the NOTed (first) operand, like in your question.
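A sketch of that kandn form (again with k2 as the stand-in for the other mask):

vpcmpgtb  k1, zmm0, zmm1           ; gt compare, as in the question
kandnq    k1, k1, k2               ; k1 = ~k1 & k2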
The k... mask instructions can usually only run on port 0, which is not great for performance: https://uops.info/. A masked compare (or other instruction) has to wait for the mask register input to be ready before starting to work on the other operands. You might hope it would support late forwarding for masks, since masks are only used at write-back, but IIRC it doesn't. Still, only 1 instruction instead of 2 is still better. Having the first of two instructions able to execute in parallel isn't better unless it was high latency and the mask operation is low latency, and you're latency bound. But often execution-unit throughput is more of a bottleneck when using 512-bit registers. (Since the vector ALUs on port 1 are shut down.)
Some k instructions are only 1-cycle latency on current CPUs, while others are 4-cycle latency. (Like kshift and kunpck, and kadd.)

The intrinsics for these masked compare-into-mask instructions are _mm256_mask_cmp_ep[iu]_mask, with a __mmask8/16/32/64 input operand (as well as two vectors and an immediate predicate) and a mask return value. Like the asm, they use ..._mask_... rather than ..._maskz_..., despite this being zero-masking, not merge-masking.

Applying a mask to a vector
Apparently this question wanted to use the mask with another vector, not just get a mask for vpmovmskb or something. AVX-512 has merge-masking like zmm0{k1} and zero-masking like zmm0{k1}{z} when writing to a vector destination. See the slides from Kirill Yukhin introducing a bunch of AVX-512 features and the asm syntax for them, if you know AVX2 asm but don't already know the basics of AVX-512's new stuff.
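For instance, the same instruction with each kind of masking (a syntax sketch, not code from the question):

vpaddb  zmm0{k1}, zmm2, zmm3       ; merge-masking: bytes where k1 is 0 keep zmm0's old value
vpaddb  zmm0{k1}{z}, zmm2, zmm3    ; zero-masking: bytes where k1 is 0 become 0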
Using 256-bit vectors on an AVX-512 CPU, you can use vpternlogd to replace the last 2 instructions (still using the AVX2 compare-into-vector, as long as you avoid ymm16..31). Unfortunately AVX-512 doesn't have compare-into-vector at all, only into mask. 256-bit vectors can be a good option if your program doesn't spend a lot of its time in SIMD loops, especially on CPUs where the max-turbo penalty is higher for 512-bit vectors. (Not a huge deal with integer vectors; SIMD integer other than multiply is "light", not "heavy".)
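For example, with the question's 256-bit registers (a sketch; 0xA6 is the truth-table immediate for (NOT a AND b) XOR c):

vpcmpgtb   ymm0, ymm0, ymm1           ; AVX2 compare-into-vector
vpternlogd ymm0, ymm2, ymm3, 0xA6     ; ymm0 = (~ymm0 & ymm2) ^ ymm3, replacing vpandn + vpxor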
For 512-bit vectors, we have to use masks. The fully naive drop-in way would be to expand the mask back to a vector with vpmovm2b zmm0, k1 and then vpandnq/vpxorq without masking. Or vpternlogd without masking could still keep the total down to 4 instructions in this case, combining the andn/xor.
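A sketch of that naive version (the compare inputs zmm2/zmm3 are illustrative; zmm4 holds the vpandn operand and zmm1 the vpxor operand, to match the register names used in the rest of this answer):

vpcmpb     k1, zmm2, zmm3, 6          ; k1 = (zmm2 > zmm3) per signed byte; predicate 6 = NLE
vpmovm2b   zmm0, k1                   ; expand the mask to 0 / -1 bytes
vpandnq    zmm0, zmm0, zmm4           ; ~mask & zmm4
vpxorq     zmm0, zmm0, zmm1           ; ... ^ zmm1
; or fold the last two into:  vpternlogd zmm0, zmm4, zmm1, 0xA6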
A zero-masking vmovdqu8 zmm0{k1}{z}, zmm4 is a better way to replace vpandn. Or a blend after the xor, using a mask as the control operand. That would still be 4 instructions that all need an execution unit.

If it were possible, e.g. in a different problem with 32-bit elements¹, a merge-masked XOR would be good (after copying a register unchanged so mov-elimination could work², if you can't destroy zmm1).
But AVX-512 doesn't have byte-masking for bitwise booleans; there's only vpxord and vpxorq, which allow masking in 32 or 64-bit elements. AVX-512BW only added byte/word element-size instructions for vmovdqu, and for instructions that care about element boundaries even without masking, like vpaddb and vpshufb.

Our best bet for instruction-level parallelism is to XOR in parallel with the compare, then fix up that result once the compare mask result is ready.
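A sketch of that, with the compare inputs in zmm2/zmm3 (illustrative) and, matching the register names mentioned below, the vpandn operand in zmm4 and the vpxor operand in zmm1:

vpcmpb    k1, zmm2, zmm3, 6           ; k1 = (zmm2 > zmm3) per signed byte
vpxord    zmm0, zmm4, zmm1            ; zmm4 ^ zmm1, independent of the compare
vmovdqu8  zmm0{k1}, zmm1              ; where the compare is true the andn would zero zmm4, so those bytes become just zmm1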
The final instruction could equally have been vpblendmb zmm0{k1}, zmm0, zmm1 (manual), which differs from a merge-masking vmovdqu8 only in being able to write the blend result to a 3rd register.
Depending on what you're going to do with that vpxord result, you might be able to optimize further into the surrounding code, perhaps with vpternlogd if it's more bitwise booleans. Or perhaps by merge-masking or zero-masking into something else, e.g. perhaps copy zmm1 and do a merge-masked vpaddb into it, instead of doing the blend.

Another worse way, with less instruction-level parallelism, is to use the same order as your AVX2 code (where the more-ILP way would have required a vpblendvb, which is more expensive).
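A sketch of that order, with the same register roles as above (an inverted compare stands in for the vpandn's NOT):

vpcmpb    k1, zmm2, zmm3, 2           ; k1 = (zmm2 <= zmm3), the inverted compare
vmovdqu8  zmm0{k1}{z}, zmm4           ; the vpandn: zmm4 where not-greater, else 0
vpxord    zmm0, zmm0, zmm1            ; the vpxor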
In this, each instruction depends on the result of the previous, so the total latency from k1 being ready to the final zmm0 being ready is 3 cycles instead of 4. (The earlier version can run vpxord in parallel with vpcmpb, assuming ZMM4 is ready early enough.)
Zero-masking (and merge-masking) vmovdqu8 have 3-cycle latency on Skylake-X and Alder Lake (https://uops.info/). Same as vpblendmb, but vmovdqu32 and vmovdqu64 have 1-cycle latency. vpxord has 1-cycle latency even with masking, but vpaddb has 3-cycle latency with masking vs. 1 without. So it seems byte-masking is consistently 3-cycle latency, while dword/qword masking keeps the same latency as the unmasked instruction. Throughput isn't affected, though, so as long as you have enough instruction-level parallelism, out-of-order exec can hide the latency if it's not a long loop-carried dependency chain.

Footnote 1: wider elements allow masked booleans
This is for the benefit of future readers who are using a different element size. You definitely don't want to widen your byte elements to dword if you don't have to; that would get 1/4 the work done per vector, just to save 1 back-end uop via mov-elimination:
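A sketch of what that dword version would look like (same illustrative register roles as above):

vpcmpd     k1, zmm2, zmm3, 2          ; k1 = (zmm2 <= zmm3) per dword, the inverted compare
vmovdqa32  zmm0, zmm1                 ; plain full-register copy; can be mov-eliminated (footnote 2)
vpxord     zmm0{k1}, zmm0, zmm4       ; merge-masked: where not-greater get zmm1 ^ zmm4, elsewhere keep zmm1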
Footnote 2:
vmovdqu8 zmm0, zmm1 doesn't need an execution unit. But vmovdqu8 zmm0{k1}{z}, zmm1 does, and like other 512-bit uops it can only run on port 0 or 5 on current Intel CPUs, including Ice Lake and Alder Lake-P (on systems where its AVX-512 support isn't disabled).

Ice Lake broke mov-elimination only for GP-integer, not vectors, so an exact copy of a register is still cheaper than doing any masking or other work. Only having two SIMD execution ports makes the back-end a more common bottleneck than for code using 256-bit vectors, especially on Ice Lake and later, with the 5-wide front-end in Ice Lake and 6-wide in Alder Lake / Sapphire Rapids.
Most code has significant load/store and integer work, though.