AVX512BW: executing instructions on the k result of vpcmpgtb
I want to compare zmm vectors and use the result to perform a vpandn.
In AVX2 I do this:
vpcmpgtb ymm0, ymm0, ymm1
vpandn ymm0, ymm0, ymm2
vpxor ymm0, ymm0, ymm3
But in AVX512BW, vpcmpgtb returns its result in a k register.
How should I do the vpandn and then the vpxor in AVX512BW?
vpcmpgtb k0, zmm0, zmm1
vpandn ??
vpxor ??
There are separate instructions for k registers; their mnemonics all start with k so they're easy to find in the table of instructions, like kandnq k0, k0, k1.

As well as kunpck... (concatenate, not interleave), there are kadd/kshift, kor/kand/knot/kxor, and even a kxnor (a handy way to generate all-ones for gather/scatter). Also of course kmov (including to/from memory or GP-integer), and kortest and ktest for branching.

They all come in byte/word/dword/qword sizes for the number of mask bits affected, zero-extending the result. (Without AVX-512BW, on a Xeon Phi, there are only byte and word sizes, since 16 bits covers a ZMM with elements as small as dword. But all mainstream CPUs with AVX-512 have AVX-512BW and thus 64-bit mask registers.)
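For example, a few of these in isolation (a sketch; the registers are illustrative, not from the question):

kxnorq    k1, k1, k1               ; all-ones mask, e.g. handy before a gather/scatter
kandnq    k3, k1, k2               ; k3 = ~k1 & k2
kmovq     rax, k2                  ; copy a mask to a GP-integer register
kortestq  k1, k2                   ; set FLAGS from k1 | k2, for branching (e.g. jz)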
You can sometimes fold that into another operation to avoid needing a separate instruction to combine masks. Either invert the compare so you can use ktest directly to branch, or if you want to mask, use a zero-masked compare-into-mask. (Merge-masked compare/test into a 3rd existing mask is not supported.)

AVX-512 integer compares take a predicate as an immediate, rather than only existing as eq and gt, so you can invert the condition and use an and instead of needing an andn. (They come in signed vpcmpb and unsigned vpcmpub forms, also unlike any previous x86 SIMD extension. So if you'd previously been adding 128 to flip the high bit for pcmpgtb, you don't need that anymore and can just use vpcmpub.) A zero-masked compare is equivalent (except for performance) to a separate compare followed by a kand; both forms are sketched below.
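For example (a sketch; k2 stands in for a second mask you want to combine with, and isn't part of the question's code):

vpcmpb  k1{k2}, zmm0, zmm1, 2      ; zero-masked compare: k1 = k2 & (zmm0 <= zmm1); predicate 2 = LE, the inverse of GT

vpcmpb  k1, zmm0, zmm1, 2          ; the separate-instruction version
kandq   k1, k1, k2                 ; k1 &= k2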
Also equivalent to kandn with a gt compare as the NOTed (first) operand, like in your question.
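A sketch of that kandn form (again with k2 as the stand-in for the other mask):

vpcmpgtb  k1, zmm0, zmm1           ; gt compare, as in the question
kandnq    k1, k1, k2               ; k1 = ~k1 & k2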
The k... mask instructions can usually only run on port 0, which is not great for performance: https://uops.info/. A masked compare (or other instruction) has to wait for the mask register input to be ready before starting to work on the other operands. You might hope it would support late forwarding for masks, since masks are only used at write-back, but IIRC it doesn't. Still, only 1 instruction instead of 2 is still better. Having the first of two instructions able to execute in parallel isn't better unless it was high latency and the mask operation is low latency, and you're latency bound. But often execution-unit throughput is more of a bottleneck when using 512-bit registers. (Since the vector ALUs on port 1 are shut down.)
Some k instructions are only 1-cycle latency on current CPUs, while others are 4-cycle latency. (Like kshift and kunpck, and kadd.)

The intrinsics for these masked compare-into-mask instructions are _mm256_mask_cmp_ep[iu]_mask, with a __mmask8/16/32/64 input operand (as well as two vectors and an immediate predicate) and a mask return value. Like the asm, they use ..._mask_... rather than ..._maskz_..., despite this being zero-masking, not merge-masking.

Applying a mask to a vector
Apparently this question wanted to use the mask with another vector, not just get a mask for vpmovmskb or something. AVX-512 has merge-masking like zmm0{k1} and zero-masking like zmm0{k1}{z} when writing to a vector destination. See the slides from Kirill Yukhin introducing a bunch of AVX-512 features and the asm syntax for them, if you know AVX2 asm but don't already know the basics of AVX-512's new stuff.
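For instance, the same instruction with each kind of masking (a syntax sketch, not code from the question):

vpaddb  zmm0{k1}, zmm2, zmm3       ; merge-masking: bytes where k1 is 0 keep zmm0's old value
vpaddb  zmm0{k1}{z}, zmm2, zmm3    ; zero-masking: bytes where k1 is 0 become 0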
Using 256-bit vectors on an AVX-512 CPU, you can use vpternlogd to replace the last 2 instructions (still using the AVX2 compare-into-vector, as long as you avoid ymm16..31). Unfortunately AVX-512 doesn't have compare-into-vector at all, only into mask. 256-bit vectors can be a good option if your program doesn't spend a lot of its time in SIMD loops, especially on CPUs where the max-turbo penalty is higher for 512-bit vectors. (Not a huge deal with integer vectors; SIMD integer other than multiply is "light", not "heavy".)
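For example, with the question's 256-bit registers (a sketch; 0xA6 is the truth-table immediate for (NOT a AND b) XOR c):

vpcmpgtb   ymm0, ymm0, ymm1           ; AVX2 compare-into-vector
vpternlogd ymm0, ymm2, ymm3, 0xA6     ; ymm0 = (~ymm0 & ymm2) ^ ymm3, replacing vpandn + vpxor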
For 512-bit vectors, we have to use masks. The fully naive drop-in way would be to expand the mask back to a vector with vpmovm2b zmm0, k1 and then vpandnq/vpxorq without masking. Or vpternlogd without masking could still keep the total down to 4 instructions in this case, combining the andn/xor.
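A sketch of that naive version (the compare inputs zmm2/zmm3 are illustrative; zmm4 holds the vpandn operand and zmm1 the vpxor operand, to match the register names used in the rest of this answer):

vpcmpb     k1, zmm2, zmm3, 6          ; k1 = (zmm2 > zmm3) per signed byte; predicate 6 = NLE
vpmovm2b   zmm0, k1                   ; expand the mask to 0 / -1 bytes
vpandnq    zmm0, zmm0, zmm4           ; ~mask & zmm4
vpxorq     zmm0, zmm0, zmm1           ; ... ^ zmm1
; or fold the last two into:  vpternlogd zmm0, zmm4, zmm1, 0xA6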
A zero-masking vmovdqu8 zmm0{k1}{z}, zmm4 is a better way to replace vpandn. Or a blend after the xor, using a mask as the control operand. That would still be 4 instructions that all need an execution unit.

If it were possible, e.g. in a different problem with 32-bit elements¹, a merge-masked XOR would be good (after copying a register unchanged so mov-elimination could work², if you can't destroy zmm1).
But AVX-512 doesn't have byte-masking for bitwise booleans; there's only vpxord and vpxorq, which allow masking in 32 or 64-bit elements. AVX-512BW only added byte/word element-size instructions for vmovdqu, and for instructions that care about element boundaries even without masking, like vpaddb and vpshufb.

Our best bet for instruction-level parallelism is to XOR in parallel with the compare, then fix up that result once the compare mask result is ready.
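A sketch of that, with the compare inputs in zmm2/zmm3 (illustrative) and, matching the register names mentioned below, the vpandn operand in zmm4 and the vpxor operand in zmm1:

vpcmpb    k1, zmm2, zmm3, 6           ; k1 = (zmm2 > zmm3) per signed byte
vpxord    zmm0, zmm4, zmm1            ; zmm4 ^ zmm1, independent of the compare
vmovdqu8  zmm0{k1}, zmm1              ; where the compare is true the andn would zero zmm4, so those bytes become just zmm1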
The final instruction could equally have been vpblendmb zmm0{k1}, zmm0, zmm1 (manual), which differs from a merge-masking vmovdqu8 only in being able to write the blend result to a 3rd register.
Depending on what you're going to do with that vpxord result, you might be able to optimize further into the surrounding code, perhaps with vpternlogd if it's more bitwise booleans. Or perhaps by merge-masking or zero-masking into something else, e.g. perhaps copy zmm1 and do a merge-masked vpaddb into it, instead of doing the blend.

Another worse way, with less instruction-level parallelism, is to use the same order as your AVX2 code (where the more-ILP way would have required a vpblendvb, which is more expensive).
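A sketch of that order, with the same register roles as above (an inverted compare stands in for the vpandn's NOT):

vpcmpb    k1, zmm2, zmm3, 2           ; k1 = (zmm2 <= zmm3), the inverted compare
vmovdqu8  zmm0{k1}{z}, zmm4           ; the vpandn: zmm4 where not-greater, else 0
vpxord    zmm0, zmm0, zmm1            ; the vpxor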
In this, each instruction depends on the result of the previous, so the total latency from k1 being ready to the final zmm0 being ready is 3 cycles instead of 4. (The earlier version can run vpxord in parallel with vpcmpb, assuming ZMM4 is ready early enough.)
Zero-masking (and merge-masking) vmovdqu8 have 3-cycle latency on Skylake-X and Alder Lake (https://uops.info/). Same as vpblendmb, but vmovdqu32 and vmovdqu64 have 1-cycle latency. vpxord has 1-cycle latency even with masking, but vpaddb has 3-cycle latency with masking vs. 1 without. So it seems byte-masking is consistently 3-cycle latency, while dword/qword masking keeps the same latency as the unmasked instruction. Throughput isn't affected, though, so as long as you have enough instruction-level parallelism, out-of-order exec can hide the latency if it's not a long loop-carried dependency chain.

Footnote 1: wider elements allow masked booleans
This is for the benefit of future readers who are using a different element size. You definitely don't want to widen your byte elements to dword if you don't have to; that would get 1/4 the work done per vector, just to save 1 back-end uop via mov-elimination:
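A sketch of what that dword version would look like (same illustrative register roles as above):

vpcmpd     k1, zmm2, zmm3, 2          ; k1 = (zmm2 <= zmm3) per dword, the inverted compare
vmovdqa32  zmm0, zmm1                 ; plain full-register copy; can be mov-eliminated (footnote 2)
vpxord     zmm0{k1}, zmm0, zmm4       ; merge-masked: where not-greater get zmm1 ^ zmm4, elsewhere keep zmm1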
Footnote 2:
vmovdqu8 zmm0, zmm1 doesn't need an execution unit. But vmovdqu8 zmm0{k1}{z}, zmm1 does, and like other 512-bit uops it can only run on port 0 or 5 on current Intel CPUs, including Ice Lake and Alder Lake-P (on systems where its AVX-512 support isn't disabled).

Ice Lake broke mov-elimination only for GP-integer, not vectors, so an exact copy of a register is still cheaper than doing any masking or other work. Only having two SIMD execution ports makes the back-end a more common bottleneck than for code using 256-bit vectors, especially on Ice Lake and later, with the 5-wide front-end in Ice Lake and 6-wide in Alder Lake / Sapphire Rapids.
Most code has significant load/store and integer work, though.