What does MaskStore do behind the scenes?

Posted on 2025-01-31 18:22:10

My main programming language is C#, and lately I've been trying to learn about vector programming and some SIMD instructions on Intel x86 AVX2 for self-learning purposes. I came across the MaskStore intrinsic, which maps to the AVX2 instruction:

VPMASKMOVD m256, ymm, ymm

I'm just wondering how this instruction works behind the scenes. In pseudocode, is it something like:

for n in vector.values
    if (highest bit of mask is set for vector n)
    {
        address = source vector[n]
    }

王权女流氓 2025-02-07 18:22:10

Yes, that's correct.

The asm manual (https://www.felixcloutier.com/x86/vmaskmov#vmaskmovpd---256-bit-store) documents it with pseudocode like that. Intel's C/C++ intrinsics guide has less detailed but similar documentation: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#techs=SSE,SSE2,SSE3,SSSE3,SSE4_1,SSE4_2,AVX,Other&ig_expand=5420,5420,5623,5626,7359,5403,5039&text=maskmov.
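
As a rough illustration of that pseudocode, here is a minimal scalar sketch in C of the store form of VPMASKMOVD m256, ymm, ymm. The function name maskstore_epi32_model and the array-based signature are hypothetical, chosen only for this example; real hardware performs the eight per-element decisions as one instruction rather than as a loop.

```c
#include <stdint.h>

/* Scalar model (an illustration, not the hardware implementation) of
 * VPMASKMOVD m256, ymm, ymm in its store form.
 *   dst  : memory destination, room for 8 dwords
 *   mask : 8 dword lanes; only bit 31 of each lane is examined
 *   src  : 8 dword lanes of data to store                               */
void maskstore_epi32_model(int32_t dst[8],
                           const int32_t mask[8],
                           const int32_t src[8])
{
    for (int i = 0; i < 8; i++) {
        /* Is the highest bit of mask lane i set? */
        if ((uint32_t)mask[i] >> 31) {
            dst[i] = src[i];   /* store element i */
        }
        /* Otherwise dst[i] is left untouched: neither read nor written. */
    }
}
```

Note that a mask lane such as 0x7fffffff or 1 does not select its element in this model, because only the top bit counts.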

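Note that vmaskmovps / vmaskmovpd are AVX1 instructions, not AVX2; they have been supported since Sandybridge on Intel and since Bulldozer on AMD. (The integer form vpmaskmovd shown in the question does require AVX2.)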

It isn't very efficient on AMD CPUs, though, as per https://uops.info/ testing data. Masked stores aren't easy to emulate, unlike maskload, which can just do a regular load and then mask. A masked load only needs special hardware and/or a microcode assist if it has to do fault suppression when a masked-out part of the load would touch an unmapped page.

On Intel, masked stores are also first-class operations: memory-destination vmaskmovpd mem, ymm, ymm is only 3 uops since Skylake (p0 + p23 + p4), down from 4 on Sandybridge/Haswell (p0 + p1 + p23 + p4).

It's probably not a coincidence that Skylake has AVX-512 hardware, even though it's not enabled/exposed in the "client" chips; perhaps the instruction internally works as a compare-into-mask uop followed by a native masked-store uop. Without micro-fusion of store-address and store-data, that's 3 uops total. The ALU uop can only run on port 0 on Skylake, the same port required by vpmovq2m k, ymm (unlike vptestmq k, ymm, ymm), so we can infer that it probably uses a uop like vpmovq2m to generate the internal mask-register value from the mask vector operand.

On AMD Zen2, mask load is single-uop (much improved over Zen1), but mask store is still 10 or 19 uops for XMM or YMM vector width, respectively, with 4c or 6c throughput. That's much slower than load / AND-or-blend / store, if you can safely do a non-atomic RMW (which includes a load/store of elements you don't modify).
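
To make the load / blend / store alternative concrete, here is a hedged scalar sketch in C. The function name blend_store_model is hypothetical, and the per-lane selection by the top mask bit is an assumption that mirrors how a variable blend such as vblendvps picks elements. The point it shows is that every element is read and written back, including the ones that don't change, which is why this is only safe when a non-atomic read-modify-write of the whole vector is acceptable.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model (illustration only) of replacing a masked store with
 * load / blend / store. Unlike a masked store, this reads and rewrites
 * all 8 elements, so another thread's concurrent write to an unselected
 * element could be overwritten with a stale value. */
void blend_store_model(int32_t dst[8],
                       const int32_t mask[8],
                       const int32_t src[8])
{
    int32_t tmp[8];

    memcpy(tmp, dst, sizeof tmp);          /* full-width load */

    for (int i = 0; i < 8; i++) {
        if ((uint32_t)mask[i] >> 31) {     /* top bit selects the new value */
            tmp[i] = src[i];               /* blend */
        }
    }

    memcpy(dst, tmp, sizeof tmp);          /* full-width store, rewrites
                                              unmodified elements too */
}
```

For single-threaded code operating on memory that is known to be mapped and writable, the final result is the same as with the masked store.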

Masked stores have some corner cases that can be disastrous for performance, like taking a microcode assist on every instruction if used with an all-false mask on a read-only page.

That can happen if a page of memory hasn't been dirtied yet and is still copy-on-write mapped, so it's actually read-only as far as the hardware is concerned (in the page tables). If you then loop over it using load / compare / maskstore to conditionally replace some values, you might never dirty the page when no values need replacing, so many of the masked stores take a slow microcode assist.

But OTOH, it can be a bit faster than store/mask/reload on Skylake for the same array-replacement task if you don't hit that bad case.
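
The load / compare / maskstore replacement loop described above can be modelled in scalar C as follows. This is only a sketch: the function name replace_with_maskstore_model, its parameters, and the return value are hypothetical, and for brevity it processes whole groups of 8 elements and ignores any tail. The return value counts how many masked stores had at least one active lane. A result of 0 means every masked store ran with an all-false mask and memory was never written, which is the situation where a copy-on-write page would never get dirtied.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model (illustration only) of a load / compare / maskstore loop
 * that replaces old_value with new_value, 8 elements per iteration.
 * Elements past the last full group of 8 are not processed.
 * Returns the number of iterations whose mask had any lane set. */
size_t replace_with_maskstore_model(int32_t *data, size_t n,
                                    int32_t old_value, int32_t new_value)
{
    size_t stores_with_active_lane = 0;

    for (size_t base = 0; base + 8 <= n; base += 8) {
        int32_t mask[8];
        int any = 0;

        /* Load + compare: all-ones lane where the element matches. */
        for (int i = 0; i < 8; i++) {
            mask[i] = (data[base + i] == old_value) ? -1 : 0;
            any |= (mask[i] != 0);
        }

        /* Masked store: only lanes with the top mask bit set are written. */
        for (int i = 0; i < 8; i++) {
            if (mask[i] < 0) {
                data[base + i] = new_value;
            }
        }

        stores_with_active_lane += (size_t)any;
    }

    return stores_with_active_lane;
}
```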
