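What is MaskStore doing behind the scenes?
My main programming language is C#, and recently I have been trying to learn about vector programming and some of the SIMD instructions on Intel x86 AVX2 as a self-learning exercise. I came across the MaskStore instruction, which maps to the AVX2 instruction:
VPMASKMOVD m256, ymm, ymm
I just wanted to know how this instruction works behind the scenes. In pseudocode, is it: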
for n in vector.values
    if (highest bit of mask is set for element n)
    {
        address[n] = source vector[n]
    }
Yes, that's correct.
The asm manual https://www.felixcloutier.com/x86/vmaskmov#vmaskmovpd---256-bit-store documents it with pseudo-code like that. Intel's C/C++ intrinsics guide has less detailed but similar documentation: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#techs=SSE,SSE2,SSE3,SSSE3,SSE4_1,SSE4_2,AVX,Other&ig_expand=5420,5420,5623,5626,7359,5403,5039&text=maskmov.
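As a rough illustration of the semantics described by that pseudo-code, here is a minimal scalar sketch in C of a 256-bit masked store of eight 32-bit lanes. The function name maskstore_epi32_model is hypothetical and this is only a model of the architectural result, not how the hardware implements it:

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a masked store of eight 32-bit lanes, the shape of
   VPMASKMOVD m256, ymm, ymm. The name is hypothetical (illustration only). */
static void maskstore_epi32_model(int32_t *dst, const int32_t mask[8],
                                  const int32_t src[8])
{
    for (size_t i = 0; i < 8; ++i) {
        /* Only the most significant (sign) bit of each mask lane matters. */
        if (mask[i] < 0) {
            dst[i] = src[i];
        }
        /* Otherwise dst[i] is neither read nor written. */
    }
}
```

Calling it with a mask such as {-1, 0, -1, 0, 0, 0, 0, -1} would overwrite only lanes 0, 2 and 7 of the destination and leave the rest as they were.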
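Note that the vmaskmovps / vmaskmovpd forms are AVX1 instructions, not AVX2 (the integer vpmaskmovd / vpmaskmovq forms, including the VPMASKMOVD in the question, do require AVX2). The AVX1 forms have been supported since Sandybridge on Intel, and since Bulldozer on AMD.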
Not very efficiently on AMD CPUs, though, as per https://uops.info/ testing data. Masked stores aren't easy to emulate, unlike maskload, which can just do a regular load and then mask, only needing special hardware and/or a microcode assist if it has to do fault suppression when a masked-out part of the load would touch an unmapped page.
On Intel, masked stores are also first-class operations: only 3 uops for memory-destination vmaskmovpd mem, ymm, ymm since Skylake (p0 + p23+p4), down from 4 in Sandybridge/Haswell (p0+p1 + p23+p4). Probably not a coincidence that Skylake has AVX-512 hardware, even though it's not enabled/exposed in the "client" chips; perhaps it internally works as a compare-into-mask uop, and then a native masked-store uop. Without micro-fusion of store-address and store-data, that's 3 uops total. The ALU uop can only run on port-0 on Skylake, the same port required by vpmovq2m k, ymm, unlike vptestmq k, ymm,ymm, so we can infer that it probably uses a uop like vpmovq2m to generate the internal mask-register value from the mask vector operand.
On AMD Zen2, mask load is single-uop (much improved over Zen1), but mask store is still 10 or 19 uops for XMM or YMM vector width, respectively. 4c or 6c throughput, so much slower than load / AND or blend / store if you can safely do a non-atomic RMW. (Including a load/store of elements you don't modify.)
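To make the load / blend / store alternative concrete, here is a minimal scalar sketch in C of the non-atomic read-modify-write. The name blend_store_epi32_model is hypothetical; a real implementation would use one vector load, one blend and one vector store:

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of load / blend / store over eight 32-bit lanes.
   Unlike a true masked store, this reads AND rewrites every lane.
   The name is hypothetical (illustration only). */
static void blend_store_epi32_model(int32_t *dst, const int32_t mask[8],
                                    const int32_t src[8])
{
    int32_t merged[8];

    for (size_t i = 0; i < 8; ++i) {
        int32_t old = dst[i];                      /* load, even for kept lanes */
        merged[i] = (mask[i] < 0) ? src[i] : old;  /* blend on the mask sign bit */
    }
    for (size_t i = 0; i < 8; ++i) {
        dst[i] = merged[i];                        /* full-width store */
    }
}
```

The final memory contents match a masked store, but because the unmodified lanes are also loaded and stored back, this is only safe when all eight elements are mapped and writable and no other thread can write them in between.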
Maskmov has some corner cases that can be disastrous for performance, like taking a microcode assist on every instruction if used with an all-false mask on a read-only page.
That can happen if a page of memory hasn't been dirtied yet and is still copy-on-write mapped, so it's actually read-only as far as the HW is concerned (in the page tables). Then, if you loop over it using load / compare / maskstore to conditionally replace some values, you might never dirty the page if no values need replacing, so many instructions each take a slow microcode assist.
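The loop pattern in question can be sketched in scalar C as below. The name replace_masked_model and the returned counter are assumptions added for illustration: the counter tracks chunks whose mask was all-false, which are the iterations where a real masked store would execute without writing anything:

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a load / compare / maskstore loop that replaces every
   element equal to `from` with `to`, eight lanes per iteration.
   Assumes n is a multiple of 8; any tail elements are ignored.
   Returns the number of chunks whose mask was all-false.
   The name and return value are hypothetical (illustration only). */
static size_t replace_masked_model(int32_t *data, size_t n,
                                   int32_t from, int32_t to)
{
    size_t all_false_chunks = 0;

    for (size_t base = 0; base + 8 <= n; base += 8) {
        int any_lane_set = 0;
        for (size_t i = 0; i < 8; ++i) {
            /* compare: all-ones lane where the element matches */
            int32_t mask = (data[base + i] == from) ? -1 : 0;
            /* masked store: write only where the mask sign bit is set */
            if (mask < 0) {
                data[base + i] = to;
                any_lane_set = 1;
            }
        }
        if (!any_lane_set) {
            ++all_false_chunks;  /* nothing written: the page stays clean */
        }
    }
    return all_false_chunks;
}
```

If no element ever matches, every chunk is counted and the array is never written, which is the situation where a copy-on-write page stays read-only for the whole loop.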
But OTOH, it can be a bit faster than store/mask/reload on Skylake for the same array-replacement task if you don't hit that bad case.