Does Intel -O3 convert pairs of __m256d instructions into __m512d?
Will code written for 256-bit vector registers be compiled to use 512-bit instructions by the (2019) Intel compiler at the O3 optimization level?
E.g., will operations on two __m256d objects either be converted to the same number of operations on masked __m512d objects, or be grouped to make the most of the register, so that in the best case the total number of operations drops by a factor of 2?
Arch: Knights Landing
Comments (1)
Unfortunately, no: code written with AVX/AVX-2 intrinsics is not yet rewritten by ICC to use AVX-512 (this holds for both ICC 2019 and ICC 2021). There is no instruction fusing. Here is an example (see on GodBolt).
Generated code:
The same code is generated by both versions of ICC.
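To make the question concrete, here is a minimal sketch of the two forms being compared. It is not the original GodBolt example: the function names `add_two_m256d` and `add_one_m512d` are hypothetical, and the `target` attributes are a GCC/Clang convenience (an assumption, so the file builds without extra flags) rather than anything ICC-specific. The first function is what 256-bit intrinsic code typically looks like; the second is the 512-bit version you would have to write by hand, since the compiler does not merge the two 256-bit additions for you.

```c
#include <immintrin.h>

/* Hypothetical 256-bit version: two independent __m256d additions
   covering 8 doubles. The compiler keeps these as two 256-bit adds. */
__attribute__((target("avx")))
void add_two_m256d(const double *a, const double *b, double *out)
{
    __m256d a_lo = _mm256_loadu_pd(a);       /* elements 0..3 */
    __m256d a_hi = _mm256_loadu_pd(a + 4);   /* elements 4..7 */
    __m256d b_lo = _mm256_loadu_pd(b);
    __m256d b_hi = _mm256_loadu_pd(b + 4);
    _mm256_storeu_pd(out,     _mm256_add_pd(a_lo, b_lo));
    _mm256_storeu_pd(out + 4, _mm256_add_pd(a_hi, b_hi));
}

/* Hypothetical hand-written 512-bit equivalent: one __m512d addition
   covering the same 8 doubles. Requires AVX-512F hardware to run. */
__attribute__((target("avx512f")))
void add_one_m512d(const double *a, const double *b, double *out)
{
    __m512d va = _mm512_loadu_pd(a);         /* elements 0..7 */
    __m512d vb = _mm512_loadu_pd(b);
    _mm512_storeu_pd(out, _mm512_add_pd(va, vb));
}
```

Both functions compute the same result for 8 contiguous doubles; only call `add_one_m512d` on a CPU that reports AVX-512F support.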
Note that using AVX-512 will not always speed up your code by a factor of two. For example, on Skylake SP (server-side processors) there are two AVX/AVX-2 SIMD units that can be fused to execute AVX-512 instructions, but fusing does not improve throughput (assuming the SIMD units are the bottleneck). However, Skylake SP also supports an optional additional 512-bit SIMD unit that does not support AVX/AVX-2 (only available on some processors). In this case, AVX-512 can make your code up to twice as fast.