Does the SQRTPD instruction compute both square roots at the same time?

Posted 2025-02-04 10:02:49

I'm learning SIMD intrinsics and parallel computing. I am not sure if Intel's definition for the x86 instruction sqrtpd says that the square root of the two numbers that are passed to it will be calculated at the same time:


Performs a SIMD computation of the square roots of the two, four, or eight packed double-precision floating-point values in the source operand (the second operand) and stores the packed double-precision floating-point results in the destination operand (the first operand).


I understand that it explicitly says SIMD computation but does this imply that for this operation the root will be calculated simultaneously for both numbers?
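
For concreteness, here is a minimal C sketch of the intrinsic that compiles to sqrtpd (_mm_sqrt_pd from SSE2); the question is whether the single instruction it emits computes both lanes simultaneously:

```c
#include <stdio.h>
#include <immintrin.h>  // SSE2 intrinsics (baseline on x86-64)

int main(void) {
    // Pack two doubles into one 128-bit XMM register.
    __m128d v = _mm_set_pd(16.0, 2.0);   // lanes: {2.0, 16.0}

    // _mm_sqrt_pd compiles to a single sqrtpd instruction that
    // produces the square root of both packed elements.
    __m128d r = _mm_sqrt_pd(v);

    double out[2];
    _mm_storeu_pd(out, r);
    printf("sqrt(2.0) = %f, sqrt(16.0) = %f\n", out[0], out[1]);
    return 0;
}
```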

3 Answers

抚你发端 2025-02-11 10:02:49

For sqrtpd xmm, yes, modern CPUs do that truly in parallel, not running it through a narrower execution unit one at a time. Older (especially low-power) CPUs did do that. For AVX vsqrtpd ymm, some CPUs do perform it in two halves.

But if you're just comparing performance numbers against narrower operations, note that some CPUs like Skylake can use different halves of their wide div/sqrt unit for separate sqrtpd/sd xmm, so those have twice the throughput of YMM, even though it can do a full vsqrtpd ymm in parallel.

Same for AVX-512 vsqrtpd zmm: even Ice Lake splits it into two halves, as we can see from it being 3 uops (2 for port 0, where Intel puts the div/sqrt unit, and one that can run on other ports).

Being 3 uops is the key tell-tale for a sqrt instruction being wider than the execution unit on Intel, but you can look at the throughput of YMM vs. XMM vs. scalar XMM to see how it's able to feed narrower operations to different pipes of a wide execution unit independently.


The only difference is performance; the destination x/y/zmm register definitely has the square roots of each input element. Check performance numbers (and uop counts) on https://uops.info/ (currently down but normally very good), and/or https://agner.org/optimize/.

It's allowed but not guaranteed that CPUs internally have wide execution units, as wide as the widest vectors they support, and thus truly compute all results in parallel pipes.

Full-width execution units are common for instructions other than divide and square root, although AMD from Bulldozer through Zen 1 supported AVX/AVX2 with only 128-bit execution units, so vaddps ymm decoded to 2 uops, doing each half separately. Intel Alder Lake E-cores work the same way.

Some ancient and/or low-power CPUs (like Pentium M, K8, and Bobcat) have had only 64-bit wide execution units, running SSE instructions in two halves (for all instructions, not just "hard" ones like div/sqrt).

So far only Intel has supported AVX-512 on any CPUs, and (other than div/sqrt) they've all had full-width execution units. And unfortunately they haven't come up with a way to expose the powerful new capabilities like masking and better shuffles for 128 and 256-bit vectors on CPUs without the full AVX-512. There's some really nice stuff in AVX-512 totally separate from wider vectors.
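
As an illustration of the masking capability mentioned above, here is a small sketch using AVX-512 intrinsics (the function name sqrt_nonneg is made up for this example): the vsqrtpd executes under a mask, with no separate branch or blend instruction.

```c
#include <immintrin.h>  // compile with -mavx512f

// Take the square root only of the non-negative elements, keeping
// the original value in the other lanes (merge masking).
__m512d sqrt_nonneg(__m512d v) {
    __mmask8 k = _mm512_cmp_pd_mask(v, _mm512_setzero_pd(), _CMP_GE_OQ);
    return _mm512_mask_sqrt_pd(v, k, v);  // masked vsqrtpd
}
```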


The SIMD div / sqrt unit is often narrower than others

Divide and square root are inherently slow; it's not really possible to make them low latency. Pipelining them fully is also expensive: no current CPU can start a new div/sqrt operation every clock cycle. But recent CPUs have pipelined at least part of the operation: I think they normally end with a couple of steps of Newton-Raphson refinement, and that part can be pipelined since it only involves multiply/add/FMA-type operations.
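
The hardware's internal algorithm isn't documented, but a user-level analogue shows why such a Newton-Raphson tail pipelines well: after an initial estimate (here the rsqrtps approximation), the refinement is nothing but multiply/subtract operations. A sketch in C, with a made-up helper name:

```c
#include <immintrin.h>

// One Newton-Raphson step for reciprocal square root: given an
// estimate x ~= 1/sqrt(a), a refined estimate is
//     x' = x * (1.5 - 0.5 * a * x * x)
// which roughly doubles the number of correct bits.
static inline __m128 rsqrt_nr(__m128 a) {
    __m128 x   = _mm_rsqrt_ps(a);          // ~12-bit estimate (rsqrtps)
    __m128 ax2 = _mm_mul_ps(_mm_mul_ps(a, x), x);
    __m128 t   = _mm_sub_ps(_mm_set1_ps(1.5f),
                            _mm_mul_ps(_mm_set1_ps(0.5f), ax2));
    return _mm_mul_ps(x, t);               // only mul/sub: easy to pipeline
}
```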

Intel has supported AVX since Sandy Bridge, but it wasn't until Skylake that they widened the FP div/sqrt unit to 256 bits.

For example, Haswell runs vsqrtpd ymm as 3 uops, 2 for port 0 (where the div/sqrt unit is) and one for any port, presumably to recombine the results. The latency is just about a factor of 2 longer, and throughput is half. (A uop reading the result needs to wait for both halves to be ready.)

Agner Fog may have tested latency with vsqrtpd ymm reading its own result; IDK if Intel can let one half of the operation start before the other half is ready, or if the merging uop (or whatever it is) would end up forcing it to wait for both halves to be ready before starting either half of another div or sqrt. Instructions other than div/sqrt have full-width execution units and would always need to wait for both halves.

I also collected divps / pd / sd / ss throughputs and latencies for YMM and XMM on various CPUs in a table on Floating point division vs floating point multiplication

天涯沦落人 2025-02-11 10:02:49

To complement the great answer of @PeterCordes: this is indeed architecture-dependent. One can expect the two square roots to be computed in parallel (or at least efficiently pipelined at the ALU level) on most recent mainstream processors, though. Here are the latency and throughput numbers for Intel architectures (you can get them from Intel):

Architecture      Latency single   Latency packed XMM   Throughput single   Throughput packed XMM
Skylake                18                  18                   6                    6
Knights Landing        40                  38                  33                   10
Broadwell              20                  20                   7                   13
Haswell                20                  20                  13                   13
Ivy Bridge             21                  21                  14                   14

The throughput (number of cycles per instruction) is generally what matters in SIMD code, as long as out-of-order exec can overlap the latency chains of independent iterations. As you can see, on Skylake, Haswell and Ivy Bridge the throughput is the same, meaning that sqrtsd and sqrtpd xmm are equally fast. The pd version gets twice as much work done, so it must be computing two elements in parallel. Note that Coffee Lake, Cannon Lake and Ice Lake have the same timings as Skylake for this specific instruction.
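
To make the throughput-versus-latency distinction concrete, here is a C sketch contrasting the two regimes; the function names are illustrative, and sqrt_array assumes n is a multiple of 2:

```c
#include <immintrin.h>

// Latency-bound: each sqrt depends on the previous result, so
// out-of-order exec cannot overlap them; cost ~ latency per iteration.
__m128d sqrt_chain(__m128d v, int n) {
    for (int i = 0; i < n; i++)
        v = _mm_sqrt_pd(v);
    return v;
}

// Throughput-bound: iterations are independent, so the core can keep
// the (partially pipelined) div/sqrt unit busy; cost ~ throughput
// per iteration.
void sqrt_array(double *dst, const double *src, int n) {
    for (int i = 0; i < n; i += 2) {
        __m128d v = _mm_loadu_pd(src + i);
        _mm_storeu_pd(dst + i, _mm_sqrt_pd(v));
    }
}
```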

For Broadwell, sqrtpd does not execute the operation in parallel on the two lanes. Instead, it pipelines the operation and most of the computation is serialized (sqrtpd takes 1 cycle less than two sqrtsd). Or it has a parallel 2x 64-bit div/sqrt unit, but can independently use halves of it for scalar sqrt, which would explain the latency being the same but the throughput being better for scalar instructions (like how Skylake is for sqrt ymm vs. xmm).

For KNL Xeon Phi, the results are a bit surprising as sqrtpd xmm is much faster than sqrtsd while computing more items in parallel. Agner Fog's testing confirmed that, and that it takes many more uops. It's hard to imagine why; just merging the scalar result into the bottom of an XMM register shouldn't be much different from merging an XMM into the bottom of a ZMM, which is the same speed as a full vsqrtpd zmm. (It's optimized for AVX-512 with 512-bit registers, but it's also slow at div/sqrt in general; you're intended to use vrsqrt28pd on Xeon Phi CPUs, to get an approximation that only needs one Newton iteration to get close to double precision. Other AVX-512 CPUs only support vrsqrt14pd/ps, lacking the AVX-512ER extension)
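
As a sketch of that intended Xeon Phi recipe (assuming the AVX-512ER intrinsic _mm512_rsqrt28_pd; the function name is made up, and special inputs like 0.0 or infinity are not handled):

```c
#include <immintrin.h>  // compile with -mavx512f -mavx512er

// Requires AVX-512ER (Knights Landing / Knights Mill only).
// vrsqrt28pd gives 1/sqrt(a) accurate to about 2^-28, so one
// Newton-Raphson step gets close to full double precision.
__m512d fast_sqrt_knl(__m512d a) {
    __m512d x = _mm512_rsqrt28_pd(a);     // ~28-bit estimate of 1/sqrt(a)
    // One refinement step: x' = x * (1.5 - 0.5*a*x*x), using FMA.
    __m512d t = _mm512_fnmadd_pd(_mm512_mul_pd(_mm512_set1_pd(0.5), a),
                                 _mm512_mul_pd(x, x),
                                 _mm512_set1_pd(1.5));
    x = _mm512_mul_pd(x, t);
    // sqrt(a) = a * (1/sqrt(a)). Caveat: a == 0.0 gives 0 * inf = NaN
    // here; a real version would need masking for special cases.
    return _mm512_mul_pd(a, x);
}
```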

PS: It turns out that Intel reports the maximum throughput cost (worst case) when it is variable (0.0 is one of the best cases, for example). The latency is a bit different from the one reported in Agner Fog's instruction tables. The overall analysis remains the same, though.

等你爱我 2025-02-11 10:02:49

Yes, SIMD (vector) instructions on packed operands perform the same operation on all vector elements "in parallel". This follows from the fact that sqrtsd (scalar square root on one double) and sqrtpd (packed square root on two doubles in a 128-bit register) have the same latency.

vsqrtpd on 256-bit and larger vectors may have higher latency on some processors, where the operation is performed on 128-bit parts of the vector sequentially. This may be true for vdivpd as well, but not for other instructions - most of the time you can expect the latency to be the same regardless of vector size. Consult instruction tables if you want to be sure.
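
If you would rather measure than look it up, a rough latency probe chains dependent square roots and divides by the trip count. This is only a sketch: rdtsc counts reference cycles rather than core clock cycles, and div/sqrt latency can be data-dependent, so treat the number as approximate:

```c
#include <stdio.h>
#include <immintrin.h>
#include <x86intrin.h>  // __rdtsc

int main(void) {
    enum { N = 100000 };
    // OR-ing these bits back into each result keeps the input
    // non-trivial (in [1.5, 2.0)) without breaking the dependency
    // chain; it adds ~1 cycle per iteration to the measurement.
    const __m128d bits = _mm_set1_pd(1.5);
    __m128d v = bits;

    unsigned long long t0 = __rdtsc();
    for (int i = 0; i < N; i++)
        v = _mm_or_pd(_mm_sqrt_pd(v), bits);  // each iter depends on the last
    unsigned long long t1 = __rdtsc();

    double sink[2];
    _mm_storeu_pd(sink, v);  // keep the result live so it isn't optimized out
    printf("~%.1f reference cycles per (sqrtpd + orpd), result %g\n",
           (double)(t1 - t0) / N, sink[0]);
    return 0;
}
```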
