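Why does SIMD have single-data instructions when it's called SIMD?

Posted 2025-02-02 14:04:23

I've been wondering: it's called SIMD, as in single instruction, multiple data. So why does it have single-data instructions?

For example, vaddss is the single-data version of the multiple-data vaddps. Almost every SIMD instruction has a single-data version.

Why?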

感情旳空白 2025-02-09 14:04:23

It isn't a SIMD instruction in that sense

vaddss is a scalar FP math instruction that operates on data in the FP/SIMD registers (XMM0..15). It exists because x87 is not a very convenient compiler target with its stack-based registers that often need fxch, and other quirks. Intel added a new way to do scalar FP math along with SSE1 (float) and SSE2 (double), which is fortunately baseline for x86-64 so everyone can just use it.

People who call that a SIMD instruction are talking about one of:

Which registers it operates on. (XMM0 is 16 bytes wide and clearly a SIMD register, even when you only care about the low element holding a scalar value.)
The fact that it's an AVX instruction, so it was introduced with an ISA extension that was primarily aimed at SIMD usage, and thus is called a SIMD extension or instruction set.
Which also means it uses the MXCSR for rounding mode and FP exception recording / unmasking, and the kinds of exceptions it can take are the same as other SSE/AVX instructions which Intel documents as "SIMD Floating-Point Exceptions" as concise terminology to distinguish it from legacy x87.
Or they're talking about the use-case of doing something to just the low element when the high elements have actual data. (Quite rare, but something you could do. Maybe more likely with sd scalar double, where the low double is one half of an XMM register.)

Or they're just plain wrong if they actually mean it in terms of Flynn's taxonomy of SISD vs. SIMD vs. MIMD etc. I highly doubt anyone would actually mean that, though. The ss and sd scalar FP math instructions are SISD, single-instruction single-data. And BTW, they only exist for FP math; x86 already has instructions like add eax, ecx for scalar integer math, and doesn't have scalar versions of paddb or even xorps.
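The difference between the scalar and packed forms can be sketched with compiler intrinsics. This is a minimal illustration, assuming an x86 compiler with SSE1 available (baseline on x86-64); the helper names add_scalar and add_packed are hypothetical, and the compiler emits addss/addps or, with AVX enabled, vaddss/vaddps.

```c
#include <xmmintrin.h>  /* SSE1 intrinsics: _mm_add_ss, _mm_add_ps */

/* Hypothetical helper: one scalar add in the low lane (addss). */
void add_scalar(const float a[4], const float b[4], float out[4])
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    /* Lane 0 = a[0] + b[0]; lanes 1..3 are copied from va unchanged. */
    _mm_storeu_ps(out, _mm_add_ss(va, vb));
}

/* Hypothetical helper: four adds at once (addps). */
void add_packed(const float a[4], const float b[4], float out[4])
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    /* Every lane i = a[i] + b[i]. */
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}
```

Both helpers use the same 16-byte XMM registers; only the number of lanes that take part in the arithmetic differs.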

One reason for having separate scalar FP math instructions is that using addps would also operate on whatever garbage might be in the high elements of XMM registers. This can raise extra FP exceptions, which are usually masked and therefore only recorded in MXCSR (visible through fenv.h), but which would trap to the OS if unmasked.

With the upper elements all 0.0 (which isn't required by the calling convention, BTW), addps wouldn't raise any extra exceptions, but divps would divide by zero.
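A rough way to observe this is to read the sticky exception flags in MXCSR after a packed and a scalar division. The sketch below assumes an x86 compiler with SSE1; div_flags is a hypothetical helper, and the volatile variables are only there to discourage the compiler from folding or narrowing the division. Note that 0.0 / 0.0 sets the invalid-operation flag, whereas a non-zero value divided by 0.0 would set the divide-by-zero flag, so the helper reports both.

```c
#include <xmmintrin.h>  /* _mm_set_ss, _mm_div_ss, _mm_div_ps, MXCSR macros */

/* Hypothetical helper: run one division and return the invalid /
   divide-by-zero flags it left behind in MXCSR. */
unsigned int div_flags(int packed)
{
    volatile float one = 1.0f, two = 2.0f;  /* volatile: block constant folding */
    volatile float sink[4];
    float tmp[4];
    int i;

    _MM_SET_EXCEPTION_STATE(0);             /* clear the sticky flags */
    __m128 a = _mm_set_ss(one);             /* (1, 0, 0, 0) */
    __m128 b = _mm_set_ss(two);             /* (2, 0, 0, 0) */
    __m128 q = packed ? _mm_div_ps(a, b)    /* also computes 0/0 in lanes 1..3 */
                      : _mm_div_ss(a, b);   /* computes only 1/2 in lane 0 */
    _mm_storeu_ps(tmp, q);
    for (i = 0; i < 4; i++)
        sink[i] = tmp[i];                   /* keep all four lanes live */

    return _MM_GET_EXCEPTION_STATE() & (_MM_EXCEPT_INVALID | _MM_EXCEPT_DIV_ZERO);
}
```

With the default MXCSR (all exceptions masked), div_flags(1) should report the invalid flag from the upper lanes, while div_flags(0) should report nothing.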

With non-zero garbage like small integers, the bit-pattern might represent a subnormal float, or a result might be subnormal, causing huge slowdowns (a factor of ~100) as the CPU takes a microcode assist to handle subnormal input or output in many cases (or, when SSE1 was new in Pentium III, probably all cases of subnormals). That is, unless you set FTZ and DAZ (flush to zero, denormals are zero) like gcc -ffast-math does.
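To see why small integers are a problem, recall that an IEEE-754 binary32 value is subnormal when its 8-bit exponent field is zero and its 23-bit fraction is non-zero. Any integer from 1 to 0x7FFFFF left in a lane therefore reads as a subnormal float. A minimal portable sketch (the helper name bits_are_subnormal_float is hypothetical):

```c
#include <stdint.h>

/* Hypothetical helper: does this 32-bit pattern, read as an IEEE-754
   binary32 float, encode a subnormal value? */
int bits_are_subnormal_float(uint32_t bits)
{
    uint32_t exponent = (bits >> 23) & 0xFFu;  /* 8-bit biased exponent */
    uint32_t fraction = bits & 0x7FFFFFu;      /* 23-bit fraction */
    return exponent == 0 && fraction != 0;     /* zero exponent, non-zero fraction */
}
```

For example, the integer 42 qualifies, while 0 (which is +0.0) and 0x3F800000 (which is 1.0f) do not.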

For instructions like xorps or paddq which don't do actual FP math, no FP exceptions or microcode assists are possible. You can just use them even if you only care about the low 32 or 64 bits of an XMM.

MMX or SSE2 had occasional uses in 32-bit code for doing scalar 64-bit integer math, with zeros or garbage in the upper bytes. MMX paddq mm0, mm1 is a SISD instruction, but SSE2 paddq xmm0, xmm1 is a SIMD instruction.
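As an illustration of that use, here is a sketch of a scalar 64-bit add done through the low half of an XMM register. It assumes an x86 compiler with SSE2; add64_via_xmm is a hypothetical name, and a compiler targeting 64-bit mode would normally just use a plain integer add instead.

```c
#include <emmintrin.h>  /* SSE2 intrinsics: _mm_loadl_epi64, _mm_add_epi64 */
#include <stdint.h>

/* Hypothetical helper: 64-bit integer add using paddq on XMM registers. */
uint64_t add64_via_xmm(uint64_t a, uint64_t b)
{
    /* movq load: low 64 bits from memory, upper 64 bits zeroed. */
    __m128i va = _mm_loadl_epi64((const __m128i *)&a);
    __m128i vb = _mm_loadl_epi64((const __m128i *)&b);
    /* paddq adds both 64-bit lanes; only the low lane matters here. */
    __m128i sum = _mm_add_epi64(va, vb);
    uint64_t out;
    _mm_storel_epi64((__m128i *)&out, sum);  /* store the low 64 bits */
    return out;
}
```

The upper lane is computed as well (0 + 0 here), but since paddq cannot raise FP exceptions or take microcode assists, whatever sits there is harmless.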

SSE1 was new in Pentium 3, where the SIMD execution units and registers were only 64 bits wide. addps decoded to 2 uops; addss decoded to 1. So there was a performance motivation, too, even in the best case.

This is also likely the reason for Intel's unfortunate design where sqrtss and cvtsi2ss and others merge into the destination, requiring either spending extra front-end bandwidth on xor-zeroing, or risking false dependencies; see "Why does adding an xorps instruction make this function using cvtsi2ss and addss ~5x faster?". It's a short-sighted design decision to make them single-uop on Pentium 3, which they unfortunately followed in SSE2 for double precision, and stuck to for AVX and AVX-512 when they had a chance to introduce better versions with different semantics. At least the AVX versions take a 2nd source register to merge with, so you can pick a "cold" reg as a workaround; see my answer on the linked duplicate.
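The merge semantics are visible in the intrinsic's signature: _mm_cvtsi32_ss takes a vector operand whose upper lanes pass through to the result. The sketch below assumes an x86 compiler with SSE1 and uses hypothetical helper names. It shows the semantics only; the false-dependency cost is a microarchitectural effect that depends on what the compiler emits and cannot be seen from the result values.

```c
#include <xmmintrin.h>  /* _mm_cvtsi32_ss, _mm_setzero_ps, _mm_cvtss_f32 */

/* Hypothetical helper: cvtsi2ss writes only lane 0; lanes 1..3 are
   merged in from the old register contents. */
void int_to_float_merge(const float old[4], int x, float out[4])
{
    _mm_storeu_ps(out, _mm_cvtsi32_ss(_mm_loadu_ps(old), x));
}

/* Hypothetical helper: convert into a freshly zeroed register, which is
   the xor-zeroing idiom that breaks the dependency on the old value. */
float int_to_float_fresh(int x)
{
    return _mm_cvtss_f32(_mm_cvtsi32_ss(_mm_setzero_ps(), x));
}
```

In the first helper the result depends on the previous contents of the register, which is exactly what forces the hardware to wait for whatever instruction last wrote it.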

It's normal for scalar FP to share registers with SIMD

It isn't necessary or useful to have yet another set of registers for scalar FP, and sharing with the x87 FPU or the general-purpose integer registers would each be worse for separate reasons.

It's totally normal on other ISAs for the SIMD registers to overlap or be the same as the scalar FP registers; some ISAs (like ARM) that didn't have weirdo designs like x87 didn't need new architectural state to introduce SIMD. For example, ARM's NEON q0..q15 16-byte registers map to pairs of the d0..d31 double-precision FP registers that existed with VFPv3.

(I'm not sure if the partial-register aliasing was actually common in SIMD extensions for other ISAs, though. Probably some introduced new architectural state, or just used FP double-precision registers as 64-bit integer SIMD instead of 128-bit.)

In an OS kernel you often talk about saving "FPU state" on context switch (as opposed to just the general-purpose integer registers), and these days that's short-hand for FPU and SIMD state. e.g. in the Linux kernel, you need to use kernel_fpu_begin() before running instructions that use XMM/YMM/ZMM registers. (e.g. in the RAID5 / RAID6 drivers).

About the author

野の
