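Why does SIMD have single-data instructions when it's called SIMD?
I've been wondering: it's called SIMD, as in single instruction, multiple data. So why does it have single-data instructions?
For example, vaddss is the single-data version of the multiple-data vaddps. Almost every SIMD instruction has a single-data version.
Why?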
It isn't a SIMD instruction in that sense
vaddss is a scalar FP math instruction that operates on data in the FP/SIMD registers (XMM0..15). It exists because x87 is not a very convenient compiler target, with its stack-based registers that often need fxch, and other quirks. Intel added a new way to do scalar FP math along with SSE1 (float) and SSE2 (double), which is fortunately baseline for x86-64, so everyone can just use it.
People who call that a SIMD instruction are talking about one of:
(Very rare, but a thing you could do. Perhaps with sd scalar double, where the low double is one half of an XMM register.)
Or they're just plain wrong if they actually mean it in terms of Flynn's taxonomy of SISD vs. SIMD vs. MIMD etc. I highly doubt anyone would actually mean that, though.
The ss and sd scalar FP math instructions are SISD, single-instruction single-data. And BTW, they only exist for FP math; x86 already has instructions like add eax, ecx for scalar integer math, and doesn't have scalar versions of paddb or even xorps.
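To make the ss vs. ps distinction concrete, here is a minimal sketch in C with SSE intrinsics. It assumes an x86 target and a compiler that ships <xmmintrin.h>; the function names add_low_lane and add_all_lanes are made up for this illustration. _mm_add_ss is the intrinsic that corresponds to addss and _mm_add_ps the one that corresponds to addps, although a compiler is free to pick other instructions that give the same result.

```c
#include <xmmintrin.h>  /* SSE1 intrinsics: _mm_add_ss, _mm_add_ps, ... */

/* Hypothetical helper: addss-style add.
   Only lane 0 is summed; lanes 1..3 of the result are copied from a. */
void add_low_lane(const float a[4], const float b[4], float out[4])
{
    __m128 va = _mm_loadu_ps(a);   /* unaligned load of 4 floats */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ss(va, vb));
}

/* Hypothetical helper: addps-style add.
   All four lanes are summed independently. */
void add_all_lanes(const float a[4], const float b[4], float out[4])
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}
```

With a = {1, 2, 3, 4} and b = {10, 20, 30, 40}, add_low_lane produces {11, 2, 3, 4} and add_all_lanes produces {11, 22, 33, 44}. Both work on the same 128-bit register type; the scalar form just leaves the upper lanes alone.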
One reason for having separate scalar FP math instructions is that using addps would also operate on whatever garbage might be in the high elements of XMM registers. This can raise extra FP exceptions (usually masked, so only recorded in MXCSR (fenv.h), but if unmasked would trap to the OS).
With the upper elements all 0.0 (which isn't required by the calling convention, BTW), addps wouldn't raise any extra exceptions, but divps would divide by zero.
With non-zero garbage like small integers, it might be a bit-pattern for a subnormal float, or a result might be subnormal, causing huge slowdowns (factor of ~100) as the CPU takes a microcode assist to handle subnormal input or output in many cases (or when SSE1 was new in Pentium III, probably all cases of subnormals), unless you set FTZ and DAZ (flush to zero, denormals are zero) like gcc -ffast-math does.
For instructions like xorps or paddq which don't do actual FP math, no FP exceptions or microcode assists are possible. You can just use them even if you only care about the low 32 or 64 bits of an XMM register.
MMX or SSE2 had occasional uses in 32-bit code for doing scalar 64-bit integer math, with zeros or garbage in the upper bytes. MMX paddq mm0, mm1 is a SISD instruction, but SSE2 paddq xmm0, xmm1 is a SIMD instruction.
SSE1 was new in Pentium 3, where the SIMD execution units and registers were only 64 bits wide. addps decoded to 2 uops; addss decoded to 1. So there was a performance motivation, too, even in the best case.
This is also likely the reason for Intel's unfortunate design where sqrtss and cvtsi2ss and others merge into the destination, requiring either spending extra front-end bandwidth on xor-zeroing, or risking false dependencies: see "Why does adding an xorps instruction make this function using cvtsi2ss and addss ~5x faster?". It's a short-sighted design decision to make them single-uop on Pentium 3, which they unfortunately followed in SSE2 for double precision, and stuck to for AVX and AVX-512 when they had a chance to introduce better versions with different semantics. At least the AVX versions take a 2nd source register to merge with, so you can pick a "cold" reg as a workaround; see my answer on the linked duplicate.
It's normal for scalar FP to share registers with SIMD
It isn't necessary or useful to have yet another set of registers for scalar FP, and sharing with the x87 FPU or the general-purpose integer registers would each be worse for separate reasons.
It's totally normal on other ISAs for the SIMD registers to overlap or be the same as the scalar FP registers; some ISAs (like ARM) that didn't have weirdo designs like x87 didn't need new architectural state to introduce SIMD. e.g. ARM's NEON q0..q15 16-byte registers map to pairs of d0..d31 double-precision FP registers that existed with VFPv3.
(I'm not sure if the partial-register aliasing was actually common in SIMD extensions for other ISAs, though. Probably some introduced new architectural state, or just used FP double-precision registers as 64-bit integer SIMD instead of 128-bit.)
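The q/d aliasing described above can be pictured with a toy model in portable C. This is only a sketch of the register-file layout (q<n> occupying the same bytes as d<2n> and d<2n+1>), not ARM code; the type regfile_model and the function q_half are hypothetical names invented for the illustration.

```c
#include <stdint.h>
#include <string.h>

/* Toy register file: the same 256 bytes viewed either as
   32 x 64-bit registers (d0..d31) or 16 x 128-bit registers (q0..q15). */
typedef union {
    uint64_t d[32];
    uint8_t  q[16][16];
} regfile_model;

/* Read one 64-bit half of model register q<n>.
   half = 0 selects the bytes shared with d<2n>, half = 1 those of d<2n+1>. */
uint64_t q_half(const regfile_model *rf, int n, int half)
{
    uint64_t v;
    memcpy(&v, &rf->q[n][8 * half], sizeof v);  /* copy 8 raw bytes */
    return v;
}
```

Writing d[2] and d[3] in this model and then reading q_half(&rf, 1, 0) and q_half(&rf, 1, 1) returns the same two values, which is the sense in which no new architectural state is needed.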
In an OS kernel you often talk about saving "FPU state" on context switch (as opposed to just the general-purpose integer registers), and these days that's shorthand for FPU and SIMD state. e.g. in the Linux kernel, you need to use kernel_fpu_begin() before running instructions that use XMM/YMM/ZMM registers (e.g. in the RAID5 / RAID6 drivers).
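Coming back to the point about extra FP exceptions from the upper elements: the following is a rough sketch in C of how one might observe it, assuming an x86 target with <xmmintrin.h> and the default MXCSR setting where exceptions are masked. The function names are hypothetical. Both operands have zeros in lanes 1..3, so the packed divide computes 0/0 there, which sets the sticky invalid-operation flag (a non-zero value divided by 0 would set the divide-by-zero flag instead). The volatile variables are only there to stop the compiler from folding or narrowing the division.

```c
#include <xmmintrin.h>  /* _mm_div_ps, _mm_div_ss, MXCSR helper macros */

volatile float lane_sink[4];  /* keeps all four result lanes live */

/* Build {low, 0, 0, 0} from values the compiler cannot see through. */
static __m128 low_lane_only(float low)
{
    volatile float lo = low, zero = 0.0f;
    return _mm_set_ps(zero, zero, zero, lo);  /* last argument is lane 0 */
}

/* Store every lane, then return the sticky exception flags from MXCSR. */
static unsigned int flags_after(__m128 r)
{
    float tmp[4];
    _mm_storeu_ps(tmp, r);
    for (int i = 0; i < 4; i++)
        lane_sink[i] = tmp[i];
    return _MM_GET_EXCEPTION_STATE();
}

/* divps-style: divides all four lanes, including the 0/0 upper lanes. */
unsigned int flags_from_packed_div(void)
{
    _MM_SET_EXCEPTION_STATE(0);  /* clear flags left over from earlier code */
    return flags_after(_mm_div_ps(low_lane_only(6.0f), low_lane_only(2.0f)));
}

/* divss-style: divides lane 0 only; upper lanes are just copied. */
unsigned int flags_from_scalar_div(void)
{
    _MM_SET_EXCEPTION_STATE(0);
    return flags_after(_mm_div_ss(low_lane_only(6.0f), low_lane_only(2.0f)));
}
```

Under those assumptions, flags_from_packed_div() should report _MM_EXCEPT_INVALID while flags_from_scalar_div() reports no flags at all, even though both compute the same 6.0 / 2.0 in the low lane.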