What is the difference between superscaling and pipelining?
This looks like too simple a question to ask, but I am asking it after going through a few slide decks on both.
Both methods increase instruction throughput, and superscaling almost always makes use of pipelining as well. Superscaling has more than one execution unit, and so does pipelining, or am I wrong here?
Superscalar design involves the processor being able to issue multiple instructions in a single clock, with redundant facilities to execute an instruction. We're talking about within a single core, mind you -- multicore processing is different.
Pipelining divides an instruction into steps, and since each step is executed in a different part of the processor, multiple instructions can be in different "phases" each clock.
They're almost always used together. This image from Wikipedia shows both concepts in use, as these concepts are best explained graphically:
Here, two instructions are being executed at a time in a five-stage pipeline.
To break it down further, given your recent edit:
In the example above, an instruction goes through 5 stages to be "performed". These are IF (instruction fetch), ID (instruction decode), EX (execute), MEM (memory access), WB (write the result back to the register file).
In a very simple processor design, every clock a different stage would be completed so we'd have:
Which would do one instruction in five clocks. If we then add a redundant execution unit and introduce superscalar design, we'd have this, for two instructions A and B:
Two instructions in five clocks -- a theoretical maximum gain of 100%.
Pipelining allows the stages of different instructions to be executed simultaneously, so with both techniques combined we would end up with something like (for ten instructions A through J):
In nine clocks, we've executed ten instructions -- you can see where pipelining really moves things along. And that is an explanation of the example graphic, not how it's actually implemented in the field (that's black magic).
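The three cases described above (one instruction with no overlap, two instructions on redundant units, and ten instructions on a pipelined two-wide machine) can be laid out with a small script. This is only a minimal sketch of the toy model in this answer: `schedule`, `total_clocks` and `render` are names made up for illustration, every stage is assumed to take exactly one clock, and there are no stalls.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def schedule(n_instructions, width=1, pipelined=True):
    """Return {instruction name: first clock (0-based)} for a toy machine.

    width     -- how many instructions are issued together (1 = scalar)
    pipelined -- if False, a group must leave WB before the next group starts
    """
    step = 1 if pipelined else len(STAGES)
    starts = {}
    for i in range(n_instructions):
        group = i // width                      # instructions issued together
        starts[chr(ord("A") + i)] = group * step
    return starts

def total_clocks(starts):
    # The last group to start still needs all five stages to finish.
    return max(starts.values()) + len(STAGES)

def render(starts):
    """One row per instruction, one column per clock."""
    clocks = total_clocks(starts)
    rows = []
    for name, start in starts.items():
        cells = [""] * clocks
        for offset, stage in enumerate(STAGES):
            cells[start + offset] = stage
        rows.append(name + "  " + " ".join(f"{c:<3}" for c in cells).rstrip())
    return "\n".join(rows)

for label, starts in [
    ("simple", schedule(1)),
    ("superscalar only", schedule(2, width=2, pipelined=False)),
    ("superscalar + pipelined", schedule(10, width=2)),
]:
    print(f"{label}: {total_clocks(starts)} clocks")
    print(render(starts))
    print()
```

The clock counts come out as 5, 5 and 9, matching the figures in the text. Setting `width=1` in the last call shows what pipelining alone would give for the same ten instructions.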
The Wikipedia articles for Superscalar and Instruction pipeline are pretty good.
A long time ago, CPUs executed only one machine instruction at a time. Only when it was completely finished did the CPU fetch the next instruction from memory (or, later, the instruction cache).
Eventually, someone noticed that this meant that most of a CPU did nothing most of the time, since there were several execution subunits (such as the instruction decoder, the integer arithmetic unit, and FP arithmetic unit, etc.) and executing an instruction kept only one of them busy at a time.
Thus, "simple" pipelining was born: once one instruction was done decoding and went on towards the next execution subunit, why not already fetch and decode the next instruction? If you had 10 such "stages", then by having each stage process a different instruction you could theoretically increase the instruction throughput tenfold without increasing the CPU clock at all! Of course, this only works flawlessly when there are no conditional jumps in the code (this led to a lot of extra effort to handle conditional jumps specially).
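The "tenfold" figure is the limit for a long run of instructions, which a quick calculation makes visible. The sketch below assumes an idealized pipeline in which every stage takes one cycle and nothing ever stalls (in particular, no conditional jumps); the function names are made up for illustration.

```python
def cycles_unpipelined(n, stages):
    # Each instruction runs through every stage before the next one is fetched.
    return n * stages

def cycles_pipelined(n, stages):
    # The first instruction needs `stages` cycles to fill the pipeline;
    # after that, one instruction completes every cycle.
    return stages + (n - 1)

for n in (1, 10, 1000):
    speedup = cycles_unpipelined(n, 10) / cycles_pipelined(n, 10)
    print(n, round(speedup, 2))
# prints:
# 1 1.0
# 10 5.26
# 1000 9.91
```

So a single instruction gains nothing, and the speedup only approaches the number of stages as the instruction stream gets long.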
Later, with Moore's law continuing to be correct for longer than expected, CPU makers found themselves with ever more transistors to make use of and thought "why have only one of each execution subunit?". Thus, superscalar CPUs with multiple execution subunits able to do the same thing in parallel were born, and CPU designs became much, much more complex to distribute instructions across these fully parallel units while ensuring the results were the same as if the instructions had been executed sequentially.
An Analogy: Washing Clothes
Imagine a dry cleaning store with the following facilities: a rack for hanging dirty or clean clothes, a washer and a dryer (each of which can handle one garment at a time), a folding table, and an ironing board.
The attendant who does all of the actual washing and drying is rather dim-witted so the store owner, who takes the dry cleaning orders, takes special care to write out each instruction very carefully and explicitly.
On a typical day these instructions may be something along the lines of:
The attendant follows these instructions to a T, being very careful never to do anything out of order. As you can imagine, it takes a long time to get the day's laundry done, because fully washing, drying, and folding each piece of laundry takes a long time and it must all be done one piece at a time.
However, one day the attendant quits and a new, smarter attendant is hired, who notices that most of the equipment is lying idle at any given time during the day. While the pants were drying, neither the ironing board nor the washer was in use. So he decided to make better use of his time. Thus, instead of the above series of steps, he would do this:
This is pipelining: sequencing unrelated activities so that they use different components at the same time. By keeping as many of the different components active at once as possible, you maximize efficiency and speed up execution, in this case reducing 16 "cycles" to 9, which cuts the running time by over 40%.
Now, the little dry cleaning shop started to make more money because they could work so much faster, so the owner bought an extra washer, dryer, ironing board, folding station, and even hired another attendant. Now things are even faster. Instead of the above, you have:
This is superscalar design: multiple sub-components capable of doing the same task simultaneously, with the processor deciding how to use them. In this case it resulted in a roughly 50% increase in throughput (in 18 "cycles" the new architecture could run through 3 iterations of this "program" while the previous architecture could only run through 2).
Older processors such as the 386 or 486 are scalar processors: they issue at most one instruction per clock, in exactly the order in which the instructions were received (the 486 is pipelined, but not superscalar). Modern consumer processors since the PowerPC/Pentium are both pipelined and superscalar. A Core2 CPU can run the same code that was compiled for a 486 while still taking advantage of instruction-level parallelism, because it contains its own internal logic that analyzes the machine code and determines how to reorder and run it (what can run in parallel, what can't, etc.). This is the essence of superscalar design and why it's so practical.
In contrast, a vector parallel processor performs an operation on several pieces of data at once (a vector). Thus, instead of just adding x and y, a vector processor would add, say, x0,x1,x2 to y0,y1,y2 (resulting in z0,z1,z2). The problem with this design is that it is tightly coupled to the specific degree of parallelism of the processor. If you run scalar code on a vector processor (assuming you could), you would see no advantage from the vector parallelization, because it has to be used explicitly. Similarly, if you wanted to take advantage of a newer vector processor with more parallel processing units (e.g. capable of adding vectors of 12 numbers instead of just 3), you would need to recompile your code. Vector processor designs were popular in the oldest generation of supercomputers because they were easy to design and because large classes of problems in science and engineering have a great deal of natural parallelism.
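To make the coupling concrete, here is a minimal sketch that counts "instructions" for a scalar add and for a vector add with a fixed number of lanes. It is plain Python standing in for hardware, and `scalar_add`, `vector_add` and `lanes` are hypothetical names, not a real vector instruction set.

```python
def scalar_add(xs, ys):
    """One addition per 'instruction'."""
    out = []
    for x, y in zip(xs, ys):
        out.append(x + y)
    return out, len(xs)            # (result, instructions issued)

def vector_add(xs, ys, lanes):
    """Each 'instruction' adds up to `lanes` elements at once."""
    out, instructions = [], 0
    for i in range(0, len(xs), lanes):
        out.extend(x + y for x, y in zip(xs[i:i + lanes], ys[i:i + lanes]))
        instructions += 1
    return out, instructions

xs = list(range(12))
ys = list(range(100, 112))

print(scalar_add(xs, ys)[1])            # 12 instructions
print(vector_add(xs, ys, lanes=3)[1])   # 4 instructions
print(vector_add(xs, ys, lanes=12)[1])  # 1 instruction
```

The result is identical in all three cases, but the program has to name the lane count explicitly. Code written for `lanes=3` gets no benefit from a 12-lane machine until it is rewritten or recompiled, which is the drawback described above.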
Superscalar processors can also perform speculative execution. Rather than leaving processing units idle while waiting for a code path to finish executing before branching, a processor can make a best guess and start executing code past the branch before the prior code has finished processing. When execution of the prior code catches up to the branch point, the processor compares the actual branch with its guess. If the guess was correct, it simply continues (already well ahead of where it would have been by just waiting); otherwise it invalidates the results of the speculative execution and runs the code for the correct branch.
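A toy cycle count shows why guessing pays off. The sketch below is an assumption-laden model, not how any real CPU works: the guess is simply "same direction as last time", the 4-cycle `penalty` is an arbitrary illustrative number, and a wrong guess is assumed to cost exactly as much as waiting would have.

```python
def run(branch_outcomes, penalty=4, speculate=True):
    """Count cycles spent on a sequence of branches.

    Without speculation, every branch waits `penalty` cycles to be resolved.
    With speculation, a correct guess costs nothing extra and a wrong guess
    costs `penalty` cycles (the speculative work is thrown away).
    """
    cycles = 0
    last = True                      # initial guess: branch is taken
    for taken in branch_outcomes:
        cycles += 1                  # the branch instruction itself
        if not speculate:
            cycles += penalty        # always wait for the outcome
        else:
            if last != taken:        # guess was wrong: roll back
                cycles += penalty
            last = taken             # remember the real outcome
    return cycles

# A loop branch: taken nine times, then falls through once.
loop = [True] * 9 + [False]
print(run(loop, speculate=False))   # 50
print(run(loop, speculate=True))    # 14
```

In this model the speculating processor only pays once, at the final iteration where the loop exits and the guess is wrong.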
Pipelining is what a car company does when manufacturing its cars. It breaks the process of putting together a car into stages and performs the different stages at different points along an assembly line, each staffed by different people. The net result is that cars are manufactured at exactly the speed of the slowest stage alone.
In CPUs the pipelining process is exactly the same. An "instruction" is broken down into various stages of execution, usually something like: 1. fetch instruction, 2. fetch operands (registers or memory values that are read), 3. perform computation, 4. write results (to memory or registers). The slowest of these might be the computation part, in which case the overall throughput of instructions through this pipeline is just the speed of the computation part (as if the other parts were "free").
Super-scalar in microprocessors refers to the ability to run several instructions from a single execution stream in parallel. So if a car company ran two assembly lines then obviously they could produce twice as many cars. But if the process of putting a serial number on the car was at the last stage and had to be done by a single person, then that person would have to alternate between the two pipelines and guarantee that they could get each car done in half the time of the slowest stage in order to avoid becoming the slowest stage themselves.
Super-scalar in microprocessors is similar but usually has far more restrictions. So the instruction fetch stage will typically produce more than one instruction during its stage -- this is what makes super-scalar in microprocessors possible. There would then be two fetch stages, two execution stages, and two write back stages. This obviously generalizes to more than just two pipelines.
This is all fine and dandy, but from the perspective of sound execution both techniques could lead to problems if done blindly. For correct execution of a program, it is assumed that the instructions are executed completely, one after another, in order. If two sequential instructions have inter-dependent calculations or use the same registers then there can be a problem: the later instruction needs to wait for the write back of the previous instruction to complete before it can perform its operand fetch stage. Thus you need to stall the second instruction by two stages before it is executed, which defeats the purpose of these techniques in the first place.
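The dependency check behind such a stall can be sketched in a few lines. This is a deliberately simplified model: it only looks at adjacent instructions, it assumes the four-stage pipeline listed earlier (fetch, operand fetch, compute, write back), and `Instr`, `count_stalls` and `total_cycles` are names invented for the example.

```python
from typing import NamedTuple

class Instr(NamedTuple):
    dest: str      # register written
    srcs: tuple    # registers read

def count_stalls(program, stall=2):
    """Add `stall` cycles whenever an instruction reads the register
    that the instruction immediately before it writes."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev.dest in cur.srcs:
            stalls += stall
    return stalls

def total_cycles(program, stages=4):
    # Ideal pipelined time plus the stall cycles.
    return stages + (len(program) - 1) + count_stalls(program)

program = [
    Instr("r1", ("r2", "r3")),   # r1 = r2 + r3
    Instr("r4", ("r1", "r5")),   # reads r1: must wait for the write back
    Instr("r6", ("r7", "r8")),   # independent of the others
]

print(count_stalls(program))     # 2
print(total_cycles(program))     # 8
```

Swapping the second and third instructions would not remove the stall in this adjacent-only model unless the gap covers the full two stages, which is the kind of scheduling question the techniques below try to answer in hardware.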
There are many techniques used to reduce the need to stall. They are a bit complicated to describe, but I will list them: 1. register forwarding (also store-to-load forwarding), 2. register renaming, 3. scoreboarding, 4. out-of-order execution, 5. speculative execution with rollback (and retirement). All modern CPUs use pretty much all of these techniques to implement super-scalar and pipelining. However, these techniques tend to have diminishing returns with respect to the number of pipelines in a processor before stalls become inevitable. In practice, CPU manufacturers have kept the number of pipelines in a single core small (around 4 when this was written).
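Of the techniques listed, register renaming is the easiest to show in miniature. The sketch below is a toy, not a description of any particular CPU: instructions are `(dest, src1, src2)` tuples, physical registers are just fresh names `p0, p1, ...`, and the `rename` function is hypothetical.

```python
import itertools

def rename(program):
    """Give every write its own fresh physical register.

    Reads are redirected to the most recent physical register holding
    that architectural name, so true dependencies are preserved while
    clashes caused only by reusing a register name disappear.
    """
    fresh = (f"p{i}" for i in itertools.count())
    mapping = {}                          # architectural -> physical
    renamed = []
    for dest, *srcs in program:
        new_srcs = tuple(mapping.get(s, s) for s in srcs)
        mapping[dest] = next(fresh)
        renamed.append((mapping[dest], *new_srcs))
    return renamed

program = [
    ("r1", "r2", "r3"),   # r1 = r2 op r3
    ("r4", "r1", "r5"),   # true dependency on the first r1
    ("r1", "r6", "r7"),   # reuses the name r1, unrelated value
    ("r8", "r1", "r9"),   # depends on the second r1
]

for before, after in zip(program, rename(program)):
    print(before, "->", after)
```

After renaming, the third instruction writes `p2` instead of overwriting the register the second instruction still needs, so the first pair and the second pair no longer share any register and could be executed in parallel.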
Multi-core has nothing to do with any of these techniques. It basically rams two microprocessors together to implement symmetric multiprocessing on a single chip, sharing only those components which make sense to share (typically L3 cache and I/O). However, a technique that Intel calls "hyperthreading" is a method of trying to virtually implement the semantics of multi-core within the super-scalar framework of a single core. A single micro-architecture contains the registers of two (or more) virtual cores and fetches instructions from two (or more) different execution streams, but executes them on a common super-scalar system. The idea is that because the registers of the two streams cannot interfere with each other, there will tend to be more parallelism and therefore fewer stalls. So rather than simply executing two virtual core execution streams at half the speed each, it does better thanks to the overall reduction in stalls. This would seem to suggest that Intel could increase the number of pipelines. However, this technique has been found to be somewhat lacking in practical implementations. As it is integral to super-scalar techniques, though, I have mentioned it anyway.
Pipelining is the simultaneous execution of different stages of multiple instructions in the same cycle. It is based on splitting instruction processing into stages, with specialized units for each stage and registers for storing intermediate results.
Superscaling is the dispatching of multiple instructions (or microinstructions) to multiple execution units within the CPU. It is thus based on redundant units in the CPU.
Of course, these approaches can complement each other.