Why are there only four registers?
Why are there only four registers in the most common CPU (x86)? Wouldn't there be a huge increase in speed if more registers were added? When will more registers be added?
9 Answers
The x86 has always had more than four registers. Originally, it had CS, DS, ES, SS, AX, BX, CX, DX, SI, DI, BP, SP, IP, and Flags. Of those, seven (AX, BX, CX, DX, SI, DI, and BP) supported most general operations (addition, subtraction, etc.). BP and BX also supported use as "base" registers (i.e., to hold addresses for indirection). SI and DI could also be used as index registers, which are about the same as base registers, except that an instruction could generate an address from one base register and one index register, but NOT from two index registers or two base registers. At least in typical use, SP was devoted to acting as the stack pointer.
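The base/index rule above can be captured in a toy checker (a sketch of the 16-bit addressing constraint only, not a real instruction encoder):

```python
# On the 8086, a 16-bit effective address may combine at most one base
# register (BX or BP) with at most one index register (SI or DI).

BASE_REGS = {"BX", "BP"}
INDEX_REGS = {"SI", "DI"}

def valid_16bit_address(regs):
    """Return True if this register combination is encodable on the 8086."""
    bases = [r for r in regs if r in BASE_REGS]
    indexes = [r for r in regs if r in INDEX_REGS]
    # Every register must be a base or an index, with at most one of each.
    return (len(bases) + len(indexes) == len(regs)
            and len(bases) <= 1 and len(indexes) <= 1)

print(valid_16bit_address(["BX", "SI"]))  # True:  base + index, e.g. [BX+SI]
print(valid_16bit_address(["SI", "DI"]))  # False: two index registers
print(valid_16bit_address(["BX", "BP"]))  # False: two base registers
```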
Since then, the registers have gotten larger, more have been added, and some of them have become more versatile, so (for example) you can now use any 2 general-purpose registers in 2-register addressing modes. Somewhat strangely, two segment registers (FS and GS) were added in the 386, which also allowed 32-bit segments, which mostly rendered all the segment registers nearly irrelevant. They are sometimes used for thread-local storage.
I should also add that when you do multi-tasking, multi-threading, etc., lots of registers can have a pretty serious penalty -- since you don't know which registers are in use, when you do a context switch you have to save all the registers in one task, and load all the saved registers for the next task. In a CPU like the Itanium or the SPARC with 200+ registers, this can be rather slow. Recent SPARCs devote a fair amount of chip area to optimizing this, but their task switches are still relatively slow. It's even worse on the Itanium -- one reason it's less than impressive on typical server tasks, even though it blazes on scientific computing with (very) few task switches.
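A rough model of that context-switch penalty (an illustration only; the names are made up, not a real kernel interface): the switch must store every register of the outgoing task and reload every register of the incoming one, so the memory traffic grows linearly with register-file size.

```python
def context_switch(cpu_regs, next_task_state):
    """Save the current register file to memory, then load the next task's."""
    saved = dict(cpu_regs)             # one store per register
    cpu_regs.clear()
    cpu_regs.update(next_task_state)   # one load per register
    return saved

def switch_mem_ops(num_registers):
    """Memory operations per switch: a store and a load for each register."""
    return 2 * num_registers

print(switch_mem_ops(16))   # 32 ops for a 16-entry integer register file
print(switch_mem_ops(200))  # 400 ops for an Itanium/SPARC-class file
```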
Finally, of course, all this is really quite different from how a reasonably modern implementation of x86 works. Starting with the Pentium Pro, Intel decoupled the architectural registers (i.e., the ones that can be addressed in an instruction) from the implementation. To support concurrent, out of order execution, the Pentium Pro had (if memory serves) a set of 40 internal registers, and used "register renaming" so two (or more) of those might correspond to a given architectural register at a given time. For example, if you manipulate a register, then store it, load a different value, and manipulate that, the processor can detect that the load breaks the dependency chain between those two sets of instructions, so it can execute both of those manipulations simultaneously.
The Pentium Pro is now quite old, of course, and AMD has also been around for a while (though their designs are reasonably similar in this respect). While the details change with new processors, renaming capability that decouples the architectural registers from the physical ones is now more or less a fact of life.
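The renaming idea can be sketched in a few lines (a minimal model, not Intel's actual design; the class and register names here are invented for illustration): every write to an architectural register is given a fresh physical register, so a later reload of the same architectural register starts a new dependency chain.

```python
from itertools import count

class Renamer:
    """Toy register-rename table: architectural name -> current physical reg."""

    def __init__(self):
        self.free = count()   # endless supply of physical register numbers
        self.table = {}       # architectural -> physical mapping

    def write(self, arch_reg):
        """A write allocates a fresh physical register for arch_reg."""
        phys = f"p{next(self.free)}"
        self.table[arch_reg] = phys
        return phys

    def read(self, arch_reg):
        """A read sees whichever physical register currently backs arch_reg."""
        return self.table[arch_reg]

r = Renamer()
first = r.write("eax")   # manipulate eax, then store it
second = r.write("eax")  # load a different value into eax
# The two chains live in distinct physical registers, so the hardware
# can execute both manipulations at the same time.
print(first, second, first != second)  # p0 p1 True
```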
There are more than 4 nowadays. If you look at the history of the x86 architecture, you see that it has evolved from the 8086 instruction set. Intel has always wanted to keep some degree of backwards compatibility in its processor line, so all subsequent processors simply extended the original A, B, C, D registers to wider bit widths. The original segment registers can be used for general purposes today, since there aren't really segments anymore (this is an oversimplification, but roughly true). The new x64 architecture provides some extra registers as well.
x86 is really an 8-register machine (eax/ebx/ecx/edx/esi/edi/ebp/esp). You lose one of those to the stack pointer/base pointer, so in practical usage you get 7, which is a bit on the low side, but even some RISC machines have 8 (SuperH, and ARM in THUMB mode, because they have 16-bit instructions and more registers would be too long to encode!). For 64-bit code, you upgrade from 8 to 16 (they used some leftover bits in the instruction encoding, AFAIK).
Still, 8 registers is just enough to pipeline the CPU, which was perfect for the 486 and Pentium. Some other architectures, like the 6502/65816, died off in the early 32-bit era because you just can't make a fast in-order pipelined version (you only have 3 registers, and only 1 for general math, so everything causes a stall!). Once you get to the generation where all your registers are renamed and everything is out of order (Pentium 2 etc.), it doesn't really matter anymore: you won't get stalls if you reuse the same register over and over, and 8 registers is quite all right.
The other use for more registers is to keep loop constants in registers, and you don't need to on x86 because every instruction can do a memory load, so you can keep all your constants in memory. This is the one feature missing from RISCs (by definition), and while they make up for it by being easier to pipeline (your longest latency is 2 cycles instead of 3) and being slightly more superscalar, your code size still increases a bit...
There are some non-obvious costs to adding more registers. Your instructions get longer because you need more bits, which increases program size, which slows down your program if your code speed is limited by the memory bandwidth of reading instructions!
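The "more bits per operand" cost is simple arithmetic (a back-of-the-envelope illustration; each register operand needs at least ceil(log2(N)) bits, and a typical instruction names two or three operands):

```python
import math

def operand_bits(num_registers):
    """Minimum bits needed to name one of num_registers registers."""
    return math.ceil(math.log2(num_registers))

print(operand_bits(8))    # 3 bits: classic 32-bit x86
print(operand_bits(16))   # 4 bits: x86-64 (the extra bit lives in a prefix)
print(operand_bits(128))  # 7 bits: an Itanium-scale register file
```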
There's also the fact that the larger your register file is, the more multiplexer levels/general circuitry you have to go through to read a value, which increases latency, which can potentially reduce the clock speed!
This is why the conventional wisdom at the moment is that more than 32 registers is not really a good idea (not useful, especially on an out-of-order CPU), while 8 is on the low side (memory reads are still expensive!); why the ideal architecture is considered to be something like 75% RISC, 25% CISC; why ARM is popular (balanced just about right!) and almost all RISC architectures still have some CISC parts (address calculation in every memory op, 32-bit opcodes but not more!); and why Itanium failed (128-bit instruction bundles? 128 registers? no address calculation in memory ops?).
For all of these reasons, x86 hasn't been surpassed. Sure, the instruction encoding is totally insane, but aside from that, all the crazy reordering and renaming and speculative load-store machinery it uses to stay efficient are really useful features, and exactly what give it its edge over various simpler in-order designs such as the POWER6. Once you reorder and rename everything, all instruction sets are more or less the same anyway, so it's very hard to make a design that's actually faster, except in specific cases (GPUs, essentially). Once ARM CPUs get as fast as x86s, they will be just as crazy and complicated as the ones Intel puts out.
There are many architectures with more registers (ARM, PowerPC, etc.). At times they can achieve higher instruction throughput, because less work is done manipulating the stack and instructions may be shorter (no need to reference stack variables). The counterpoint is that function calls become more expensive, due to more register saving.
More registers don't necessarily make things faster; they make the CPU architecture more complicated, as the registers have to be close to other components and many instructions work only on specific registers.

But modern CPUs have more than four registers; off the top of my head there are AX, BX, CX, DX, SI, DI, BP, ... and then a CPU has internal registers, for instance the instruction pointer (program counter).
Well, there are more; those four are just special (they're the 'general-purpose' ones, I think). The reasons for all this, and why the rest aren't used as much:
The memory that registers use is really expensive to engineer in the CPU. Aside from the design difficulties in doing so, increasing the number of available registers makes CPU chips more expensive.
In addition:
Um..... (E/R)AX, (E/R)BX, (E/R)CX, (E/R)DX, (E/R)SI, (E/R)DI, (E/R)SP, (E/R)BP, (E/R)IP. I count that as more than 4. :)
It simply depends on architectural decisions. The Intel Itanium has 128 general-purpose and 128 floating-point registers, while Intel x86 has only 8 general-purpose registers and a stack of 8 floating-point registers.