If registers are so blazingly fast, why don't we have more of them?

Posted 2024-11-09 04:14:05

In 32bit, we had 8 "general purpose" registers. With 64bit, the amount doubles, but it seems independent of the 64bit change itself.
Now, if registers are so fast (no memory access), why aren't there more of them naturally? Shouldn't CPU builders work as many registers as possible into the CPU? What is the logical restriction that limits us to the number we have?

Comments (4)

独孤求败 2024-11-16 04:14:05

There are many reasons you don't just have a huge number of registers:

  • They're highly linked to most pipeline stages. For starters, you need to track their lifetime, and forward results back to previous stages. The complexity gets intractable very quickly, and the number of wires (literally) involved grows at the same rate. It's expensive on area, which ultimately means it's expensive on power, price and performance after a certain point.
  • It takes up instruction encoding space. 16 registers take up 4 bits each for source and destination, and another 4 if you have 3-operand instructions (e.g. ARM). That's an awful lot of instruction set encoding space taken up just to specify the registers. This eventually impacts decoding, code size and again complexity.
  • There are better ways to achieve the same result...

These days we really do have lots of registers - they're just not explicitly programmed. We have "register renaming". While you only access a small set (8-32 registers), they're actually backed by a much larger set (e.g. 64-256). The CPU then tracks the visibility of each register, and allocates them to the renamed set. For example, you can load, modify, then store to a register many times in a row, and have each of these operations actually performed independently depending on cache misses etc. In ARM:

ldr r0, [r4]
add r0, r0, #1
str r0, [r4]
ldr r0, [r5]
add r0, r0, #1
str r0, [r5]

Cortex A9 cores do register renaming, so the first load to "r0" actually goes to a renamed virtual register - let's call it "v0". The load, increment and store happen on "v0". Meanwhile, we also perform a load/modify/store to r0 again, but that'll get renamed to "v1" because this is an entirely independent sequence using r0. Let's say the load from the pointer in "r4" stalled due to a cache miss. That's ok - we don't need to wait for "r0" to be ready. Because it's renamed, we can run the next sequence with "v1" (also mapped to r0) - and perhaps that's a cache hit and we just had a huge performance win.

ldr v0, [v2]
add v0, v0, #1
str v0, [v2]
ldr v1, [v3]
add v1, v1, #1
str v1, [v3]
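
The renaming step described above can be sketched in a few lines of Python. This is only an illustrative model, not how the Cortex A9 (or any real core) implements it: the `rename` function, the tuple format for instructions and the `p0`, `p1`, ... tag names are all assumptions made for this example. Unlike the simplified listing above, it hands out a fresh tag on every register write (so the `add` gets its own tag too), which is closer to what renaming hardware typically does, and it never recycles tags.

```python
from itertools import count

def rename(instructions, arch_regs):
    """Map architectural registers to fresh physical tags (hypothetical model).

    instructions: list of (opcode, dest, sources); dest is None when the
    instruction writes no register (e.g. a store).
    """
    fresh = count()
    # Every architectural register starts out bound to its own physical tag.
    table = {reg: f"p{next(fresh)}" for reg in arch_regs}
    renamed = []
    for opcode, dest, sources in instructions:
        # Sources are looked up first, so they see the mapping that was
        # current before this instruction's own write.
        new_sources = tuple(table[s] for s in sources)
        new_dest = None
        if dest is not None:
            # Each write gets a brand-new tag; older readers keep the old one.
            table[dest] = f"p{next(fresh)}"
            new_dest = table[dest]
        renamed.append((opcode, new_dest, new_sources))
    return renamed

# The two load/modify/store sequences from the answer, both using r0.
program = [
    ("ldr", "r0", ("r4",)),
    ("add", "r0", ("r0",)),
    ("str", None, ("r0", "r4")),
    ("ldr", "r0", ("r5",)),
    ("add", "r0", ("r0",)),
    ("str", None, ("r0", "r5")),
]

renamed_program = rename(program, ["r0", "r4", "r5"])
for opcode, dest, sources in renamed_program:
    print(opcode, dest, sources)
```

After renaming, the first three instructions and the last three share no physical tags at all, which is what lets the second sequence run while the first is still waiting on its load.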

I think x86 is up to a gigantic number of renamed registers these days (ballpark 256). Exposing all of them directly in the instruction set would mean spending 8 bits each on the source and destination of every instruction, just to say which registers are involved. It would massively increase the number of wires needed across the core, and its size. So there's a sweet spot around 16-32 registers which most designers have settled on, and for out-of-order CPU designs, register renaming is the way to get the benefit of a larger set without that cost.
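
To put rough numbers on the encoding cost, here is a small Python sketch. The helper name `register_field_bits` is made up for this example, and it assumes a plain fixed-width binary field per operand, which real instruction sets do not always use.

```python
def register_field_bits(num_registers: int, operands: int) -> int:
    """Bits an instruction spends purely on naming its register operands."""
    # Smallest field width that can index num_registers distinct registers.
    bits_per_operand = (num_registers - 1).bit_length()
    return bits_per_operand * operands

for regs in (8, 16, 32, 256):
    print(regs, "registers:",
          register_field_bits(regs, 2), "bits for 2 operands,",
          register_field_bits(regs, 3), "bits for 3 operands")
```

With 16 registers and 3 operands that is 12 bits, a large share of a 32-bit instruction word; with 256 registers it would be 24 bits.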

Edit: a note on the importance of out-of-order (OOO) execution and register renaming here. Once you have OOO, the number of registers doesn't matter so much, because they're just "temporary tags" and get renamed to the much larger virtual register set. You don't want the number to be too small, because it gets difficult to write small code sequences. This is a problem for x86-32, because the limited 8 registers mean a lot of temporaries end up going through the stack, and the core needs extra logic to forward reads/writes to memory. If you don't have OOO, you're usually talking about a small core, in which case a large register set is a poor cost/performance trade-off.

So there's a natural sweet spot for register bank size which maxes out at about 32 architected registers for most classes of CPU. x86-32 has 8 registers and it's definitely too small. ARM went with 16 registers and it's a good compromise. 32 registers is slightly too many if anything - you end up not needing the last 10 or so.

None of this touches on the extra registers you get for SSE and other vector floating point coprocessors. Those make sense as an extra set because they run independently of the integer core, and don't grow the CPU's complexity exponentially.

亢潮 2024-11-16 04:14:05

We Do Have More of Them

Because almost every instruction must select 1, 2, or 3 architecturally visible registers, expanding the number of them would increase code size by several bits on each instruction and so reduce code density. It also increases the amount of context that must be saved as thread state, and partially saved in a function's activation record. These operations occur frequently. Pipeline interlocks must check a scoreboard for every register and this has quadratic time and space complexity. And perhaps the biggest reason is simply compatibility with the already-defined instruction set.

But it turns out, thanks to register renaming, we really do have lots of registers available, and we don't even need to save them. The CPU actually has many register sets, and it automatically switches between them as your code executes. It does this purely to get you more registers.

Example:

load  r1, a  # x = a
store r1, x
load  r1, b  # y = b
store r1, y

In an architecture that has only r0-r7, the code above may be rewritten automatically by the CPU as something like:

load  r1, a
store r1, x
load  r10, b
store r10, y

In this case r10 is a hidden register that is substituted for r1 temporarily. The CPU can tell that the value of r1 is never used again after the first store. This allows the first load to be delayed (even an on-chip cache hit usually takes several cycles) without requiring the delay of the second load or the second store.
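
The timing benefit can be illustrated with a toy dataflow model in Python. Everything here is an assumption for the sake of the sketch: the function `finish_times`, the latencies (a 100-cycle miss for the first load, 3 cycles for the second, 1 cycle per store), unlimited execution units, and no memory-ordering constraints. Without renaming, a write to r1 has to wait for the previous write and for earlier readers of r1; with renaming, only true read-after-write dependencies remain.

```python
def finish_times(program, renaming):
    """Completion time of each instruction in a toy out-of-order model.

    program: list of (name, dest, sources, latency) in program order.
    """
    ready = {}      # register -> time its most recent value becomes available
    last_read = {}  # register -> time the latest earlier reader is done with it
    finish = []
    for name, dest, sources, latency in program:
        # True dependency (read-after-write): wait for every source value.
        start = max((ready.get(s, 0) for s in sources), default=0)
        if dest is not None and not renaming:
            # False dependencies: the single copy of dest cannot be
            # overwritten until the earlier write and earlier reads are done.
            start = max(start, ready.get(dest, 0), last_read.get(dest, 0))
        end = start + latency
        for s in sources:
            last_read[s] = max(last_read.get(s, 0), end)
        if dest is not None:
            ready[dest] = end
        finish.append(end)
    return finish

# The four-instruction example above; latencies are made up for illustration.
program = [
    ("load a",  "r1", (),      100),  # assumed cache miss
    ("store x", None, ("r1",), 1),
    ("load b",  "r1", (),      3),    # assumed cache hit
    ("store y", None, ("r1",), 1),
]

with_renaming = finish_times(program, renaming=True)
without_renaming = finish_times(program, renaming=False)
print("with renaming:   ", with_renaming)
print("without renaming:", without_renaming)
```

In this model the second load and store finish long before the first pair when renaming is on, and are stuck behind it when renaming is off.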

赤濁 2024-11-16 04:14:05

They add registers all the time, but the new ones are often tied to special-purpose instructions (e.g. SIMD, SSE2, etc.) or require compiling for a specific CPU architecture, which lowers portability. Existing instructions often work on specific registers and couldn't take advantage of other registers even if they were available. Legacy instruction set and all.

绾颜 2024-11-16 04:14:05

To add a little interesting info here: having 8 same-sized registers lets opcodes line up neatly in hexadecimal notation. For example, on x86 the instruction push ax is opcode 0x50, and the push opcodes run up to 0x57 for the last register, di. Then pop ax starts at 0x58 and the pop opcodes run up to 0x5F for pop di, completing the 0x50-0x5F block of sixteen opcodes. Hexadecimal consistency is maintained with 8 registers per size.
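
As a quick illustration of that layout, the single-byte push and pop opcodes can be computed from a 3-bit register number. The Python helper names below are invented for this sketch; the register order (ax, cx, dx, bx, sp, bp, si, di) is the standard x86 register numbering.

```python
# x86 16-bit general-purpose registers in encoding order (register numbers 0-7).
REG16 = ["ax", "cx", "dx", "bx", "sp", "bp", "si", "di"]

def push_opcode(reg: str) -> int:
    """One-byte opcode for `push reg`: base 0x50 plus the register number."""
    return 0x50 + REG16.index(reg)

def pop_opcode(reg: str) -> int:
    """One-byte opcode for `pop reg`: base 0x58 plus the register number."""
    return 0x58 + REG16.index(reg)

for reg in REG16:
    print(f"push {reg}: {push_opcode(reg):#04x}   pop {reg}: {pop_opcode(reg):#04x}")
```

The low three bits of each opcode are the register number, which is why exactly eight registers fill each half of the 0x50-0x5F row.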
