Why is CPU register access fast?

Published 2024-09-14 15:49:47

Register variables are a well-known way to get fast access (register int i). But why are registers on the top of hierarchy (registers, cache, main memory, secondary memory)? What are all the things that make accessing registers so fast?

Comments (6)

破晓 2024-09-21 15:49:47

Registers are circuits which are literally wired directly to the ALU, which contains the circuits for arithmetic. Every clock cycle, the register unit of the CPU core can feed a half-dozen or so variables into the other circuits. Actually, the units within the datapath (ALU, etc.) can feed data to each other directly, via the bypass network, which in a way forms a hierarchy level above registers — but they still use register-numbers to address each other. (The control section of a fully pipelined CPU dynamically maps datapath units to register numbers.)

The register keyword in C does nothing useful and you shouldn't use it. The compiler decides what variables should be in registers and when.

装迷糊 2024-09-21 15:49:47

Registers are a core part of the CPU, and much of a CPU's instruction set will be tailored for working against registers rather than memory locations. Accessing a register's value typically requires very few clock cycles (likely just one). As soon as memory is accessed, things get more complex: cache controllers and memory buses get involved, and the operation takes considerably more time.

零度° 2024-09-21 15:49:47

Several factors lead to registers being faster than cache.

Direct vs. Indirect Addressing

First, registers are directly addressed based on bits in the instruction. Many ISAs encode the source register addresses in a constant location, allowing them to be sent to the register file before the instruction has been decoded, speculating that one or both values will be used. The most common memory addressing modes indirect through a register. Because of the frequency of base+offset addressing, many implementations optimize the pipeline for this case. (Accessing the cache at different stages adds complexity.) Caches also use tagging and typically use set associativity, which tends to increase access latency. Not having to handle the possibility of a miss also reduces the complexity of register access.
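The contrast between direct register addressing and register-indirect (base+offset) memory addressing can be illustrated with a small C sketch; the assembly in the comments is AArch64-flavored and only indicative of what a compiler might emit.

```c
/* Register-register add: both operands are named directly by bits
   in the instruction word. */
long add_regs(long a, long b) {
    return a + b;            /* e.g. add x0, x0, x1 */
}

/* Base+offset load: the address must first be computed from a base
   register plus an immediate, then the cache is accessed. */
long load_field(const long *base) {
    return base[3];          /* e.g. ldr x0, [x0, #24] */
}
```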

Complicating Factors

Out-of-order implementations and ISAs with stacked or rotating registers (e.g., SPARC, Itanium, Xtensa) do rename registers. Specialized caches such as Todd Austin's Knapsack Cache (which directly indexes the cache with the offset) and some stack cache designs (e.g., using a small stack frame number and directly indexing a chunk of the specialized stack cache using that frame number and the offset) avoid the register read and addition. Signature caches associate a register name and offset with a small chunk of storage, providing lower latency for accesses to the lower members of a structure. Index prediction (e.g., XORing offset and base, avoiding carry propagation delay) can reduce latency (at the cost of handling mispredictions). One could also provide memory addresses earlier for simpler addressing modes like register indirect, but accessing the cache in two different pipeline stages adds complexity. (Itanium only provided register-indirect addressing, with optional post-increment.) Way prediction (and hit speculation in the case of direct-mapped caches) can reduce latency (again with misprediction handling costs). Scratchpad (a.k.a. tightly coupled) memories do not have tags or associativity, so they can be slightly faster (as well as have lower access energy), and once an access is determined to be to that region a miss is impossible. The contents of a Knapsack Cache can be treated as part of the context, with the context not considered ready until that cache is filled. In theory, registers could also be loaded lazily (particularly for Itanium stacked registers) and so would have to handle the possibility of a register miss.

Fixed vs. Variable Size

Registers are usually fixed size. This avoids the need to shift the data retrieved from aligned storage to place the actual least significant bit into its proper place for the execution unit. In addition, many load instructions sign extend the loaded value, which can add latency. (Zero extension is not dependent on the data value.)

Complicating Factors

Some ISAs do support sub-registers, notably x86 and zArchitecture (descended from S/360), which can require pre-shifting. One could also provide fully aligned loads at lower latency (likely at the cost of one cycle of extra latency for other loads); subword loads are common enough and the added latency small enough that special casing is not common. Sign extension latency could be hidden behind carry propagation latency; alternatively, sign prediction could be used (likely just speculative zero extension), or sign extension could be treated as a slow case. (Support for unaligned loads can further complicate cache access.)

Small Capacity

A typical register file for an in-order 64-bit RISC will be only about 256 bytes (32 8-byte registers). 8KiB is considered small for a modern cache. This means that multiplying the physical size and static power to increase speed has a much smaller effect on the total area and static power. Larger transistors have higher drive strength and other area-increasing design factors can improve speed.

Complicating Factors

Some ISAs have a large number of architected registers and may have very wide SIMD registers. In addition, some implementations add additional registers for renaming or to support multithreading. GPUs, which use SIMD and support multithreading, can have especially high capacity register files; GPU register files are also different from CPU register files in typically being single ported, accessing four times as many vector elements of one operand/result per cycle as can be used in execution (e.g., with 512-bit wide multiply-accumulate execution, reading 2KiB of each of three operands and writing 2KiB of the result).

Common Case Optimization

Because register access is intended to be the common case, area, power, and design effort are more profitably spent improving the performance of this function. If 5% of instructions use no source registers (direct jumps and calls, register clearing, etc.), 70% use one source register (simple loads, operations with an immediate, etc.), 25% use two source registers, and 75% use a destination register, while 50% access data memory (40% loads, 10% stores) — a rough approximation loosely based on data from SPEC CPU2000 for MIPS — then more than three times as many of the (more timing-critical) reads are from registers as from memory (1.3 per instruction vs. 0.4), so speeding up register reads benefits far more operations than speeding up memory reads would.

Complicating Factors

Not all processors are designed for "general purpose" workloads. E.g., a processor that uses in-memory vectors and targets dot-product performance, using registers for the vector start address, the vector length, and an accumulator, might have little reason to optimize register latency (extreme parallelism simplifies hiding latency), and memory bandwidth would be more important than register bandwidth.

Small Address Space

A last, somewhat minor advantage of registers is that the address space is small. This reduces the latency for address decode when indexing a storage array. One can conceive of address decode as a sequence of binary decisions (this half of a chunk of storage or the other). A typical cache SRAM array has about 256 wordlines (columns, index addresses) — 8 bits to decode — and the selection of the SRAM array will typically also involve address decode. A simple in-order RISC will typically have 32 registers — 5 bits to decode.

Complicating Factors

Modern high-performance processors can easily have 8-bit register addresses (Itanium had more than 128 general-purpose registers in a context, and higher-end out-of-order processors can have even more registers). This is also a less important consideration relative to those above, but it should not be ignored.

Conclusion

Many of the above considerations overlap, which is to be expected for an optimized design. If a particular function is expected to be common, not only will the implementation be optimized but the interface as well. Limiting flexibility (direct addressing, fixed size) naturally aids optimization and smaller is easier to make faster.

中性美 2024-09-21 15:49:47

Registers are essentially internal CPU memory. So accesses to registers are easier and quicker than any other kind of memory accesses.

丿*梦醉红颜 2024-09-21 15:49:47

Smaller memories are generally faster than larger ones; they can also require fewer bits to address. A 32-bit instruction word can hold three four-bit register addresses and have lots of room for the opcode and other things; one 32-bit memory address would completely fill up an instruction word leaving no room for anything else. Further, the time required to address a memory increases at a rate more than proportional to the log of the memory size. Accessing a word from a 4 gig memory space will take dozens if not hundreds of times longer than accessing one from a 16-word register file.

A machine that can handle most information requests from a small fast register file will be faster than one which uses a slower memory for everything.

坦然微笑 2024-09-21 15:49:47

As Bill mentioned, every microcontroller has a CPU with the basic components: an ALU, some RAM, and other forms of memory to assist its operations. The RAM is what you are referring to as main memory.

The ALU handles all of the arithmetic and logical operations. To perform these calculations on any operands, it loads the operands into registers, operates on them, and then your program accesses the stored result in these registers directly or indirectly.

Since registers are closest to the heart of the CPU (a.k.a. the brain of your processor), they are higher up in the chain, and of course operations performed directly on registers take the fewest clock cycles.
