Why do we use CPU registers in assembly instead of working on memory directly?

Posted 2024-08-23 06:05:15

I have a basic question about assembly.

Why do we bother doing arithmetic operations only on registers if they can work on memory as well?

For example both of the following cause (essentially) the same value to be calculated as an answer:

Snippet 1

.data
    var dd 00000400h

.code

    Start:
        add var,0000000Bh
        mov eax,var
        ;breakpoint: var = 0000040B
    End Start

Snippet 2

.code

    Start:
        mov eax,00000400h
        add eax,0000000bh
        ;breakpoint: eax = 0000040B
    End Start

From what I can see most texts and tutorials do arithmetic operations mostly on registers. Is it just faster to work with registers?

Comments (11)

下壹個目標 2024-08-30 06:05:16

Registers are accessed way faster than RAM, since you don't have to go over the "slow" memory bus!

左秋 2024-08-30 06:05:16

We use registers because they are fast. Usually, they operate at the CPU's speed. Registers and CPU cache are made with different technology / fabrication processes, and they are expensive. RAM, on the other hand, is cheap and 100 times slower.

瑕疵 2024-08-30 06:05:16

Generally speaking register arithmetic is much faster and much preferred. However there are some cases where the direct memory arithmetic is useful.
If all you want to do is increment a number in memory (and nothing else at least for a few million instructions) then a single direct memory arithmetic instruction is usually slightly faster than load/add/store.

Also if you are doing complex array operations you generally need a lot of registers to keep track of where you are and where your arrays end. On older architectures you could run out of registers really quickly, so the option of adding two bits of memory together without zapping any of your current registers was really useful.
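
A minimal sketch of the two alternatives in the question's MASM-style syntax (the variable name counter is hypothetical):

    .data
        counter dd 0

    .code
        ; single read-modify-write instruction: no register is disturbed
        inc counter

        ; equivalent load/add/store sequence: ties up eax for the duration
        mov eax,counter
        add eax,1
        mov counter,eax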

以歌曲疗慰 2024-08-30 06:05:16

Yes, it's much much much faster to use registers. Even if you only consider the physical distance from processor to register compared to proc to memory, you save a lot of time by not sending electrons so far, and that means you can run at a higher clock rate.

深爱成瘾 2024-08-30 06:05:16

Yes - also you can typically push/pop registers easily for calling procedures, handling interrupts, etc

笑,眼淚并存 2024-08-30 06:05:16

It's just that the instruction set will not allow you to do such complex operations:

add [0x40001234],[0x40002234]

You have to go through the registers.
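
One of the operands has to be staged in a register first. A minimal sketch (the choice of eax is arbitrary):

    mov eax,[0x40001234]    ; load one operand into a register
    add [0x40002234],eax    ; a mem,reg add is encodable; mem,mem is not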

や莫失莫忘 2024-08-30 06:05:15

If you look at computer architectures, you find a series of levels of memory. Those close to the CPU are fast, expensive (per bit), and therefore small, while at the other end you have big, slow and cheap memory devices. In a modern computer, these are typically something like:

 CPU registers (slightly complicated, but on the order of 1KB per core - there
                are different types of registers. You might have 16 64-bit
                general purpose registers plus a bunch of registers for special
                purposes)
 L1 cache (64KB per core)
 L2 cache (256KB per core)
 L3 cache (8MB)
 Main memory (8GB)
 HDD (1TB)
 The internet (big)

Over time, more and more levels of cache have been added - I can remember a time when CPUs didn't have any onboard caches, and I'm not even old! These days, HDDs come with onboard caches, and the internet is cached in any number of places: in memory, on the HDD, and maybe on caching proxy servers.

There is a dramatic (often orders of magnitude) decrease in bandwidth and increase in latency in each step away from the CPU. For example, a HDD might be able to be read at 100MB/s with a latency of 5ms (these numbers may not be exactly correct), while your main memory can read at 6.4GB/s with a latency of 9ns (six orders of magnitude!). Latency is a very important factor, as you don't want to keep the CPU waiting any longer than it has to (this is especially true for architectures with deep pipelines, but that's a discussion for another day).

The idea is that you will often be reusing the same data over and over again, so it makes sense to put it in a small fast cache for subsequent operations. This is referred to as temporal locality. Another important principle of locality is spatial locality, which says that memory locations near each other will likely be read at about the same time. It is for this reason that reading from RAM will cause a much larger block of RAM to be read and put into on-CPU cache. If it wasn't for these principles of locality, then any location in memory would have an equally likely chance of being read at any one time, so there would be no way to predict what will be accessed next, and all the levels of cache in the world will not improve speed. You might as well just use a hard drive, but I'm sure you know what it's like to have the computer come to a grinding halt when paging (which is basically using the HDD as an extension to RAM). It is conceptually possible to have no memory except for a hard drive (and many small devices have a single memory), but this would be painfully slow compared to what we're familiar with.

One other advantage of having registers (and only a small number of registers) is that it lets you have shorter instructions. If you have instructions that contain two (or more) 64 bit addresses, you are going to have some long instructions!
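
As a rough illustration of that last point (a sketch, not from the original answer; byte counts assume 32-bit mode): register operands are named with a few bits inside a single ModRM byte, while an absolute address alone occupies four bytes.

    add eax,ebx                      ; 2 bytes  (e.g. 01 D8)
    add dword ptr [12345678h],0Bh    ; 7 bytes  (83 05 78 56 34 12 0B)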

那片花海 2024-08-30 06:05:15

Because RAM is slow. Very slow.

Registers are placed inside the CPU, right next to the ALU so signals can travel almost instantly. They're also the fastest memory type but they take significant space so we can have only a limited number of them. Increasing the number of registers increases

  • die size
  • distance needed for signals to travel
  • work to save the context when switching between threads
  • number of bits in the instruction encoding

Read "If registers are so blazingly fast, why don't we have more of them?"

More commonly used data will be placed in caches for faster access. In the past, caches were very expensive, so they were an optional part that could be purchased separately and plugged into a socket outside the CPU. Nowadays they're often on the same die as the CPU. Caches are constructed from SRAM cells, which are smaller than register cells but maybe tens or hundreds of times slower.

Main memory is made from DRAM, which needs only one transistor per cell but is thousands of times slower than registers, hence we can't work with only DRAM in a high-performance system. However, some embedded systems do make use of the register file as main memory, so in those systems the registers are also the main memory.

More information: Can we have a computer with just registers as memory? (https://stackoverflow.com/q/3798730/995714)

漆黑的白昼 2024-08-30 06:05:15

Registers are much faster and also the operations that you can perform directly on memory are far more limited.
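
For example (a sketch reusing var from the question): add accepts a memory destination, but the two-operand form of imul only accepts a register destination.

    add var,eax      ; legal: add allows a memory destination
    imul eax,var     ; imul must target a register; "imul var,eax" is not encodable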

染年凉城似染瑾 2024-08-30 06:05:15

In reality, there are tiny implementations that do not separate registers from memory. They can expose this, for example, by having 512 bytes of RAM of which the first 64 bytes are exposed as 32 16-bit registers while remaining accessible as addressable RAM. Or, as another example, the MOS Technology 6502's "zero page" (the RAM range 0-255, accessed using a 1-byte address) was a poor substitute for registers, owing to the small number of real registers in the CPU. But this scales poorly to larger setups.

The advantages of registers are the following:

  1. They are the fastest. In a typical modern system they are faster than any cache, and much faster than DRAM. (In the example above, the RAM is likely SRAM. But SRAM of a few gigabytes is unusably expensive.) And they are close to the processor. The ratio between DRAM access time and register access time can reach values like 200 or even 1000. Even compared to the L1 cache, register access is typically 2-4 times faster.

  2. Their number is limited. A typical instruction set would become too bloated if every memory location could be addressed explicitly.

  3. Registers are specific to each CPU (core, hardware thread, hart) separately. (In systems where fixed RAM addresses serve the role of special registers, as e.g. zSeries does, this requires special remapping of such a service area in absolute addresses, separately for each core.)

  4. In the same manner as (3), registers are specific to each thread of a process, without any need to adjust code locations per thread.

  5. Registers (relatively easily) allow specific optimizations, such as register renaming. This would be too complex if memory addresses were used.

Additionally, there are registers that could not be implemented in a separate block of RAM, because accessing RAM itself requires them to change. I mean the "execution phase" register in the simplest CPU designs, which takes values like "instruction fetch phase", "instruction decode phase", "ALU phase", "data write phase" and so on, plus this register's equivalents in more complicated (pipelined, out-of-order) designs; also the various buffer registers involved in bus access, and so on. But such registers are not visible to the programmer, so you likely did not mean them.

燃情 2024-08-30 06:05:15

x86, like pretty much every other "normal" CPU you might learn assembly for, is a register machine¹. There are other ways to design something that you can program (e.g. a Turing machine that moves along a logical "tape" in memory, or the Game of Life), but register machines have proven to be basically the only way to go for high performance.

https://www.realworldtech.com/architecture-basics/2/ covers possible alternatives like accumulator or stack machines, which are also obsolete now, although it omits CISCs like x86, which can be either load-store or register-memory: x86 instructions can actually be reg,mem; reg,reg; or even mem,reg (or with an immediate source).
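
Concretely, the allowed forms look like this (a sketch reusing var from the question):

    add eax,var     ; reg,mem
    add var,eax     ; mem,reg
    add eax,ebx     ; reg,reg
    add eax,0Bh     ; reg,imm (immediate source)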

Footnote 1: The abstract model of computation called a register machine doesn't distinguish between registers and memory; what it calls registers are more like memory in real computers. I say "register machine" here to mean a machine with multiple general-purpose registers, as opposed to just one accumulator, or a stack machine or whatever. Most x86 instructions have 2 explicit operands (but it varies), up to one of which can be memory. Even microcontrollers like 6502 that can only really do math into one accumulator register almost invariably have some other registers (e.g. for pointers or indices), unlike true toy ISAs like Marie or LMC that are extremely inefficient to program for because you need to keep storing and reloading different things into the accumulator, and can't even keep an array index or loop counter anywhere that you can use it directly.


Since x86 was designed to use registers, you can't really avoid them entirely, even if you wanted to and didn't care about performance.

Current x86 CPUs can read/write many more registers per clock cycle than memory locations.

For example, Intel Skylake can do two loads and one store from/to its 32KiB 8-way associative L1D cache per cycle (best case), but can read upwards of 10 registers per clock, and write 3 or 4 (plus EFLAGS).

Building an L1D cache with as many read/write ports as the register file would be prohibitively expensive (in transistor count/area and power usage), especially if you wanted to keep it as large as it is. It's probably just not physically possible to build something that can use memory the way x86 uses registers with the same performance.

Also, writing a register and then reading it again has essentially zero latency because the CPU detects this and forwards the result directly from the output of one execution unit to the input of another, bypassing the write-back stage. (See https://en.wikipedia.org/wiki/Classic_RISC_pipeline#Solution_A._Bypassing).

These result-forwarding connections between execution units are called the "bypass network" or "forwarding network", and it's much easier for the CPU to do this for a register design than if everything had to go into memory and back out. The CPU only has to check a 3 to 5 bit register number, instead of a 32-bit or 64-bit address, to detect cases where the output of one instruction is needed right away as the input for another operation. (And those register numbers are hard-coded into the machine code, so they're available right away.)

As others have mentioned, 3 or 4 bits to address a register make the machine-code format much more compact than if every instruction had absolute addresses.


See also https://en.wikipedia.org/wiki/Memory_hierarchy: you can think of registers as a small fast fixed-size memory space separate from main memory, where only direct absolute addressing is supported. (You can't "index" a register: given an integer N in one register, you can't get the contents of the Nth register with one insn.)

Registers are also private to a single CPU core, so out-of-order execution can do whatever it wants with them. With memory, it has to worry about what order things become visible to other CPU cores.

Having a fixed number of registers is part of what lets CPUs do register-renaming for out-of-order execution. Having the register-number available right away when an instruction is decoded also makes this easier: there's never a read or write to a not-yet-known register.

See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) for an explanation of register renaming, and a specific example (the later edits to the question / later parts of my answer showing the speedup from unrolling with multiple accumulators to hide FMA latency even though it reuses the same architectural register repeatedly).
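
A sketch of that unrolling idea in the question's syntax (esi as the array pointer, edi as the end pointer, and zeroed accumulators are all assumptions): splitting the sum across two architectural registers creates two independent dependency chains, which renaming lets the CPU keep in flight simultaneously. The payoff is largest for higher-latency operations like FMA, as in the linked answer, but the structure is the same.

    ; one accumulator: every add waits for the previous add's result
    SumLoop1:
        add eax,[esi]
        add eax,[esi+4]
        add esi,8
        cmp esi,edi
        jb SumLoop1

    ; two accumulators: the two add chains overlap in the pipeline
    SumLoop2:
        add eax,[esi]
        add edx,[esi+4]
        add esi,8
        cmp esi,edi
        jb SumLoop2
    add eax,edx         ; combine the partial sums at the end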


The store buffer with store forwarding does basically give you "memory renaming". A store/reload to a memory location is independent of earlier stores and loads to that location from within this core. (Can a speculatively executed CPU branch contain opcodes that access RAM?)

Repeated function calls with a stack-args calling convention, and/or returning a value by reference, are cases where the same bytes of stack memory can be reused multiple times.

The second store/reload can execute even if the first store is still waiting for its inputs. (I've tested this on Skylake, but IDK if I ever posted the results in an answer anywhere.)
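
A sketch of that situation (a hypothetical sequence; the stack slot at [esp] stands in for any reused address):

    mov [esp],eax     ; first store: may still be waiting for eax to be produced
    mov ebx,[esp]     ; reload, forwarded from the store buffer
    mov [esp],ecx     ; second store to the same bytes
    mov edx,[esp]     ; second reload: forwarded from the second store,
                      ;   independent of the first store/reload pair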
