64-bit assembly: when to use smaller-size registers

Posted 2024-11-18 14:17:26

I understand in x86_64 assembly there is for example the (64 bit) rax register, but it can also be accessed as a 32 bit register, eax, 16 bit, ax, and 8 bit, al. In what situation would I not just use the full 64 bits, and why, what advantage would there be?

As an example, with this simple hello world program:

section .data
msg: db "Hello World!", 0x0a, 0x00
len: equ $-msg

section .text
global start

start:
mov rax, 0x2000004      ; System call write = 4
mov rdi, 1              ; Write to standard out = 1
mov rsi, msg            ; The address of hello_world string
mov rdx, len            ; The size to write
syscall                 ; Invoke the kernel
mov rax, 0x2000001      ; System call number for exit = 1
mov rdi, 0              ; Exit success = 0
syscall                 ; Invoke the kernel

rdi and rdx, at least, only need 8 bits and not 64, right? But if I change them to dil and dl, respectively (their lower 8-bit equivalents), the program assembles and links but doesn't output anything.

However, it still works if I use eax, edi and edx, so should I use those rather than the full 64 bits? Why or why not?

笑，眼淚并存 2024-11-25 14:17:26

You are asking several questions here.

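If you load only the low 8 bits of a register, the rest of the register keeps its previous value. That can explain why your system calls got the wrong parameters.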
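
To make that concrete for the program in the question, here is a sketch of the same code with only the register widths changed. It assumes the macOS syscall numbers used in the question, and it assumes that rdi and rdx can hold leftover values when start is entered, which would account for the missing output with dil and dl.

```nasm
section .data
msg: db "Hello World!", 0x0a, 0x00
len: equ $-msg

section .text
global start

start:
; A byte-sized write such as "mov dil, 1" or "mov dl, len" changes bits 0-7
; only. Bits 8-63 keep whatever they held before, and the kernel reads all
; 64 bits of rdi and rdx as the file descriptor and the length.

mov eax, 0x2000004      ; write: a 32-bit write zeroes bits 32-63 of rax
mov edi, 1              ; rdi == 1 regardless of its previous contents
mov rsi, msg            ; an address is 64 bits wide, so keep the full register
mov edx, len            ; rdx == len regardless of its previous contents
syscall
mov eax, 0x2000001      ; exit
xor edi, edi            ; rdi == 0
syscall
```

If you really do want a byte-sized source, movzx edi, byte [some_byte] (some_byte being a hypothetical label) gives the same clean result as the 32-bit immediate moves above.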

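One reason to use 32 bits when that is all you need is that many instructions using EAX or EBX are one byte shorter than the same instructions using RAX or RBX. It may also mean that constants loaded into the register are shorter.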
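
As a rough illustration of the size difference, the sketch below lists a few instruction pairs with their standard encodings in the comments. The extra byte is the REX.W prefix (0x48) that selects 64-bit operands. The byte counts are the architectural encodings; an assembler is free to pick a shorter equivalent form, so treat the last pair as an assumption about what your assembler emits.

```nasm
bits 64

add eax, ebx            ; 01 D8                  2 bytes
add rax, rbx            ; 48 01 D8               3 bytes

xor eax, eax            ; 31 C0                  2 bytes, and rax ends up 0 anyway
xor rax, rax            ; 48 31 C0               3 bytes

add r8d, ebx            ; 41 01 D8               r8-r15 need a REX prefix at any width

mov eax, 1              ; B8 01 00 00 00         5 bytes
mov rax, 1              ; 48 C7 C0 01 00 00 00   7 bytes when emitted as a 64-bit move;
                        ; some assemblers shrink this to the 5-byte form above
```

Assembling with a listing file (for example nasm -f bin -l sizes.lst sizes.asm, where sizes.asm is a hypothetical file name) shows the bytes your assembler actually produced.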

The instruction set has evolved over a long time and has a lot of quirks!

在风中等你 2024-11-25 14:17:26

First and foremost would be when loading a smaller (e.g. 8-bit) value from memory (reading a char, working on a data structure, deserialising a network packet, etc.) into a register.

MOV AL, [0x1234]

versus

MOV RAX, [0x1234]
AND RAX, 0xFF
; assuming there are actually 8 accessible bytes at 0x1234.
; x86 is little-endian, so the byte at 0x1234 lands in the low 8 bits of RAX;
; SHR RAX, 56 would instead leave the byte from 0x123B.

Or, of course, writing said value back to memory.

(Edit, like 6 years later):

Since this keeps coming up:

MOV AL, [0x1234]

- only reads a single byte of memory at 0x1234 (the inverse, a store from AL, would only overwrite a single byte of memory)
- keeps whatever was in the other 56 bits of RAX
  - This makes the new value of RAX depend on its old value (a false dependency), so on many CPUs register renaming cannot treat the load as independent of earlier writes to RAX.

By contrast:

MOV RAX, [0x1234]

- reads 8 bytes of memory starting at 0x1234 (the inverse would overwrite 8 bytes of memory)
- overwrites all of RAX
- interprets those 8 bytes as a little-endian value, so the byte at 0x1234 becomes the low byte of RAX (multi-byte fields in network packets are often big-endian and need their bytes swapped)

Also important to note:

MOV EAX, [0x1234]

- reads 4 bytes of memory starting at 0x1234 (the inverse would overwrite 4 bytes of memory)
- overwrites all of RAX, but the high 32 bits will all be zero
  - see: Why do most x64 instructions zero the upper part of a 32 bit register

Then, as mentioned in the comments, there is:

MOVZX EAX, byte [0x1234]

- only reads a single byte of memory at 0x1234
- zero-extends the value to fill all of EAX (and thus RAX), which removes the dependency on the old value of RAX and lets register renaming do its job.
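
As a small sketch of the MOVZX idiom in context, here is a byte-summing loop. The routine name sum_bytes is hypothetical, and the argument registers assume the System V AMD64 calling convention (pointer in rdi, count in rsi, result in rax); on macOS the exported symbol would also need a leading underscore.

```nasm
bits 64
section .text
global sum_bytes

; Hypothetical helper: returns the sum of n bytes starting at buf.
;   rdi = buf, rsi = n, result in rax
sum_bytes:
    xor eax, eax            ; total = 0; the 32-bit write clears all of rax
    test rsi, rsi
    jz .done                ; nothing to add for an empty buffer
.next:
    movzx ecx, byte [rdi]   ; load one byte, zero-extended through all of rcx
    add rax, rcx            ; safe to add the full register: upper bits are 0
    inc rdi
    dec rsi
    jnz .next
.done:
    ret
```

Using mov cl, [rdi] here instead would leave bits 8-63 of rcx holding stale data, so the 64-bit add would produce garbage unless rcx had been zeroed first.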

In all of these cases, if you want to write from the 'A' register into memory you'd have to pick your width:

MOV [0x1234], AL   ; write a byte (8 bits)
MOV [0x1234], AX   ; write a word (16 bits)
MOV [0x1234], EAX  ; write a dword (32 bits)
MOV [0x1234], RAX  ; write a qword (64 bits)

思念绕指尖 2024-11-25 14:17:26

If you only need 32-bit registers, you can safely use them; this is fine in 64-bit mode. But if you only need 16-bit or 8-bit registers, try to avoid them, or always use movzx/movsx to zero- or sign-extend into the full register so the remaining bits do not keep stale data. On x86-64, writing a 32-bit register clears the upper 32 bits of the corresponding 64-bit register. The main purpose of this is to avoid false dependency chains.

Please refer to the relevant section - 3.4.1.1 - of The Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 1:

32-bit operands generate a 32-bit result, zero-extended to a 64-bit result in the destination general-purpose register
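
A short sketch of what that rule means in practice, next to the 8-bit and 16-bit behaviour. The constants are arbitrary values chosen for illustration, and the comments give the value of rax after each instruction runs in sequence.

```nasm
bits 64

mov rax, 0x1122334455667788
mov al, 0xFF                ; rax = 0x11223344556677FF   bits 8-63 unchanged
mov ax, 0xABCD              ; rax = 0x112233445566ABCD   bits 16-63 unchanged
mov eax, 0x12345678         ; rax = 0x0000000012345678   bits 32-63 zeroed
```

Only the 32-bit write produces a result that is independent of what rax held before.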

Breaking dependency chains allows independent instructions to execute in parallel and out of program order, using the out-of-order execution that Intel CPUs have implemented internally since the Pentium Pro in 1995.

A quote from the Intel® 64 and IA-32 Architectures Optimization Reference Manual, Section 3.5.1.8:

Code sequences that modifies partial register can experience some delay in its dependency chain, but can be avoided by using dependency breaking idioms. In processors based on Intel Core micro-architecture, a number of instructions can help clear execution dependency when software uses these instruction to clear register content to zero. Break dependencies on portions of registers between instructions by operating on 32-bit registers instead of partial registers. For moves, this can be accomplished with 32-bit moves or by using MOVZX.
Assembly/Compiler Coding Rule 37. (M impact, MH generality): Break dependencies on portions of registers between instructions by operating on 32-bit registers instead of partial registers. For moves, this can be accomplished with 32-bit moves or by using MOVZX.

On x86-64, MOVZX into a 32-bit register and MOV with 32-bit operands are equivalent in this respect: both write the whole 64-bit register, so both break the dependency on its previous value.

That's why your code can execute faster if you always clear the upper bits of the larger register when using a smaller one. When those bits are always cleared, there is no dependency on the previous value of the register, so the CPU can rename the register internally.

Register renaming is a technique used internally by a CPU that eliminates the false data dependencies arising from the reuse of registers by successive instructions that do not have any real data dependencies between them.
