In this compiler output, I'm trying to understand how the machine-code encoding of the nopw instruction works:
00000000004004d0 <main>:
4004d0: eb fe jmp 4004d0 <main>
4004d2: 66 66 66 66 66 2e 0f nopw %cs:0x0(%rax,%rax,1)
4004d9: 1f 84 00 00 00 00 00
There is some discussion about "nopw" at http://john.freml.in/amd64-nopl. Can anybody explain the meaning of the bytes at 4004d2-4004e0? From looking at the opcode list, it seems that the 66 .. codes are multi-byte expansions. I feel I could probably get a better answer here than I would by spending a few hours trying to grok the opcode list.
That asm output is from the following (insane) code in C, which optimizes down to a simple infinite loop:
long i = 0;
main() {
recurse();
}
recurse() {
i++;
recurse();
}
When compiled with gcc -O2, the compiler recognizes the infinite recursion and turns it into an infinite loop; it does this so well, in fact, that it actually loops in main() without calling the recurse() function.
editor's note: padding functions with NOPs isn't specific to infinite loops. Here's a set of functions with a range of lengths of NOPs, on the Godbolt compiler explorer.
The 0x66 bytes are an "Operand-Size Override" prefix. Having more than one of these is equivalent to having one.
The 0x2e is a 'null prefix' in 64-bit mode (it's a CS: segment override otherwise, which is why it shows up in the assembly mnemonic).
0x0f 0x1f is a 2-byte opcode for a NOP that takes a ModRM byte.
0x84 is the ModRM byte, which in this case codes for an addressing mode that uses 5 more bytes.
Some CPUs are slow to decode instructions with many prefixes (e.g. more than three), so a ModRM byte that specifies a SIB + disp32 is a much better way to use up an extra 5 bytes than five more prefix bytes.
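To make the byte-by-byte breakdown above concrete, here is a minimal C sketch that splits a long NOP into its fields and adds up the length. The struct `nop_fields` and the function `decode_long_nop` are hypothetical names invented for this illustration; the sketch only handles 0x66/0x2e prefixes followed by the 0f 1f /0 opcode in 64-bit mode, and is not a general x86 decoder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Field sizes of one long-NOP encoding (hypothetical helper type). */
struct nop_fields {
    size_t prefixes; /* leading 0x66 / 0x2e prefix bytes */
    size_t opcode;   /* 2 bytes: 0f 1f */
    size_t modrm;    /* 1 byte: ModRM */
    size_t sib;      /* 1 if ModRM selects a SIB byte, else 0 */
    size_t disp;     /* 0, 1 or 4 displacement bytes */
    size_t total;    /* sum of all of the above */
};

/* Returns 0 on success, -1 if the bytes are not a prefixed 0f 1f /0 NOP
 * or the buffer is too short. Assumes 64-bit mode. */
int decode_long_nop(const uint8_t *code, size_t len, struct nop_fields *out)
{
    assert(code != NULL && out != NULL);

    /* Skip operand-size (0x66) and CS-override (0x2e) prefixes. */
    size_t i = 0;
    while (i < len && (code[i] == 0x66 || code[i] == 0x2e))
        i++;

    /* Need the two opcode bytes plus a ModRM byte. */
    if (len - i < 3 || code[i] != 0x0f || code[i + 1] != 0x1f)
        return -1;

    uint8_t modrm = code[i + 2];
    unsigned mod = modrm >> 6;
    unsigned reg = (modrm >> 3) & 7;
    unsigned rm  = modrm & 7;
    if (reg != 0)               /* the multi-byte NOP is 0f 1f /0 */
        return -1;

    /* rm == 100b with a memory operand means a SIB byte follows. */
    size_t sib = (mod != 3 && rm == 4) ? 1 : 0;

    size_t disp = 0;
    if (mod == 1) {
        disp = 1;               /* disp8 */
    } else if (mod == 2) {
        disp = 4;               /* disp32 */
    } else if (mod == 0) {
        if (rm == 5) {
            disp = 4;           /* RIP-relative disp32 */
        } else if (sib) {
            if (len - i < 4)
                return -1;
            if ((code[i + 3] & 7) == 5)
                disp = 4;       /* SIB with no base register: disp32 */
        }
    }

    size_t total = i + 2 + 1 + sib + disp;
    if (total > len)
        return -1;

    out->prefixes = i;
    out->opcode = 2;
    out->modrm = 1;
    out->sib = sib;
    out->disp = disp;
    out->total = total;
    return 0;
}
```

Passing the 14 bytes from the listing (66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00) gives 6 prefix bytes, 2 opcode bytes, 1 ModRM byte, 1 SIB byte and 4 displacement bytes, 14 in total, which is exactly the distance from 4004d2 to 4004e0.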
Essentially, those bytes are one long NOP instruction that will never get executed anyway. It's in there to ensure that the next function is aligned on a 16-byte boundary, because the compiler emitted a .p2align 4 directive, so the assembler padded with a NOP. gcc's default for x86 is -falign-functions=16. For NOPs that will be executed, the optimal choice of long NOP depends on the microarchitecture. For a microarchitecture that chokes on many prefixes, like Intel Silvermont or AMD K8, two NOPs with 3 prefixes each might have decoded faster.
The blog article the question linked to (http://john.freml.in/amd64-nopl) explains why the compiler uses a complicated single NOP instruction instead of a bunch of single-byte 0x90 NOP instructions.
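The padding arithmetic behind a .p2align 4 directive can be sketched in a few lines of C. The helper names `p2align_bytes` and `padding_to_align` are made up for this example and are not part of any assembler API; the sketch only computes how many filler bytes are needed, not which NOP encodings the assembler picks.

```c
#include <assert.h>
#include <stdint.h>

/* .p2align n requests alignment to 2^n bytes (hypothetical helper). */
uint64_t p2align_bytes(unsigned n)
{
    assert(n < 64);
    return (uint64_t)1 << n;
}

/* Number of padding bytes needed to advance addr to the next multiple
 * of align; 0 if addr is already aligned. align must be a power of two. */
uint64_t padding_to_align(uint64_t addr, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (align - (addr & (align - 1))) & (align - 1);
}
```

For the listing in the question, the jmp ends at 0x4004d2, so `padding_to_align(0x4004d2, p2align_bytes(4))` is 14: the assembler has 14 bytes to fill before 0x4004e0, and here it filled them with a single 14-byte NOP.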
You can find the details on the instruction encoding in AMD's tech ref documents, mainly in the "AMD64 Architecture Programmer's Manual Volume 3: General Purpose and System Instructions". I'm sure Intel's technical references for the x64 architecture will have the same information (and might even be more understandable).
The assembler (not the compiler) pads code up to the next alignment boundary with the longest NOP instruction it can find that fits. This is what you're seeing.
I would guess this is just the branch-delay instruction.
I believe that the nopw is junk: i is never read in your program, so there is no need to increment it.