Finding the number of operands in an instruction from opcodes
I am planning on writing my own small disassembler. I want to decode the opcodes which I get upon reading the executable. I see the following opcodes:
69 62 2f 6c 64 2d 6c
which must correspond to:
imul $0x6c2d646c,0x2f(%edx),%esp
Now, the "imul" instruction can have either two or three operands. How do I figure this out from the opcodes I have there?
It's based on Intel's i386 instruction set.
Although the x86 instruction set is quite complex (it's CISC, after all), and I see many people here discouraging your attempts to understand it, I'll say the contrary: it can still be understood, and along the way you can learn why it is so complex and how Intel managed to extend it several times, all the way from the 8086 to modern processors.
x86 instructions use variable-length encoding, so they can be made up of multiple bytes. Each byte is there to encode different things, and some of them are optional (it is encoded in the opcode whether those optional fields are used or not).
For example, each opcode can be preceded by zero to four prefix bytes, which are optional. Usually you don't need to worry about them. They are used to change the size of operands, or as escape codes to the "second floor" of the opcode table with extended instructions of modern CPUs (MMX, SSE etc.).
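As a minimal sketch of the prefix step, the snippet below peels off any legacy prefix bytes that precede the opcode. The name split_prefixes and the four-prefix limit parameter are my own choices for illustration. The set lists only the classic one-byte prefixes (lock/rep, segment overrides, operand-size and address-size overrides); the 0F escape byte is left in place as part of the opcode.

```python
# Classic one-byte legacy prefixes of the i386 instruction set.
LEGACY_PREFIXES = {
    0xF0, 0xF2, 0xF3,                     # lock, repne, rep
    0x2E, 0x36, 0x3E, 0x26, 0x64, 0x65,   # segment overrides
    0x66,                                 # operand-size override
    0x67,                                 # address-size override
}

def split_prefixes(code: bytes, max_prefixes: int = 4):
    """Return (prefix_bytes, rest) for the instruction starting at code[0]."""
    n = 0
    while n < len(code) and n < max_prefixes and code[n] in LEGACY_PREFIXES:
        n += 1
    return code[:n], code[n:]

# 66 B8 34 12 is "mov $0x1234,%ax" when assembled for 32-bit mode.
prefixes, rest = split_prefixes(bytes.fromhex("66b83412"))
print(prefixes.hex(), rest.hex())  # 66 b83412
```

The bytes from the question start directly with the opcode 69, so for them the prefix part comes back empty.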
Then there is the actual opcode, which is usually one byte, but can be up to three bytes for extended instructions. If you use only the basic instruction set, you don't need to worry about those either.
Next, there's the so-called ModR/M byte (sometimes also called mode-reg-reg/mem), which encodes the addressing mode and operand types. It's used only by opcodes that have such operands. It has three bit fields: the two most significant bits (the Mod field) select the addressing mode, the next three bits (the reg field) encode a register, and the last three bits (the R/M field) encode either another register or a memory operand, depending on Mod.
After the ModR/M byte, there could be another optional byte (depending on the addressing mode) called SIB (Scale Index Base). It is used for more exotic addressing modes to encode the scaling factor (1x, 2x, 4x, 8x), the base address/register, and the index register used. It has a layout similar to the ModR/M byte: the first two bits from the left (most significant) encode the scale, and the next three and the last three bits encode the index and base registers, as the name suggests.
If there's any displacement used, it goes just after that. It may be 0, 1, 2 or 4 bytes long, depending on the addressing mode and execution mode (16-bit/32-bit/64-bit).
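The bit fields of these two bytes can be pulled apart with a few shifts and masks. This is only a sketch; split_modrm and split_sib are names I made up for it.

```python
def split_modrm(byte: int):
    """Split a ModR/M byte into its (mod, reg, rm) bit fields."""
    return (byte >> 6) & 0b11, (byte >> 3) & 0b111, byte & 0b111

def split_sib(byte: int):
    """Split a SIB byte into (scale_factor, index, base)."""
    scale_bits = (byte >> 6) & 0b11
    # The two scale bits are an exponent: 00 -> 1x, 01 -> 2x, 10 -> 4x, 11 -> 8x.
    return 1 << scale_bits, (byte >> 3) & 0b111, byte & 0b111

print(split_modrm(0x62))  # (1, 4, 2): the ModR/M byte from the question
print(split_sib(0x88))    # (4, 1, 0): scale 4x, index code 1, base code 0
```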
The last part is always the immediate data, if any. It can also be 0, 1, 2 or 4 bytes long.
So now that you know the overall format of x86 instructions, you just need to know what the encodings are for all those bytes. And there are some patterns, contrary to common belief.
For example, all register encodings follow a neat pattern, ACDB. That is, for 8-bit instructions, the lowest two bits of the register code encode the A, C, D and B registers, respectively:
00 = A register (accumulator)
01 = C register (counter)
10 = D register (data)
11 = B register (base)
I suspect that their 8-bit processors used just these four 8-bit registers, encoded this way.
Then, on 16-bit processors, they doubled this bank of registers and added one more bit in the register encoding to choose the bank, this way:
0 00 = AL, 0 01 = CL, 0 10 = DL, 0 11 = BL
1 00 = AH, 1 01 = CH, 1 10 = DH, 1 11 = BH
But now you can also choose to use both halves of these registers together, as full 16-bit registers. This is selected by the last bit of the opcode (the least significant, right-most one): if it's 0, this is an 8-bit instruction. But if this bit is set (that is, the opcode is an odd number), this is a 16-bit instruction. In this mode, the two low bits of the register code encode one of the ACDB registers, as before. The pattern stays the same, but they encode full 16-bit registers now. When the third bit of the register code (the highest one) is also set, they switch to a whole other bank of registers, called the index/pointer registers: SP (stack pointer), BP (base pointer), SI (source index) and DI (destination/data index). So the addressing is now as follows:
0 00 = AX, 0 01 = CX, 0 10 = DX, 0 11 = BX
1 00 = SP, 1 01 = BP, 1 10 = SI, 1 11 = DI
When introducing 32-bit CPUs, they doubled these banks again. But the pattern stays the same. Now the odd opcodes mean the 32-bit registers and the even opcodes, as before, the 8-bit registers. I'd call the odd opcodes the "long" versions, because whether the 16-bit or the 32-bit version is used depends on the CPU and its current mode of operation. When it operates in 16-bit mode, the odd ("long") opcodes mean 16-bit registers, but when it operates in 32-bit mode, the odd ("long") opcodes mean 32-bit registers. This can be flipped around by prefixing the whole instruction with the 66 prefix (operand-size override). The even opcodes (the "short" ones) are always 8-bit. So on a 32-bit CPU, the "long" register codes are:
0 00 = EAX, 0 01 = ECX, 0 10 = EDX, 0 11 = EBX
1 00 = ESP, 1 01 = EBP, 1 10 = ESI, 1 11 = EDI
As you can see, the ACDB pattern stays the same. Also the SP, BP, SI, DI pattern stays the same. It just uses the longer versions of the registers.
There are also some patterns in the opcodes. One of them I've already described (the even vs. odd = 8-bit "short" vs. 16/32-bit "long" stuff). You can see more of them in this opcode map I once made for quick reference and hand-assembling/disassembling:
[Figure: the author's x86 opcode map]
(It's not a full table yet; some of the opcodes are missing. Maybe I'll update it someday.)
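The register banks and the "short"/"long" opcode bit described above can be sketched as a small lookup. The function reg_name and its parameters are hypothetical names of mine, and the odd/even rule is the pattern described in this answer: it holds for many basic opcodes (for example 88 vs. 89 for mov), but not for every opcode in the map, so treat it as a rule of thumb.

```python
REG8 = ["al", "cl", "dl", "bl", "ah", "ch", "dh", "bh"]
REG16 = ["ax", "cx", "dx", "bx", "sp", "bp", "si", "di"]
REG32 = ["e" + name for name in REG16]

def reg_name(code: int, opcode: int, mode_bits: int = 32,
             has_66_prefix: bool = False) -> str:
    """Name the register for a 3-bit register code.

    Assumes the opcode follows the even = 8-bit, odd = "long" pattern.
    """
    if opcode & 1 == 0:
        return REG8[code]          # "short" opcodes are always 8-bit
    long_bits = mode_bits
    if has_66_prefix:              # the 66 prefix flips 16 <-> 32
        long_bits = 16 if mode_bits == 32 else 32
    return (REG32 if long_bits == 32 else REG16)[code]

print(reg_name(0b100, 0x89))                      # esp
print(reg_name(0b100, 0x88))                      # ah
print(reg_name(0b010, 0x89, has_66_prefix=True))  # dx
```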
As you can see, arithmetic and logic instructions are mostly located in the upper half of the table, and its left and right halves follow a similar layout. Data-moving instructions are in the lower half. All conditional jumps are in row 7*. There's also one full row, B*, reserved for the forms of the mov instruction that are a shorthand for loading immediate values (constants) into registers. They're all one-byte opcodes immediately followed by the immediate constant, because they encode the destination register in the opcode itself, in its three least significant bits (the right-most ones; this is what the column number in the table selects). They follow the same pattern for register encoding. And the fourth bit is the "short"/"long" selector.
You can see that your imul instruction is already in the table, exactly at the 69 position (huh.. ;J).
For many instructions, the bit just before the "short/long" bit encodes the order of operands: which one of the two registers encoded in the ModR/M byte is the source, and which one is the destination (this applies to instructions with two register operands).
As to the ModR/M byte's addressing mode field (Mod), here's how to interpret it:
11 is the simplest: it encodes register-to-register transfers. One register is encoded by the next three bits (the reg field), and the other register by the remaining three bits (the R/M field) of this byte.
01 means that a one-byte displacement follows this byte.
10 means the same, but the displacement is four bytes long (on 32-bit CPUs).
00 is the trickiest: it means indirect addressing or a simple displacement, depending on the contents of the R/M field.
If the SIB byte is present, it is signaled by the 100 bit pattern in the R/M bits. With Mod 00 there's also the R/M code 101 for the 32-bit displacement-only mode, which doesn't use the SIB byte at all.
Here's a summary of all these addressing modes (Mod and R/M bits, 32-bit addressing):
11 rrr = register to register (one in the R/M bits, the other in the reg bits)
00 rrr = [register]
00 100 = SIB byte present
00 101 = 32-bit displacement only (no SIB byte)
01 rrr = [register + disp8] (8-bit displacement after the ModR/M byte)
01 100 = SIB + disp8 (8-bit displacement after the SIB byte)
10 rrr = [register + disp32] (32-bit displacement after the ModR/M byte)
10 100 = SIB + disp32 (32-bit displacement after the SIB byte)
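For a disassembler, the practical use of these rules is knowing how many bytes the operand encoding takes, so you can find where the immediate (or the next instruction) starts. Here is a rough sketch for 32-bit addressing only; modrm_length is a name I invented. It also handles one case not spelled out above: with Mod 00 and a SIB byte whose base field is 101, there is no base register and a 32-bit displacement follows instead.

```python
def modrm_length(code: bytes) -> int:
    """Bytes used by ModR/M, optional SIB and displacement (32-bit addressing).

    code[0] must be the ModR/M byte.
    """
    modrm = code[0]
    mod, rm = modrm >> 6, modrm & 0b111
    if mod == 0b11:                 # register operand: nothing follows
        return 1
    length = 1
    disp = {0b00: 0, 0b01: 1, 0b10: 4}[mod]
    if rm == 0b100:                 # SIB byte follows
        length += 1
        sib_base = code[1] & 0b111
        if mod == 0b00 and sib_base == 0b101:
            disp = 4                # no base register, disp32 instead
    elif mod == 0b00 and rm == 0b101:
        disp = 4                    # displacement-only addressing
    return length + disp

print(modrm_length(bytes([0x62, 0x2F])))        # 2: ModR/M + disp8
print(modrm_length(bytes([0x44, 0x24, 0x08])))  # 3: ModR/M + SIB + disp8
```

For the question's bytes, the opcode (1 byte) plus these 2 bytes plus a 4-byte immediate gives the 7-byte instruction.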
So let's now decode your imul:
69 is its opcode. It encodes the version of imul that takes a full-size (here 32-bit) immediate. The 6B version takes an 8-bit immediate and sign-extends it. (They differ by bit 1 in the opcode, in case anyone asks.)
62 is the ModR/M byte. In binary it is 0110 0010, or 01 100 010. The first two bits (the Mod field) mean indirect addressing with an 8-bit displacement. The next three bits (the reg field) are 100 and encode the SP register (in this case ESP, since we're in 32-bit mode) as the destination register. The last three bits are the R/M field, and we have 010 there, which encodes the D register (in this case EDX) as the base register of the other (source) operand.
Now we expect an 8-bit displacement. And there it is: 2f is the displacement, a positive one (+47 in decimal).
The last part is the four bytes of the immediate constant, which this form of the imul instruction requires. In your case this is 6c 64 2d 6c, which read as a little-endian value is $6c2d646c.
And that's the way the cookie crumbles ;-J
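The walk-through above can be turned into a tiny decoder. This is a sketch under narrow assumptions: decode_imul_imm is a hypothetical helper that only handles the 69 and 6B opcodes in 32-bit mode with a [register + disp8] operand (Mod 01, no SIB byte), and it prints AT&T syntax to match the question.

```python
import struct

REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

def decode_imul_imm(code: bytes) -> str:
    """Decode imul with opcode 69 (imm32) or 6B (imm8) and a [reg + disp8] operand."""
    opcode, modrm = code[0], code[1]
    if opcode not in (0x69, 0x6B):
        raise ValueError("not a three-operand imul")
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 0b01 or rm == 0b100:
        raise NotImplementedError("sketch handles only [reg + disp8]")
    (disp,) = struct.unpack_from("<b", code, 2)      # signed 8-bit displacement
    if opcode == 0x69:
        (imm,) = struct.unpack_from("<I", code, 3)   # little-endian imm32
    else:
        (imm,) = struct.unpack_from("<b", code, 3)   # sign-extended imm8
    return f"imul ${imm:#x},{disp:#x}(%{REG32[rm]}),%{REG32[reg]}"

print(decode_imul_imm(bytes.fromhex("69622f6c642d6c")))
# imul $0x6c2d646c,0x2f(%edx),%esp
```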
The manuals do describe how to differentiate between one, two, or three operand versions.
F6/F7: one operand; 0F AF: two operands; 6B/69: three operands.
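In code, that rule could look roughly like this. The function name imul_operand_count is made up for the example. One detail not stated above, taken from the Intel manual: F6 and F7 are shared by a group of instructions, and it is the reg field of the ModR/M byte (value 5) that selects imul within that group.

```python
def imul_operand_count(code: bytes):
    """Return 1, 2 or 3 for an imul encoding, or None if it is not imul.

    Assumes code starts at the opcode (any prefixes already skipped).
    """
    if len(code) < 2:
        return None
    op = code[0]
    if op in (0xF6, 0xF7):
        reg_field = (code[1] >> 3) & 0b111
        return 1 if reg_field == 5 else None   # /5 selects imul in this group
    if op == 0x0F and code[1] == 0xAF:
        return 2
    if op in (0x69, 0x6B):
        return 3
    return None

print(imul_operand_count(bytes.fromhex("69622f6c642d6c")))  # 3
print(imul_operand_count(bytes.fromhex("0fafc3")))          # 2
print(imul_operand_count(bytes.fromhex("f7eb")))            # 1
```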
Some advice: first, get all the instruction set docs you can get your hands on. For this x86 case, try for some old 8088/86 manuals as well as more recent ones, both from Intel and from the wealth of opcode tables on the net. For one thing, the various interpretations and documents may have subtle errors or differences; for another, some folks may present the info in a different and more understandable way.
Second, if this is your first disassembler, I recommend avoiding x86; it is very hard. As your question implies, variable-length instruction sets are difficult: to make a remotely successful disassembler, you need to follow the code in execution order, not memory order. So your disassembler has to use some sort of scheme to not only decode and print instructions, but also decode jump instructions and tag their destination addresses as entry points into an instruction. ARM, for example, has a fixed instruction length, so you can write an ARM disassembler that starts at the beginning of RAM and disassembles each word straight through (assuming, of course, that it is not a mixture of ARM and Thumb code). Thumb (not Thumb-2) can be disassembled this way too, as there is only one flavor of 32-bit instruction and everything else is 16-bit, and that one flavor can be handled with a simple state machine, as its two 16-bit halves show up as a pair.
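To illustrate what following the code in execution order means, here is a toy sketch. It uses a worklist of entry points and a deliberately tiny decoder that knows only four real x86 opcodes (90 nop, C3 ret, EB jmp rel8, 74 je rel8). The function names and the structure are my own illustration, not a complete scheme.

```python
def decode(code: bytes, addr: int):
    """Toy decoder. Returns (length, branch_target_or_None, falls_through)."""
    op = code[addr]
    if op == 0x90:                       # nop
        return 1, None, True
    if op == 0xC3:                       # ret
        return 1, None, False
    if op in (0xEB, 0x74):               # jmp rel8 / je rel8
        rel = code[addr + 1]
        rel = rel - 256 if rel >= 128 else rel   # sign-extend the offset
        target = addr + 2 + rel          # relative to the next instruction
        return 2, target, op == 0x74     # only the conditional jump falls through
    raise ValueError(f"unknown opcode {op:#x} at {addr}")

def find_instruction_starts(code: bytes, entry: int = 0):
    """Collect instruction start offsets reachable from the entry point."""
    starts, work = set(), [entry]
    while work:
        addr = work.pop()
        while 0 <= addr < len(code) and addr not in starts:
            starts.add(addr)
            length, target, falls_through = decode(code, addr)
            if target is not None:
                work.append(target)      # tag the destination as an entry point
            if not falls_through:
                break
            addr += length
    return sorted(starts)

# jmp +2 ; two data bytes ; nop ; ret
program = bytes([0xEB, 0x02, 0xFF, 0xFF, 0x90, 0xC3])
print(find_instruction_starts(program))  # [0, 4, 5]
```

A straight memory-order sweep would try to decode the two FF data bytes as instructions; the worklist version never visits them because nothing branches there.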
You are not going to be able to disassemble everything (with a variable-length instruction set). Due to the nuances of hand coding, or intentional tactics to prevent disassembly, your up-front code that walks the program in execution order may hit what I would call a collision. Take your instruction above: say one path takes you to 0x69 as the entry point of an instruction, and you determine from it that this is a 7-byte instruction, but somewhere else there is a branch instruction whose destination computes to the 0x2f byte as the opcode of an instruction. Although very clever programming may pull something like that off, it is more likely that the disassembler has been led into disassembling data. For example, consider code that clears a condition flag, then branches if that flag is clear, and is immediately followed by data bytes.
The disassembler won't know the data is data, and without additional smarts it won't realize that the conditional branch is in fact an unconditional branch (there could be many instructions on different branch paths between the condition clear and the branch if condition clear), so it assumes the byte after the conditional branch is an instruction.
Lastly, I applaud your efforts. I often preach writing simple disassemblers (ones that assume the code is very short, intentionally crafted code) to learn an instruction set very well. This works if you don't put the disassembler into a situation where it has to follow execution order, so that it can go in memory order instead (basically, do not embed data between instructions; put it at the end or somewhere else, leaving only strings of instructions to be disassembled). Understanding the opcode decoding for an instruction set can make you much better at programming for that platform, in both low-level and high-level languages.
Short answer: Intel used to publish, and maybe still does, technical reference manuals for its processors. I still have my 8088/86 manuals, a hardware one for the electrical stuff and a software one for the instruction set and how it works. I have a 486 one and probably a 386 one. The snapshot in Igor's answer directly resembles an Intel manual. Because the instruction set has evolved so much over time, x86 is a difficult beast at best. At the same time, if the processor itself can wade through these bytes and execute them, you can write a program that does the same thing but decodes them instead. The difference is that you are likely not going to make a simulator, so any branches that are computed by the code rather than explicit in it will be invisible to you, and the destinations of those branches may not show up in your list of bytes to disassemble.
That is not a machine code instruction (which would consist of an opcode and zero or more operands).
That is part of a text string; read as ASCII, it translates as:
ib/ld-l
which obviously is part of the string "/lib/ld-linux.so.2".
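This is easy to check by reading the seven bytes from the question as ASCII. The snippet is just a quick sanity check, nothing more:

```python
raw = bytes.fromhex("69622f6c642d6c")    # the bytes from the question
text = raw.decode("ascii")
print(text)                              # ib/ld-l
print(text in "/lib/ld-linux.so.2")      # True
```

A check like this (are the bytes all printable, and do they sit in a data section?) is a cheap way for a disassembler to avoid decoding strings as instructions.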
If you don't feel like sifting through opcode tables and manuals, it always helps to learn from others' projects, like the open-source disassembler bea-engine. You might find that you don't even need to create your own anymore, depending on what you're doing it for.