Finding the number of operands in an instruction from its opcodes

Posted 2024-11-28 04:19:06 · 249 words · 4 views · 0 comments

I am planning on writing my own small disassembler. I want to decode the opcodes which I get upon reading the executable. I see the following opcodes:

69 62 2f 6c 64 2d 6c

which must correspond to:

imul   $0x6c2d646c,0x2f(%edx),%esp

Now, the "imul" instruction can have either two or three operands. How do I figure this out from the opcodes I have there?

It's based on Intel's i386 instruction set.

Comments (5)

善良天后 2024-12-05 04:19:06

Although the x86 instruction set is quite complex (it's CISC, after all), and I see many people here discouraging your attempts to understand it, I'll say the contrary: it can still be understood, and along the way you can learn why it is so complex and how Intel managed to extend it several times, all the way from the 8086 to modern processors.

x86 instructions use variable-length encoding, so they can be made up of multiple bytes. Each byte is there to encode different things, and some of them are optional (it is encoded in the opcode whether those optional fields are used or not).

For example, each opcode can be preceded by zero to four prefix bytes, which are optional. Usually you don't need to worry about them. They are used to change the size of operands, or as escape codes to the "second floor" of the opcode table with extended instructions of modern CPUs (MMX, SSE etc.).

Then there is the actual opcode, which is usually one byte, but can be up to three bytes for extended instructions. If you use only the basic instruction set, you don't need to worry about those either.

Next, there's the so called ModR/M byte (sometimes also called mode-reg-reg/mem), which encodes the addressing mode and operand types. It's used only by opcodes which do have any such operands. It has three bit fields:

  • First two bits (from the left, most significant ones) encode the addressing mode (4 possible bit combinations).
  • Next three bits encode the first register (8 possible bit combinations).
  • The last three bits can encode another register, or extend the addressing mode, depending on what's the setup of the first two bits.
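The three bit fields described above can be pulled apart with shifts and masks. This is a minimal sketch; the function name `split_modrm` is made up for illustration and is not taken from any existing disassembler.

```python
def split_modrm(byte):
    """Split a ModR/M byte into its (mod, reg, rm) bit fields."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("a ModR/M byte must fit in 8 bits")
    mod = (byte >> 6) & 0b11   # top two bits: addressing mode
    reg = (byte >> 3) & 0b111  # middle three bits: first register
    rm = byte & 0b111          # low three bits: other register or memory form
    return mod, reg, rm


# 0x62 is 01 100 010 in binary.
print(split_modrm(0x62))  # (1, 4, 2)
```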

After the ModR/M byte, there can be another optional byte (depending on the addressing mode) called SIB (Scale Index Base). It is used for more exotic addressing modes to encode the scaling factor (1x, 2x, 4x or 8x), the base address/register, and the index register used. It has a layout similar to the ModR/M byte, but the first two bits from the left (most significant) encode the scale, and the next three and the last three bits encode the index and base registers, as the name suggests.
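The SIB layout can be sketched the same way as ModR/M. The helper name `split_sib` is hypothetical; the two scale bits are turned into the multiplier by a left shift, since they encode a power of two.

```python
def split_sib(byte):
    """Split a SIB byte into (scale multiplier, index code, base code)."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("a SIB byte must fit in 8 bits")
    scale = 1 << ((byte >> 6) & 0b11)  # 00, 01, 10, 11 -> 1, 2, 4, 8
    index = (byte >> 3) & 0b111        # index register code
    base = byte & 0b111                # base register code
    return scale, index, base


# 0x88 is 10 001 000: scale 4, index code 1 (C register), base code 0 (A register).
print(split_sib(0x88))  # (4, 1, 0)
```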

If there's any displacement used, it goes just after that. It may be 0, 1, 2 or 4 bytes long, depending on the addressing mode and execution mode (16-bit/32-bit/64-bit).

The last one is always the immediate data, if any. It can be also 0, 1, 2 or 4 bytes long.

So now that you know the overall format of x86 instructions, you just need to know what the encodings are for all those bytes. And there are some patterns, contrary to common belief.

For example, all register encodings follow a neat pattern ACDB. That is, for 8-bit instructions, the lowest two bits of the register code encode the A, C, D and B registers, correspondingly:

00 = A register (accumulator)
01 = C register (counter)
10 = D register (data)
11 = B register (base)

I suspect that their 8-bit processors used just these four 8-bit registers encoded this way:

       second
      +---+---+
f     | 0 | 1 |          00 = A
i +---+---+---+          01 = C
r | 0 | A : C |          10 = D
s +---+ - + - +          11 = B
t | 1 | D : B |
  +---+---+---+

Then, on 16-bit processors, they doubled this bank of registers and added one more bit in the register encoding to choose the bank, this way:

       second                second         0 00  =  AL
      +----+----+           +----+----+     0 01  =  CL
f     | 0  | 1  |     f     | 0  | 1  |     0 10  =  DL
i +---+----+----+     i +---+----+----+     0 11  =  BL
r | 0 | AL : CL |     r | 0 | AH : CH |
s +---+ - -+ - -+     s +---+ - -+ - -+     1 00  =  AH
t | 1 | DL : BL |     t | 1 | DH : BH |     1 01  =  CH
  +---+----+----+       +---+----+----+     1 10  =  DH
    0 = BANK L              1 = BANK H      1 11  =  BH

But now you can also choose to use both halves of these registers together, as full 16-bit registers. This is done by the last bit of the opcode (the least significant, right-most bit): if it's 0, this is an 8-bit instruction. But if this bit is set (that is, the opcode is an odd number), this is a 16-bit instruction. In this mode, the two low bits of the register code encode one of the ACDB registers, as before. The pattern stays the same, but they now encode full 16-bit registers. When the third bit (the highest one) is also set, they switch to a whole other bank of registers, called index/pointer registers: SP (stack pointer), BP (base pointer), SI (source index), DI (destination/data index). So the addressing is now as follows:

       second                second         0 00  =  AX
      +----+----+           +----+----+     0 01  =  CX
f     | 0  | 1  |     f     | 0  | 1  |     0 10  =  DX
i +---+----+----+     i +---+----+----+     0 11  =  BX
r | 0 | AX : CX |     r | 0 | SP : BP |
s +---+ - -+ - -+     s +---+ - -+ - -+     1 00  =  SP
t | 1 | DX : BX |     t | 1 | SI : DI |     1 01  =  BP
  +---+----+----+       +---+----+----+     1 10  =  SI
    0 = BANK OF           1 = BANK OF       1 11  =  DI
  GENERAL-PURPOSE        POINTER/INDEX
     REGISTERS             REGISTERS

When introducing 32-bit CPUs, they doubled these banks again, but the pattern stays the same. Now the odd opcodes mean the 32-bit registers and the even opcodes, as before, the 8-bit registers. I'd call the odd opcodes the "long" versions, because whether the 16-bit or the 32-bit version is used depends on the CPU and its current mode of operation. When it operates in 16-bit mode, the odd ("long") opcodes mean 16-bit registers, but when it operates in 32-bit mode, the odd ("long") opcodes mean 32-bit registers. This can be flipped around by prefixing the whole instruction with the 66 prefix (operand size override). The even opcodes (the "short" ones) are always 8-bit. So on a 32-bit CPU, the register codes are:

0 00 = EAX      1 00 = ESP
0 01 = ECX      1 01 = EBP
0 10 = EDX      1 10 = ESI
0 11 = EBX      1 11 = EDI

As you can see, the ACDB pattern stays the same. The SP, BP, SI, DI pattern stays the same too. It just uses the longer versions of the registers.
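The register tables above translate directly into lookup lists indexed by the three-bit register code. The following is a rough sketch; `register_name` and its parameters are names invented here, with `wide` standing for the "long" (odd-opcode) form and `operand_size` for the 16/32-bit choice made by the CPU mode and the 66 prefix.

```python
REG8 = ["AL", "CL", "DL", "BL", "AH", "CH", "DH", "BH"]
REG16 = ["AX", "CX", "DX", "BX", "SP", "BP", "SI", "DI"]
REG32 = ["E" + name for name in REG16]  # EAX, ECX, EDX, EBX, ESP, EBP, ESI, EDI


def register_name(code, wide, operand_size=32):
    """Map a 3-bit register code to a name for the given operand width."""
    if not 0 <= code <= 7:
        raise ValueError("register code must be 0..7")
    if not wide:
        return REG8[code]       # "short" form: always 8-bit
    if operand_size == 16:
        return REG16[code]
    if operand_size == 32:
        return REG32[code]
    raise ValueError("operand_size must be 16 or 32")


print(register_name(0b100, wide=True))   # ESP
print(register_name(0b100, wide=False))  # AH
```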

There are also some patterns in the opcodes. One of them I've described already (the even vs. odd = 8-bit "short" vs. 16/32-bit "long" stuff). More of them you can see in this opcode map I've made once for quick referencing and hand-assembling/disassembling stuff:
[image: opcode map]
(It's not a full table yet, some of the opcodes are missing. Maybe I'll update it someday.)

As you can see, arithmetic & logic instructions are mostly located in the upper half of the table, and the left & right halves of it follow a similar layout. Data moving instructions are in the lower half. All branching instructions (conditional jumps) are in row 7*. There's also one full row, B*, reserved for the form of the mov instruction that is a shorthand for loading immediate values (constants) into registers. They're all one-byte opcodes immediately followed by the immediate constant, because they encode the destination register in the opcode itself, in its three least significant bits (the right-most ones; they're chosen by the column number in this table). They follow the same pattern for register encoding. And the fourth bit is the "short"/"long" selector.
You can see that your imul instruction is already in the table, exactly at the 69 position (huh.. ;J).
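To illustrate the B* row, here is a small sketch that decodes a mov-immediate instruction from its opcode alone. It assumes 32-bit mode with no prefixes, and `decode_mov_imm` is a hypothetical helper name, not part of any real library.

```python
REG8 = ["AL", "CL", "DL", "BL", "AH", "CH", "DH", "BH"]
REG32 = ["EAX", "ECX", "EDX", "EBX", "ESP", "EBP", "ESI", "EDI"]


def decode_mov_imm(code):
    """Decode a B0..BF mov-immediate; return (register, value, length)."""
    opcode = code[0]
    if opcode & 0xF0 != 0xB0:
        raise ValueError("not an opcode from the B* row")
    reg = opcode & 0b111           # low three bits: destination register
    wide = bool(opcode & 0b1000)   # fourth bit: "short" (8-bit) or "long" (32-bit)
    size = 4 if wide else 1
    imm_bytes = code[1:1 + size]
    if len(imm_bytes) != size:
        raise ValueError("truncated immediate")
    value = int.from_bytes(imm_bytes, "little")
    name = REG32[reg] if wide else REG8[reg]
    return name, value, 1 + size


print(decode_mov_imm(bytes.fromhex("b82a000000")))  # ('EAX', 42, 5)
print(decode_mov_imm(bytes.fromhex("b1ff")))        # ('CL', 255, 2)
```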

For many instructions, the bit just before the "short/long" bit, is to encode the order of operands: which one of the two registers encoded in the ModR/M byte is the source, and which one is the destination (this applies to the instructions with two register operands).
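As a sketch of those two low opcode bits, the function below extracts them. The name `direction_and_width` is made up; the mov opcodes 88 to 8B are used as the example because they differ only in these two bits.

```python
def direction_and_width(opcode):
    """Return (reg_is_destination, is_long) from the two low opcode bits."""
    is_long = bool(opcode & 0b01)             # bit 0: 8-bit vs 16/32-bit
    reg_is_destination = bool(opcode & 0b10)  # bit 1: which operand is written
    return reg_is_destination, is_long


# 88: mov r/m8, r8    89: mov r/m32, r32
# 8A: mov r8, r/m8    8B: mov r32, r/m32
for opcode in (0x88, 0x89, 0x8A, 0x8B):
    print(hex(opcode), direction_and_width(opcode))
```

This only holds for the instructions that follow the pattern; opcodes such as 69 and 6B use bit 1 for something else.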

As to the ModR/M byte's addressing mode field, here's how to interpret it:

  • 11 is the simplest: it encodes register-to-register transfers. One register is encoded by the three next bits (the reg field), and the other register by the other three bits (the R/M field) of this byte.
  • 01 means that after this byte, a one-byte displacement will be present.
  • 10 means the same, but the displacement used is four-byte (on 32-bit CPUs).
  • 00 is the trickiest: it means indirect addressing or a simple displacement, depending on the contents of the R/M field.

The presence of a SIB byte is signaled by the 100 bit pattern in the R/M bits. There's also the code 101 (with Mod 00) for the 32-bit displacement-only mode, which doesn't use the SIB byte at all.

Here's a summary of all these addressing modes:

Mod R/M
 11 rrr = register-register  (one encoded in `R/M` bits, the other one in `reg` bits).
 00 rrr = [ register ]       (except SP and BP, which are encoded in `SIB` byte)
 00 100 = SIB byte present
 00 101 = 32-bit displacement only (no `SIB` byte required)
 01 rrr = [ rrr + disp8 ]    (8-bit displacement after the `ModR/M` byte)
 01 100 = SIB + disp8
 10 rrr = [ rrr + disp32 ]   (except SP, which means that the `SIB` byte is used)
 10 100 = SIB + disp32
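The table above is enough to work out how many bytes the operand encoding occupies, which is what a disassembler needs in order to find the immediate and the next instruction. The sketch below assumes 32-bit addressing; `modrm_operand_length` is a hypothetical name. One case is not in the table and is added here from the Intel manuals: with Mod 00 and a SIB byte whose base field is 101, a disp32 follows and no base register is used.

```python
def modrm_operand_length(data):
    """Count the bytes taken by ModR/M, optional SIB and displacement."""
    if not data:
        raise ValueError("need at least the ModR/M byte")
    mod = data[0] >> 6
    rm = data[0] & 0b111
    length = 1                    # the ModR/M byte itself
    if mod == 0b11:
        return length             # register-register: nothing follows
    has_sib = rm == 0b100
    if has_sib:
        if len(data) < 2:
            raise ValueError("SIB byte missing")
        length += 1
    if mod == 0b01:
        length += 1               # disp8
    elif mod == 0b10:
        length += 4               # disp32
    elif rm == 0b101:
        length += 4               # Mod 00, R/M 101: disp32 only
    elif has_sib and (data[1] & 0b111) == 0b101:
        length += 4               # Mod 00, SIB base 101: disp32 (see lead-in)
    return length


print(modrm_operand_length(bytes([0x62, 0x2F])))  # 2
```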

So let's now decode your imul:

69 is its opcode. It encodes the version of imul that takes a full-size (16/32-bit) immediate. The 6B version takes an 8-bit immediate and sign-extends it. (They differ by bit 1 of the opcode, in case anyone asks.)

62 is the ModR/M byte. In binary it is 0110 0010, or 01 100 010. The first two bits (the Mod field) mean indirect addressing with an 8-bit displacement. The next three bits (the reg field) are 100 and encode the SP register (in this case ESP, since we're in 32-bit mode) as the destination register. The last three bits are the R/M field; we have 010 there, which encodes the D register (in this case EDX) as the base register of the memory (source) operand.

Now we expect an 8-bit displacement. And there it is: 2f is the displacement, a positive one (+47 in decimal).

The last part is four bytes of the immediate constant, which is required by the imul instruction. In your case this is 6c 64 2d 6c which in little-endian is $6c2d646c.
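Putting the steps of this walk-through together, here is a minimal sketch that decodes exactly this form of the instruction (opcode 69 with a `[reg + disp8]` memory operand, 32-bit mode, no prefixes) and prints it in the AT&T syntax used in the question. The function name `decode_imul_69` is made up, and every other addressing mode is deliberately rejected.

```python
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def decode_imul_69(code):
    """Decode `69 /r disp8 imm32` into an AT&T-style string."""
    if len(code) < 7 or code[0] != 0x69:
        raise ValueError("expected a 7-byte instruction starting with 69")
    modrm = code[1]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 0b111, modrm & 0b111
    if mod != 0b01 or rm == 0b100:
        raise NotImplementedError("sketch handles only [reg + disp8]")
    disp = int.from_bytes(code[2:3], "little", signed=True)  # disp8 is signed
    imm = int.from_bytes(code[3:7], "little")                # imm32, little-endian
    return f"imul ${imm:#x},{disp:#x}(%{REG32[rm]}),%{REG32[reg]}"


print(decode_imul_69(bytes.fromhex("69622f6c642d6c")))
# imul $0x6c2d646c,0x2f(%edx),%esp
```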

And that's the way the cookie crumbles ;-J

行至春深 2024-12-05 04:19:06

The manuals do describe how to differentiate between one, two, or three operand versions.

IMUL instruction

F6/F7: one operand; 0F AF: two operands; 6B/69: three operands.
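That opcode list maps directly onto a small classifier. A sketch, ignoring prefixes; `imul_operand_count` is a hypothetical name. One detail is added from the Intel manuals rather than from the list above: F6 and F7 are shared by several instructions, and it is the reg field of the ModR/M byte being 5 that selects IMUL.

```python
def imul_operand_count(code):
    """Return 1, 2 or 3 for an IMUL encoding, or None if it is not IMUL."""
    if len(code) >= 2 and code[0] in (0xF6, 0xF7):
        reg_field = (code[1] >> 3) & 0b111
        return 1 if reg_field == 5 else None  # F6/F7 with /5 is IMUL
    if len(code) >= 2 and code[0] == 0x0F and code[1] == 0xAF:
        return 2
    if code and code[0] in (0x69, 0x6B):
        return 3
    return None


print(imul_operand_count(bytes.fromhex("69622f6c642d6c")))  # 3
print(imul_operand_count(bytes.fromhex("0fafc3")))          # 2
print(imul_operand_count(bytes.fromhex("f7eb")))            # 1
```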

抠脚大汉 2024-12-05 04:19:06

Some advice: first, get all the instruction set docs you can get your hands on. For this x86 case, try some old 8088/86 manuals as well as more recent ones from Intel, plus the wealth of opcode tables on the net. For one thing, any single interpretation or document might have subtle errors or differences; for another, some folks may present the info in a different and more understandable way.

Second, if this is your first disassembler, I recommend avoiding x86; it is very hard. As your question implies, variable-length instruction sets are difficult: to make a remotely successful disassembler, you need to follow the code in execution order, not memory order. So your disassembler has to use some sort of scheme to not only decode and print instructions, but also decode jump instructions and tag destination addresses as entry points into an instruction. ARM, for example, has a fixed instruction length, so you can write an ARM disassembler that starts at the beginning of RAM and disassembles each word straight through (assuming, of course, that it is not a mixture of ARM and Thumb code). Thumb (not Thumb-2) can be disassembled this way as well, as there is only one flavor of 32-bit instruction, everything else is 16-bit, and that one flavor can be handled in a simple state machine because its two 16-bit halves show up as pairs.
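The "follow the code in execution order" scheme can be sketched with a worklist of entry points. To keep it short, the example below uses a made-up four-instruction toy encoding (the `NOP`, `JMP`, `JCC` and `RET` byte values are invented here and have nothing to do with x86); only the traversal idea carries over.

```python
# Toy encoding, invented for this sketch:
NOP, JMP, JCC, RET = 0x01, 0x02, 0x03, 0x04  # JMP/JCC are followed by a rel8


def trace_instructions(code, entry=0):
    """Return the sorted offsets of instructions reachable from `entry`."""
    seen = set()
    work = [entry]
    while work:
        pc = work.pop()
        if pc in seen or not 0 <= pc < len(code):
            continue
        seen.add(pc)
        op = code[pc]
        if op == NOP:
            work.append(pc + 1)          # falls through to the next byte
        elif op in (JMP, JCC) and pc + 1 < len(code):
            rel = int.from_bytes(code[pc + 1:pc + 2], "little", signed=True)
            work.append(pc + 2 + rel)    # tag the branch target as an entry point
            if op == JCC:
                work.append(pc + 2)      # a conditional branch may also fall through
        # RET and unknown bytes end this path
    return sorted(seen)


# A jump over two data bytes: a memory-order sweep would decode 0xFF as code.
program = bytes([JMP, 2, 0xFF, 0xFF, NOP, RET])
print(trace_instructions(program))  # [0, 4, 5]
```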

You are not going to be able to disassemble everything (with a variable-length instruction set). Due to nuances of some hand coding, or intentional tactics to prevent disassembly, your up-front code that walks the code in execution order may have what I would call a collision. Take your instruction above, for example. Say that one path takes you to 0x69 being the entry point into the instruction, and you determine from that that it is a 7-byte instruction; but say somewhere else there is a branch instruction whose destination computes to the 0x2f byte being the opcode of an instruction. Although very clever programming may pull something like that off, it is more likely that the disassembler has been led to disassemble data. For example:

clear condition flag
branch if condition flag clear
data

The disassembler won't know the data is data, and without additional smarts it won't realize that the conditional branch is in fact an unconditional branch (there could be many instructions on different branch paths between the condition clear and the branch-if-condition-clear), so it assumes the byte after the conditional branch is an instruction.

Lastly, I applaud your efforts. I often preach writing simple disassemblers (ones that assume the code is very short, intentionally crafted code) to learn an instruction set very well. It helps if you don't put the disassembler into a situation where it has to follow execution order and instead let it go in memory order (basically, do not embed data between instructions; put it at the end or somewhere else, leaving only strings of instructions to be disassembled). Understanding the opcode decoding for an instruction set can make you much better at programming for that platform, in both low-level and high-level languages.

Short answer: Intel used to publish, and maybe still does, technical reference manuals for its processors. I still have my 8088/86 manuals, a hardware one for the electrical stuff and a software one for the instruction set and how it works. I have a 486 one and probably a 386 one. The snapshot in Igor's answer directly resembles an Intel manual. Because the instruction set has evolved so much over time, x86 is a difficult beast at best. At the same time, if the processor itself can wade through these bytes and execute them, you can write a program that does the same thing but decodes them instead. The difference is that you are likely not going to make a simulator, so any branches that are computed by the code and not explicit in it you will not be able to see, and the destination of such a branch may not show up in your list of bytes to disassemble.

小镇女孩 2024-12-05 04:19:06

That is not a machine code instruction (which would consist of an opcode and zero or more operands).

That is part of a text string, it translates as:

$ echo -e "\x69\x62\x2f\x6c\x64\x2d\x6c"
ib/ld-l

which obviously is part of the string "/lib/ld-linux.so.2".
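The same check can be scripted, which is handy as a sanity test before feeding bytes to a decoder. A minimal sketch; `as_printable_ascii` is a made-up helper, and "every byte is printable ASCII" is only a heuristic, not proof that the bytes are data.

```python
def as_printable_ascii(raw):
    """Return the bytes as text if every byte is printable ASCII, else None."""
    if raw and all(0x20 <= b <= 0x7E for b in raw):
        return raw.decode("ascii")
    return None


suspect = bytes.fromhex("69 62 2f 6c 64 2d 6c")
text = as_printable_ascii(suspect)
print(text)                          # ib/ld-l
print(text in "/lib/ld-linux.so.2")  # True
```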

久夏青 2024-12-05 04:19:06

If you don't feel like sifting through opcode tables and manuals, it always helps to learn from others' projects, like the open source disassembler bea-engine. You might find that you don't even need to create your own anymore, depending on what you're doing it for.
