How do I write a disassembler?

Posted on 2024-07-22 13:38:31

I'm interested in writing an x86 disassembler as an educational project.

The only real resource I have found is Spiral Space's "How to write a disassembler". While this gives a nice high-level description of the various components of a disassembler, I'm interested in some more detailed resources. I've also taken a quick look at NASM's source code, but it is somewhat heavyweight to learn from.

I realize one of the major challenges of this project is the rather large x86 instruction set I'm going to have to handle. I'm also interested in basic structure, basic disassembler links, etc.

Can anyone point me to any detailed resources on writing an x86 disassembler?

Comments (5)

揽清风入怀 2024-07-29 13:38:31

Take a look at section 17.2 of the 80386 Programmer's Reference Manual (http://pdos.csail.mit.edu/6.828/2008/readings/i386/toc.htm). A disassembler is really just a glorified finite-state machine. The steps in disassembly are:

  1. Check if the current byte is an instruction prefix byte (F3, F2, or F0); if so, then you've got a REP/REPE/REPNE/LOCK prefix. Advance to the next byte.
  2. Check to see if the current byte is an address size byte (67). If so, decode addresses in the rest of the instruction in 16-bit mode if currently in 32-bit mode, or decode addresses in 32-bit mode if currently in 16-bit mode.
  3. Check to see if the current byte is an operand size byte (66). If so, decode operands, including immediates, as 16-bit if currently in 32-bit mode, or as 32-bit if currently in 16-bit mode.
  4. Check to see if the current byte is a segment override byte (2E, 36, 3E, 26, 64, or 65). If so, use the corresponding segment register for decoding addresses instead of the default segment register.
  5. Prefixes may appear in any order, so repeat steps 1-4 until the current byte is not a prefix. The next byte is then the opcode. If the opcode is 0F, it is an extended opcode: read the next byte as the extended opcode.
  6. Depending on the particular opcode, read in and decode a Mod R/M byte, a Scale Index Base (SIB) byte, a displacement (0, 1, 2, or 4 bytes), and/or an immediate value (0, 1, 2, or 4 bytes). The sizes of these fields depend on the opcode, the address size override, and the operand size override previously decoded.
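
Steps 1-4 can be sketched as a small loop over prefix bytes. This is only an illustration under stated assumptions: the function name `scan_prefixes`, the `state` dictionary, and its keys are made up for this example, and the prefix values are the ones listed in the steps above.

```python
# Prefix byte values come from the steps above; all names here are illustrative.
REP_LOCK = {0xF3: "REP/REPE", 0xF2: "REPNE", 0xF0: "LOCK"}
SEGMENT = {0x2E: "CS", 0x36: "SS", 0x3E: "DS", 0x26: "ES", 0x64: "FS", 0x65: "GS"}


def scan_prefixes(code: bytes, offset: int = 0, mode_bits: int = 32):
    """Consume prefix bytes and return (state, offset of the opcode byte)."""
    state = {
        "rep_lock": None,           # REP/REPE, REPNE or LOCK, if present
        "segment": None,            # segment override, if present
        "addr_bits": mode_bits,     # toggled by the 67 prefix
        "operand_bits": mode_bits,  # toggled by the 66 prefix
    }
    other = 16 if mode_bits == 32 else 32
    while offset < len(code):
        b = code[offset]
        if b in REP_LOCK:
            state["rep_lock"] = REP_LOCK[b]
        elif b == 0x67:
            state["addr_bits"] = other
        elif b == 0x66:
            state["operand_bits"] = other
        elif b in SEGMENT:
            state["segment"] = SEGMENT[b]
        else:
            break                   # not a prefix: this byte is the opcode
        offset += 1
    return state, offset


# 66 B8 34 12 is "mov ax, 0x1234" when assembled for 32-bit mode.
state, opcode_at = scan_prefixes(bytes([0x66, 0xB8, 0x34, 0x12]))
print(state["operand_bits"], opcode_at)
```

Looping rather than checking each prefix once keeps the sketch independent of the order in which prefixes appear.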

The opcode tells you the operation being performed. The arguments of the opcode can be decoded from the values of the Mod R/M, SIB, displacement, and immediate value. There are a lot of possibilities and a lot of special cases, due to the complex nature of x86. See the link above for a more thorough explanation.
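
As a rough illustration of how the Mod R/M byte drives the rest of the decoding, the sketch below splits the byte into its three fields and reports, for 32-bit addressing only, whether a SIB byte follows and how many displacement bytes to read. The function names and the returned dictionary are assumptions made for this example; SIB decoding itself, 16-bit addressing, and opcodes that use the `reg` field as an opcode extension are not handled.

```python
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def split_modrm(byte: int):
    """Split a Mod R/M byte into its (mod, reg, rm) bit fields."""
    mod = (byte >> 6) & 0b11
    reg = (byte >> 3) & 0b111
    rm = byte & 0b111
    return mod, reg, rm


def modrm_layout32(byte: int):
    """Describe what follows a Mod R/M byte under 32-bit addressing."""
    mod, reg, rm = split_modrm(byte)
    if mod == 3:
        # Register-direct form: no SIB byte, no displacement.
        return {"reg": REG32[reg], "rm": REG32[rm], "sib": False, "disp_bytes": 0}
    has_sib = rm == 4                 # rm == 100b selects a SIB byte
    if mod == 0:
        disp = 4 if rm == 5 else 0    # mod 00, rm 101b means a bare disp32
    elif mod == 1:
        disp = 1
    else:
        disp = 4
    if has_sib:
        operand = "[SIB]"
    elif mod == 0 and rm == 5:
        operand = "[disp32]"
    else:
        operand = "[" + REG32[rm] + ("+disp" if disp else "") + "]"
    return {"reg": REG32[reg], "rm": operand, "sib": has_sib, "disp_bytes": disp}


# 8B 45 08 is "mov eax, [ebp+8]"; 0x45 is its Mod R/M byte.
print(modrm_layout32(0x45))
```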

[旋木] 2024-07-29 13:38:31

I would recommend checking out some open source disassemblers, preferably distorm and especially "disOps (Instructions Sets DataBase)" (ctrl+find it on the page).

The documentation itself is full of juicy information about opcodes and instructions.

Quote from https://code.google.com/p/distorm/wiki/x86_x64_Machine_Code

80x86 Instruction:

A 80x86 instruction is divided to a number of elements:

  1. Instruction prefixes, affects the behaviour of the instruction's operation.
  2. Mandatory prefix used as an opcode byte for SSE instructions.
  3. Opcode bytes, could be one or more bytes (up to 3 whole bytes).
  4. ModR/M byte is optional and sometimes could contain a part of the opcode itself.
  5. SIB byte is optional and represents complex memory indirection forms.
  6. Displacement is optional and it is a value of a varying size of bytes (byte, word, long) and used as an offset.
  7. Immediate is optional and it is used as a general number value built from a varying size of bytes (byte, word, long).

The format looks as follows:

/-------------------------------------------------------------------------------------------------------------------------------------------\
|*Prefixes | *Mandatory Prefix | *REX Prefix | Opcode Bytes | *ModR/M | *SIB | *Displacement (1,2 or 4 bytes) | *Immediate (1,2 or 4 bytes) |
\-------------------------------------------------------------------------------------------------------------------------------------------/
* means the element is optional.
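
One way to mirror this layout in code is a record with one field per element, where the optional elements default to "absent". The class below is a hypothetical sketch, not diStorm's actual data structure; the name `RawInstruction` and its fields are assumptions chosen to follow the diagram.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RawInstruction:
    """One field per element of the layout above; optional ones may be absent."""
    prefixes: bytes = b""
    mandatory_prefix: Optional[int] = None
    rex: Optional[int] = None
    opcode: bytes = b""
    modrm: Optional[int] = None
    sib: Optional[int] = None
    displacement: bytes = b""
    immediate: bytes = b""

    def encode(self) -> bytes:
        """Concatenate the elements in the order shown in the diagram."""
        out = bytearray(self.prefixes)
        for single in (self.mandatory_prefix, self.rex):
            if single is not None:
                out.append(single)
        out += self.opcode
        for single in (self.modrm, self.sib):
            if single is not None:
                out.append(single)
        out += self.displacement + self.immediate
        return bytes(out)

    @property
    def length(self) -> int:
        """Total size of the instruction in bytes."""
        return len(self.encode())


# "mov eax, [ebp+8]": opcode 8B, Mod R/M 45, one displacement byte 08.
ins = RawInstruction(opcode=b"\x8b", modrm=0x45, displacement=b"\x08")
print(ins.encode().hex(), ins.length)
```

Keeping the raw elements separate from any text formatting makes it easier to test the decoder by re-encoding and comparing against the input bytes.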

The data structures and decoding phases are explained in https://code.google.com/p/distorm/wiki/diStorm_Internals

Quote:

Decoding Phases

  1. [Prefixes]
  2. [Fetch Opcode]
  3. [Filter Opcode]
  4. [Extract Operand(s)]
  5. [Text Formatting]
  6. [Hex Dump]
  7. [Decoded Instruction]

Each step is also explained there.


The original links are kept for historical reasons:

http://code.google.com/p/distorm/wiki/x86_x64_Machine_Code and http://code.google.com/p/distorm/wiki/diStorm_Internals

烦人精 2024-07-29 13:38:31

Start with some small program that has been assembled, and which gives you both the generated code and the instructions. Get yourself a reference with the instruction architecture, and work through some of the generated code with the architecture reference, by hand. You'll find that the instructions have a very stereotypical structure of inst op op op with varying number of operands. All you need to do is translate the hex or octal representation of the code to match the instructions; a little playing around will reveal it.

That process, automated, is the core of a disassembler. Ideally, you're probably going to want to construct an array of instruction structures internally (or externally, if the program is really large). You can then translate that array into the instructions in assembler format.
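
A toy version of that two-stage design might look like the sketch below: first decode bytes into a list of instruction records, then format the list as text. It assumes 32-bit mode and knows only four one-byte opcodes (`nop`, `ret`, `push r32`, `mov r32, imm32`); anything else is emitted as a `db` data byte. The names `Instruction`, `decode_all`, and `format_listing` are invented for this example.

```python
import struct
from dataclasses import dataclass

REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


@dataclass
class Instruction:
    address: int
    raw: bytes
    mnemonic: str
    operands: tuple


def decode_all(code: bytes, base: int = 0):
    """Stage 1: turn raw bytes into a list of Instruction records."""
    out = []
    i = 0
    while i < len(code):
        op = code[i]
        if op == 0x90:
            size, mnemonic, operands = 1, "nop", ()
        elif op == 0xC3:
            size, mnemonic, operands = 1, "ret", ()
        elif 0x50 <= op <= 0x57:                      # push r32
            size, mnemonic, operands = 1, "push", (REG32[op - 0x50],)
        elif 0xB8 <= op <= 0xBF and i + 5 <= len(code):  # mov r32, imm32
            imm = struct.unpack_from("<I", code, i + 1)[0]
            size, mnemonic, operands = 5, "mov", (REG32[op - 0xB8], hex(imm))
        else:                                         # unknown: emit a data byte
            size, mnemonic, operands = 1, "db", (hex(op),)
        out.append(Instruction(base + i, code[i:i + size], mnemonic, operands))
        i += size
    return out


def format_listing(instructions):
    """Stage 2: render the records as assembler-style text lines."""
    lines = []
    for ins in instructions:
        text = ins.mnemonic
        if ins.operands:
            text += " " + ", ".join(ins.operands)
        lines.append(f"{ins.address:08x}  {ins.raw.hex():<12} {text}")
    return lines


for line in format_listing(decode_all(bytes.fromhex("55b82a000000c3"))):
    print(line)
```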

痞味浪人 2024-07-29 13:38:31

You need a table of opcodes to load from.

The fundamental lookup data structure is a trie; however, a table will do well enough if you don't care much about speed.

To get the base opcode type, do a begins-with (prefix) match against the table.
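
Both lookup styles can be sketched in a few lines. The table below is a deliberately tiny, hand-picked sample, and the names `OPCODES`, `lookup_linear`, `build_trie`, and `lookup_trie` are assumptions for this example, not part of any existing tool. A real table would be far larger and would also carry operand descriptions.

```python
# Tiny sample table: opcode byte sequence -> mnemonic.
OPCODES = {
    b"\x90": "nop",
    b"\xc3": "ret",
    b"\xcc": "int3",
    b"\x0f\x05": "syscall",
    b"\x0f\x0b": "ud2",
    b"\x0f\xa2": "cpuid",
}


def lookup_linear(code: bytes, offset: int = 0):
    """Begins-with match: scan every table entry (simple, but slow)."""
    for pattern, mnemonic in OPCODES.items():
        if code.startswith(pattern, offset):
            return mnemonic, len(pattern)
    return None


def build_trie(table):
    """Build a nested-dict trie keyed by byte value; None marks an entry's end."""
    root = {}
    for pattern, mnemonic in table.items():
        node = root
        for b in pattern:
            node = node.setdefault(b, {})
        node[None] = mnemonic
    return root


def lookup_trie(trie, code: bytes, offset: int = 0):
    """Walk the trie one byte at a time, remembering the longest match."""
    node = trie
    best = None
    i = offset
    while i < len(code) and code[i] in node:
        node = node[code[i]]
        i += 1
        if None in node:
            best = (node[None], i - offset)
    return best


trie = build_trie(OPCODES)
print(lookup_linear(bytes.fromhex("0fa2")), lookup_trie(trie, bytes.fromhex("0fa2")))
```

The linear version relies on no table entry being a prefix of another; the trie version returns the longest match, so it would keep working if such entries were added.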

There are a few stock ways of decoding register arguments; however, there are enough special cases to require implementing most of them individually.

Since this is educational, have a look at ndisasm.

梨涡少年 2024-07-29 13:38:31

Check out the objdump sources. It's a great tool, it contains many opcode tables, and its sources can provide a nice base for making your own disassembler.
