
x86 decompilation resources

Posted on 2024-09-08 10:06:52

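I'd like to get a deeper understanding of the low-level processes of representing and running a program. I decided to do this by writing a program that parses and displays object file information (headers, sections, etc.). I'm nearly finished with that part. A natural extension is to decompile the remaining relevant data into assembly instructions. Initially, I'm going to focus on x86.

Where can I find resources related to this decompilation (binary -> ASM)? I've read that x86 machine code has a one-to-one correspondence with ASM, though I don't know the best reference from which to pull the conversion table.

Also, while I'm doing this, I'm interested in tracking any debugging information that is provided. Is there a reference for the format used for this information (assuming ELF and GCC with the -g option)?

Do you have any general recommendations? The goal here is to increase my understanding through a practical project.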


为你拒绝所有暧昧 2024-09-15 10:06:52

x86 has variable-length instructions (anywhere from 1 to 15 bytes), which makes it very difficult to disassemble. I would not recommend it if this is your first disassembler.

That said, the approach I take is to identify which bytes in the binary are the first byte of an instruction and to separate those from bytes that are the second or a later byte of an instruction, or data. Once you know that, you can start at the beginning of the binary and disassemble the instructions.

How do you tell instruction bytes from other bytes? You need to walk all possible execution paths. That sounds like a recursion problem, and it can be, but it doesn't have to be. Look at the interrupt vector table and/or all hardware entry points into the code. That gives you a short list of bytes known to start an instruction. A non-recursive approach is to make many passes over the binary: look at each byte that is marked as the start of an instruction and decode it just enough to know how many bytes it consumes. You also need to know whether it is an unconditional branch, conditional branch, return, call, etc. If it is not an unconditional branch or a return, you can assume the byte after this instruction is the first byte of the next instruction. Any time you encounter a branch or call of some sort, compute the destination address and add that byte to the list. Keep making passes until a pass adds no new bytes to the list.

You also need to check for conflicts: if, say, you find a byte that starts a 3-byte instruction, but one of the bytes right after it is also marked as the start of an instruction, then you have a problem. This happens with things like a conditional branch preceded by something that ensures it will never be taken, so its target is not really code. You rarely if ever see this in high-level code compiled to a binary, but hand-written assembler from the good old days, or code from folks who want to protect it, will do things like this.
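The multi-pass marking described above can be sketched roughly as follows. This is only an illustration, not a real x86 decoder: `decode` is a hypothetical helper that knows a handful of 32-bit x86 opcodes (`nop`, `ret`, `mov r32, imm32`, short `jmp`/`jcc`, near `call`/`jmp`) and raises on anything else. The function names and the sample bytes are assumptions made up for the example.

```python
import struct

def decode(code, pc):
    """Toy decoder: return (length, kind, branch_target_or_None)."""
    op = code[pc]
    if op == 0x90:                          # nop
        return 1, "plain", None
    if op == 0xC3:                          # ret
        return 1, "return", None
    if 0xB8 <= op <= 0xBF:                  # mov r32, imm32
        return 5, "plain", None
    if op == 0xEB or 0x70 <= op <= 0x7F:    # jmp rel8 / jcc rel8
        rel = struct.unpack_from("<b", code, pc + 1)[0]
        return 2, ("jump" if op == 0xEB else "cond"), pc + 2 + rel
    if op in (0xE8, 0xE9):                  # call rel32 / jmp rel32
        rel = struct.unpack_from("<i", code, pc + 1)[0]
        return 5, ("call" if op == 0xE8 else "jump"), pc + 5 + rel
    raise ValueError(f"opcode {op:#04x} at {pc:#x} is not in this toy table")

def find_instruction_starts(code, entry_points):
    """Repeat passes over the marked bytes until a pass adds nothing new."""
    starts = set(entry_points)
    changed = True
    while changed:
        changed = False
        for pc in sorted(starts):
            length, kind, target = decode(code, pc)
            new = set()
            # Anything except an unconditional jump or a return falls through.
            if kind not in ("jump", "return") and pc + length < len(code):
                new.add(pc + length)
            # Branches and calls also mark their destination.
            if target is not None and 0 <= target < len(code):
                new.add(target)
            if not new <= starts:
                starts |= new
                changed = True
    return starts

def find_overlaps(code, starts):
    """Report (outer, inner) pairs where a marked byte lies inside another instruction."""
    bad = []
    for pc in sorted(starts):
        length, _, _ = decode(code, pc)
        bad += [(pc, inner) for inner in range(pc + 1, pc + length) if inner in starts]
    return bad

# mov eax,1 ; jz +1 ; nop ; jmp +2 ; <2 data bytes> ; ret
sample = bytes.fromhex("b801000000 7401 90 eb02 dead c3")
print(sorted(find_instruction_starts(sample, [0])))   # [0, 5, 7, 8, 12]
```

In the sample, the two bytes at offsets 10 and 11 are never marked, because no execution path reaches them. A real tool would replace `decode` with a full instruction-length decoder; the marking loop itself stays the same.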

Unfortunately, if all you have is the binary, you won't get a perfect disassembly of a variable-length instruction set. Some branch destinations are computed at runtime, and hand-coded assembly sometimes modifies the stack before a return to change what code executes next. If that is the only path to a piece of code, you likely won't find it programmatically unless you go so far as to simulate the code. And even with simulation you won't cover all execution paths.

With a fixed-length instruction set such as ARM (so long as it is pure ARM and not a mixture of ARM and Thumb), you can simply start at the beginning of the binary and disassemble until you run out of words. You might disassemble a data word into a valid, invalid, or unlikely-to-be-used instruction, but that is fine.
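For comparison, a linear sweep over fixed-width words needs no path walking at all. The sketch below assumes little-endian 32-bit ARM words. The `KNOWN` table holds just two encodings for illustration (`mov r0, r0` and `bx lr`) and everything else is printed as a raw `.word`; a real disassembler would decode every word properly. The function name and base address are made up for the example.

```python
import struct

# Two ARM (A32) encodings used purely as illustration.
KNOWN = {
    0xE1A00000: "mov r0, r0",
    0xE12FFF1E: "bx lr",
}

def linear_sweep(code, base=0):
    """Yield (address, word, text) for every complete 32-bit word in `code`."""
    for offset in range(0, len(code) - 3, 4):
        (word,) = struct.unpack_from("<I", code, offset)
        yield base + offset, word, KNOWN.get(word, f".word 0x{word:08x}")

blob = struct.pack("<3I", 0xE1A00000, 0xDEADBEEF, 0xE12FFF1E)
for address, word, text in linear_sweep(blob, base=0x8000):
    print(f"{address:08x}: {word:08x}  {text}")
```

The data word in the middle simply shows up as a `.word` line (or, with a full decoder, as some unlikely instruction), and the sweep carries on at the next 4-byte boundary without losing synchronization.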

I wouldn't be surprised if somewhere in the ELF file there is something that indicates which parts of the binary are executable and which parts are data, maybe even enough that you don't have to walk the execution paths at all. I doubt objdump performs a task like that; it probably uses something in the ELF file.
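ELF does carry this information: each section header has a flags field, and the `SHF_EXECINSTR` bit (0x4) marks sections that contain machine instructions. Here is a minimal sketch of reading it with Python's `struct` module, assuming a little-endian ELF64 file with a section header table; the function name `executable_sections` is made up for this example, and 32-bit or big-endian files would need different layouts.

```python
import struct

SHF_EXECINSTR = 0x4                         # section holds executable instructions
EHDR = struct.Struct("<16sHHIQQQIHHHHHH")   # Elf64_Ehdr, little-endian
SHDR = struct.Struct("<IIQQQQIIQQ")         # Elf64_Shdr, little-endian

def executable_sections(data):
    """Return [(name, file_offset, size)] for sections flagged SHF_EXECINSTR."""
    if data[:4] != b"\x7fELF" or data[4] != 2 or data[5] != 1:
        raise ValueError("this sketch handles only little-endian ELF64")
    fields = EHDR.unpack_from(data, 0)
    e_shoff, e_shentsize, e_shnum, e_shstrndx = fields[6], fields[11], fields[12], fields[13]
    headers = [SHDR.unpack_from(data, e_shoff + i * e_shentsize) for i in range(e_shnum)]
    if not headers:
        return []
    str_off = headers[e_shstrndx][4]        # sh_offset of the section-name string table
    result = []
    for sh_name, _type, sh_flags, _addr, sh_offset, sh_size, *_rest in headers:
        if sh_flags & SHF_EXECINSTR:
            end = data.index(b"\0", str_off + sh_name)   # names are NUL-terminated
            name = data[str_off + sh_name:end].decode("ascii")
            result.append((name, sh_offset, sh_size))
    return result
```

Calling it on the raw bytes of an object file, for example `executable_sections(open(path, "rb").read())` with `path` pointing at some `.o` file, would typically list `.text` and any other code sections, which tells a disassembler where to start looking.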

The ELF file format is documented in many places. There is a basic structure, and vendors may add specific block types, which would be documented by the vendor.
