x86 assembly: how does a disassembler know how to break up instructions?

Posted on 2024-09-28 13:12:00

How does an x86 disassembler know where to break up the instructions?

I am looking at the 8088 instruction set. For example, the MOV instruction has 7 variations that range from 2 to 4 bytes. The encodings themselves seem to follow no particular order. This looks like another reason for "Why is x86 ugly?".

For example:

                        76543210  76543210  76543210  76543210
reg/mem to/from reg     100010dw  ||regr/m
imm to reg/mem          1100011w  ||000r/m  data      data (if w=1)
imm to reg              1011wreg  data      data (if w=1)
mem to accum            1010000w  addr-low  addrhigh
accum to mem            1010001w  addr-low  addrhigh
reg/mem to seg          10001110  ||0ssr/m
seg to reg/mem          10001100  ||0ssr/m

Legend:
||=mod {NO-DISP=0,DISP-LOW,DISP-HIGH,REG}
ss=seg enum{es=0,cs,ss,ds}
reg=enum{ax=0,cx,dx,bx,sp,bp,si,di} (if w=1), enum{al=0,cl,dl,bl,ah,ch,dh,bh} (if w=0)
r/m=reg or mem (mod=3 then REG, else mem)

Many instructions can overlap in the first byte:

                        76543210  76543210  76543210  76543210
push                    11111111  ||110r/m
inc                     1111111w  ||000r/m

The bitmasks appear to be assigned arbitrarily. How does a disassembler break the instructions apart?

This question is a subset of "How to write a disassembler".

你另情深 2024-10-05 13:12:00

Looking at my 8086/8088 User's Manual, Programmer's Reference (ISBN 1-55512-010-5), likely decades out of print: Appendix A shows the instruction decoding in opcode order, 0b00000000 through 0b11111111. It does not appear chaotic at all. ADD, SUB, AND, XOR, CMP, etc. are all grouped in such a way that a mux can use some of the opcode bits directly to route the inputs and outputs, while other bits select the operation the ALU performs on them.

For writing a disassembler, you want to use this kind of table, or an opcode chart, for the top-level sorting of instructions.
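As a rough illustration of that top-level sorting, here is a minimal Python sketch covering only the MOV encodings from the question. The table layout, the names `OPCODE_TABLE`, `modrm_disp_size` and `instruction_length`, and the immediate-size tags are my own assumptions for illustration, not anything taken from the manual. It shows that the first byte, plus the mod and r/m fields of the second byte where present, is enough to determine the instruction length.

```python
def modrm_disp_size(modrm):
    """Number of displacement bytes implied by an 8086 mod/reg/r-m byte."""
    mod = modrm >> 6
    rm = modrm & 0b111
    if mod == 0b00:
        return 2 if rm == 0b110 else 0  # mod=00, r/m=110 is a direct 16-bit address
    if mod == 0b01:
        return 1                        # 8-bit displacement
    if mod == 0b10:
        return 2                        # 16-bit displacement
    return 0                            # mod=11: register operand, no displacement

# (mask, match, description, has_modrm, immediate tag) -- hypothetical layout.
# Tags: "w0" = immediate sized by bit 0, "w3" = sized by bit 3, "addr" = 16-bit address.
OPCODE_TABLE = [
    (0b11111100, 0b10001000, "mov reg/mem <-> reg", True, None),
    (0b11111110, 0b11000110, "mov reg/mem, imm", True, "w0"),
    (0b11110000, 0b10110000, "mov reg, imm", False, "w3"),
    (0b11111100, 0b10100000, "mov accum <-> mem", False, "addr"),
    (0b11111101, 0b10001100, "mov reg/mem <-> seg", True, None),
]

def instruction_length(code, offset=0):
    """Classify the instruction at `offset` and return (description, length in bytes)."""
    opcode = code[offset]
    for mask, match, name, has_modrm, imm in OPCODE_TABLE:
        if (opcode & mask) != match:
            continue
        length = 1
        if has_modrm:
            length += 1 + modrm_disp_size(code[offset + 1])
        if imm == "w0":
            length += 2 if opcode & 0b0001 else 1
        elif imm == "w3":
            length += 2 if opcode & 0b1000 else 1
        elif imm == "addr":
            length += 2
        return name, length
    return "unknown", 1  # not in this tiny table

print(instruction_length(bytes([0x89, 0xD8])))        # mov ax, bx
print(instruction_length(bytes([0xB8, 0x34, 0x12])))  # mov ax, 0x1234
```

A real table would cover all 256 first-byte values; the point is only that each row is a mask and a match value, so overlapping-looking bit patterns are separated by checking the more specific bits.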

In your particular example, notice that whenever the first opcode byte is 0xFF, the three bits in the middle of the second byte tell you the rest of the story as to which instruction is which. All 8 combinations (one of which is undefined) are represented by, and easily decoded from, those 3 bits.
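A minimal sketch of that second-level decode, assuming the standard 8086 assignments for the 0xFE/0xFF group (the function name `decode_group` and the list names are mine):

```python
# Operation selected by bits 5..3 of the second byte when the first byte is 0xFF.
GROUP_FF = ["inc", "dec", "call near", "call far", "jmp near", "jmp far", "push", None]
# With first byte 0xFE (w=0) only the byte-sized inc and dec are defined.
GROUP_FE = ["inc", "dec", None, None, None, None, None, None]

def decode_group(opcode, modrm):
    """Return the mnemonic for a 0xFE/0xFF instruction, or '(undefined)'."""
    selector = (modrm >> 3) & 0b111  # the middle three bits of the second byte
    table = GROUP_FF if opcode == 0xFF else GROUP_FE
    return table[selector] or "(undefined)"

print(decode_group(0xFF, 0x36))  # second byte 00 110 110 -> push
print(decode_group(0xFF, 0x06))  # second byte 00 000 110 -> inc
```

So push and inc only look ambiguous if you stop reading after the first byte.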

Yes, the x86 instruction set is crazy. It has interesting and fun features, but considerably better instruction sets have been invented since. The only reason x86 has not gone the way of the 6502, for example, is momentum, not quality.

You should look at this one too:

How are hex sequence translated to assembly without ambiguity?

The way to disassemble this and any other variable-length instruction set is to do it in execution order. You will fail if you try to do it linearly in address order. Start with the vector table to get the entry addresses, then follow the instructions from each entry in address order, noting and following all branches, until you hit an unconditional branch, a return, or another instruction that terminates that string of instructions. Repeat this for every branch destination. That won't cover every possible instruction, as the code may compute addresses while executing (there is not much you can do about disassembling that).
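The procedure above can be sketched roughly as follows. This is a toy under stated assumptions: `decode` only understands a handful of real 8086 opcodes (nop, clc, ret, three short jumps and `mov reg16, imm16`), and the names `decode` and `disassemble` are mine. A real tool would plug in a full decoder.

```python
REG16 = ["ax", "cx", "dx", "bx", "sp", "bp", "si", "di"]
ONE_BYTE = {0x90: "nop", 0xF8: "clc", 0xC3: "ret"}
SHORT_JUMPS = {0xEB: "jmp", 0x73: "jnc", 0x74: "jz"}

def decode(code, addr):
    """Return (text, length, branch_target or None, falls_through)."""
    op = code[addr]
    if op in ONE_BYTE:
        return ONE_BYTE[op], 1, None, op != 0xC3  # ret ends the path
    if op in SHORT_JUMPS:
        rel = code[addr + 1]
        if rel >= 0x80:
            rel -= 0x100                          # sign-extend the 8-bit offset
        target = addr + 2 + rel
        return f"{SHORT_JUMPS[op]} {target:#06x}", 2, target, op != 0xEB
    if 0xB8 <= op <= 0xBF:
        imm = code[addr + 1] | (code[addr + 2] << 8)
        return f"mov {REG16[op & 7]}, {imm:#06x}", 3, None, True
    return f"db {op:#04x}", 1, None, False        # unknown byte: stop this path

def disassemble(code, entry_points):
    """Follow execution order from each entry point; return {address: text}."""
    listing = {}
    work = list(entry_points)
    while work:
        addr = work.pop()
        while 0 <= addr < len(code) and addr not in listing:
            try:
                text, length, target, falls_through = decode(code, addr)
            except IndexError:
                break                             # instruction runs off the image
            listing[addr] = text
            if target is not None:
                work.append(target)               # visit the branch destination later
            if not falls_through:
                break
            addr += length
    return dict(sorted(listing.items()))

# jmp over two data bytes, then mov / jz / nop / ret
program = bytes([0xEB, 0x02, 0xFF, 0xFF, 0xB8, 0x34, 0x12, 0x74, 0x01, 0x90, 0xC3])
for addr, text in disassemble(program, [0]).items():
    print(f"{addr:04x}: {text}")
```

The two 0xFF bytes at addresses 2 and 3 are never decoded, because no execution path reaches them; a linear pass in address order would have tried to decode them as instructions.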

If any of this code was hand-written, intentionally or accidentally, in a way that trips up a disassembler, you can expect collisions where the second or third byte of an instruction on one execution path appears to be the first byte of an instruction on a different execution path. For example: a clear-flag instruction, followed by a conditional branch taken if that flag is clear, followed by a byte of data, followed by a real instruction that is the branch destination. Yep, I have come across this. Your disassembler should trap it: you need checks that stop disassembling one or both of those execution paths when they collide. For complete disassembly, expect to support some sort of user input to exclude addresses as opcodes, and to let the user manually add valid opcode addresses from which you follow the execution path.
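One way to sketch the collision check, assuming the disassembler records each decoded instruction as a start address and a length (the function name `find_collisions` and that record format are my assumptions). The example bytes are hand-decoded from the pattern just described: `F8` (clc), `73 01` (jnc to address 4), `B8` (the data byte, which looks like the start of a 3-byte mov), `90` (nop), `C3` (ret).

```python
def find_collisions(instructions):
    """instructions maps start address -> length in bytes.

    Returns sorted (start, owner) pairs where an instruction start lies
    strictly inside another decoded instruction that begins at `owner`.
    """
    interior = {}
    for start, length in instructions.items():
        for addr in range(start + 1, start + length):
            interior[addr] = start
    return sorted((s, interior[s]) for s in instructions if s in interior)

# Fall-through path: 0 clc (1 byte), 1 jnc (2 bytes), 3 "mov ax, imm16" (3 bytes).
# Branch path from the jnc target: 4 nop (1 byte), 5 ret (1 byte).
decoded = {0: 1, 1: 2, 3: 3, 4: 1, 5: 1}
print(find_collisions(decoded))  # [(4, 3), (5, 3)]
```

Which of the two colliding paths to keep is a judgment call the check cannot make by itself; here the branch is always taken, so the instruction at address 3 is the bogus one, but in general that is where user input comes in.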

For fixed-length instruction sets you can easily disassemble in either address or execution order, your choice; address order from 0 to the end of memory is the easiest, of course. Don't error out on undefined instructions; just mark them as such and keep going, since some of those are data.
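A minimal sketch of that address-order pass. The instruction set here is entirely hypothetical (16-bit little-endian words, top four bits as the opcode, low twelve bits as the operand) and is not any real processor; it only illustrates marking undefined words and continuing.

```python
import struct

# Hypothetical fixed-length ISA: opcode in bits 15..12, operand in bits 11..0.
MNEMONICS = {0x0: "nop", 0x1: "load", 0x2: "store", 0x3: "add", 0xF: "halt"}

def linear_sweep(image):
    """Disassemble every 16-bit word in address order; never stop on junk."""
    lines = []
    for addr in range(0, len(image) - 1, 2):
        (word,) = struct.unpack_from("<H", image, addr)
        name = MNEMONICS.get(word >> 12)
        if name is None:
            # Undefined opcode: probably data, so mark it and keep going.
            lines.append((addr, f".word {word:#06x}  ; undefined"))
        else:
            lines.append((addr, f"{name} {word & 0x0FFF:#05x}"))
    return lines

for addr, text in linear_sweep(bytes.fromhex("341200f0efbe")):
    print(f"{addr:04x}: {text}")
```

Because every instruction is the same size, a bad word never shifts the alignment of the words after it, which is exactly what makes this approach unsafe for x86.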

x86 is definitely the LAST variable-length instruction set I would attempt to disassemble, and I have written many disassemblers. I have no desire to ever attempt that project. Start with some fixed-length ones like the PIC and ARM/Thumb. Try the MSP430 for variable word length, then maybe the 6502 (Asteroids, Asteroids Deluxe, Lunar Lander, etc.). That is maybe a week or two worth of evenings to cover the above and get a feel for it; then attack the x86 if the desire remains. If you limit yourself strictly to the 8088/8086 it is not so bad, but you need to make sure your tools are generating only those instructions and not getting into the 386-and-up instructions.

If push vs. inc is bothering you, definitely try something else first, like the MSP430.
