如何将十六进制序列毫无歧义地转换为汇编语言?
8B EC 56 8B F4 68 00 70 40 00 FF 15 BC 82 40
像上面这样的序列可以通过多种方式进行分段,每个段都可以翻译为相应的汇编指令,但是每个二进制可执行文件都有其唯一的确定的汇编,避免歧义的数学原理是什么?
更新
得票最多的答案实际上根本没有回答我的问题。
8B EC 56 8B F4 68 00 70 40 00 FF 15 BC 82 40
A senquence like above can be segmented in various ways,each segment can be translated to corresponding assembly instruction, but each binary executable has its only DEFINITE assembly ,what's the mathematical principle that avoids ambiguity?
UPDATE
The answer with most votes doesn't actually answer my question at all.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(9)
知道你的出发点。
换句话说,给定指令的特定起始字节,指令结束的位置是明确的,从而为您提供下一条指令的起始字节并允许您继续。给定一个任意的内存块,如果不知道第一条指令从哪里开始,就不可能将其分解为单独的指令。
从更数学的角度来看,不存在其字节是另一有效指令的前缀的有效指令。因此,如果
ab
有效,那么您就知道ab cd
不可能有效,因此ab
必须是一条指令,而cd
> 是下一条指令的开始。Knowing your starting point.
In other words, given a specific starting byte of an instruction, it is unambiguous where the instruction ends, thus giving you the starting byte of the next instruction and allowing you to continue. Given an arbitrary block of memory it is impossible to break it up into individual instructions without knowing where the first instruction begins.
From a more mathematical perspective, there is no valid instruction whose bytes are a prefix of another valid instruction. So if
ab
is valid, then you know thatab cd
cannot be valid soab
must be one instruction andcd
is the start of the next instruction.如果我正确理解您的问题,您正在尝试理解为什么
8B EC 56 8B F4 68 00 70 40 40 00 FF 15 BC 82 40
可以将EG分为EG,例如
8BEC 568BF4 68007040 00FF 15BC 8240
RATHERN,而不是
8B EC568B F4 680070704040404040 00FF 15BC 8240
这完全由您的架构的 ISA 指定。该文档准确地描述了如何从一系列字节唯一构造指令。
为了使 ISA 格式良好,单个字节系列最多可以对应于单个解码指令系列(如果存在无效指令,则可能会更少)。
为了更具体一点,让我们以 x86 为例:如果您想知道每个字节对应什么,请查看
您会看到,例如,以 00 开头的指令是一条添加指令(附加参数位于下一个字节中,具有特定的编码)。
您还会看到,某些值实际上是修改以下指令的前缀(0F - 扩展操作码空间的前缀,26、2E、36、3E、64、65、66、67、F0、F2、F3),以及其中一些根据下面的确切说明具有不同的含义。这些不是操作码,但它们可以改变操作码参数的编码,或者引入全新的操作码空间(例如SSE使用0F)。
总的来说,x86 编码非常复杂,感谢反汇编程序。
If I understand your question correctly, you're trying to understand why
8B EC 56 8B F4 68 00 70 40 00 FF 15 BC 82 40
Could be split e.g. as
8BEC 568BF4 68007040 00FF 15BC 8240
Rathern than say,
8B EC568B F4 68007040 00FF 15BC 8240
This is entirely specified by the ISA of your architecture. That document describes exactly how instructions are uniquely constructed from a series of bytes.
For the ISA to be well formed, a single series of bytes can correspond to at most a single series of decoded instructions (might be less, if there are invalid instructions).
To get a bit more concrete, lets take the x86 example: If you want to know what each byte corresponds to, have a look here.
You'll see that, e.g. an instruction starting with 00 is an add (additional parameters are in the next byte, with a specific encoding).
You'll also see that some values are actually prefixes that modify the following instruction (0F - prefix to extend the opcode space, 26, 2E, 36, 3E, 64, 65, 66, 67, F0, F2, F3), and that some of them take different meaning based on the exact following instruction. Those are not opcodes, but they can alter the encoding of the arguments of the opcode, or introduce a completely new opcode space (e.g. SSE uses 0F).
Overall, the x86 encoding is very complex, thanks for disassemblers.
首先,您必须区分RISC和CISC架构。
在 RISC 架构中,通常具有相同大小的指令,因此不会出现歧义。例如,您的 CPU 将为每条指令获取 4 个字节,并且由于它必须从某个地方开始(您的 CPU 不只是像您提供的序列那样,它肯定会有一个起点)一旦它有正确对齐不会出现任何问题。
CISC 指令集发生的情况本质上是相同的:从程序的入口点开始,它将根据您的操作码获取指令。它不需要知道如何在数学上区分歧义,因为它不会发生这样的情况:它只是不知道下一条指令有多长或上一条指令在哪里完成。
因此,询问如何分隔每条指令就像询问如何分隔中的每个单词一样
没有数学证明,但你知道哪些字母组合在一起是正确的,哪些字母没有意义。前面的句子包含“son”,但你知道它是从“is on”获得的。如果没有有意义的短语,你就无法这么说,但你的CPU只执行有意义的程序,那有什么意义呢?
因此,如果 CPU 可以处理前一个句子,它将找到第一个有意义的指令“the”,然后是“pen”、“is”、“on”,而“son”无论如何都不会被识别。
编辑:
需要澄清的是,在 CISC 架构中,您必须确保不存在歧义的唯一限制是避免指令是另一个指令的前缀。让我们假设一个由字母 az 而不是十六进制数字组成的有限字母表(仅出于实用目的)。
如果程序计数器指向
你可以知道
abb
是一条完整的指令。在这种情况下,ab
就不是一条有效的指令,否则CPU不知道在哪里停止,同时abbc
也不能是一条指令,或者它可能会产生问题。保持它,例如,您可以知道ca
是下一条指令,c
不能,cbc
都不是。您可以将此论证扩展到整个字符串。你会看到,如果CPU发现自己处于二进制的下一个字节指向指令的第一个字节的状态,并且没有指令是其他指令的前缀,那么在下一个状态下,程序计数器将指向下一条正确指令的第一个字节。
First of all you have to distinguish between RISC and CISC architectures.
In a RISC architecture you usually have instructions of the same size, so ambiguity cannot be presented. Your CPU will fetch for example 4 bytes for every instruction, and since it will have to start from somewhere (your CPU doesn't have just a sequence like the one you presented, it will have a starting point for sure) once that it has the right alignment no problem can occur.
What happens with a CISC instruction set is essentially the same: starting from the entry point of the program it will fetch instructions accordingly to your opcodes. It doesn't need to know how to matematically distinguish ambiguities since it won't happen that it just doesn't know how long is the next instruction or where the last one finished.
So asking how to separate every instruction is like asking how to separate every word in
There's not mathematical proof but you know which letters are correct together and which ones are not meaningful. The previous sentence contains "son" but you know that it is obtained from "is on". You wouldn't be able to say so without having a meaningful phrase, but your CPU only executes meaningful programs so what's the point?
So if the CPU could work on the previous sentence it will find the first senseful instruction "the", then "pen", "is", "on" and the "son" couldn't never be recognized anyway.
EDIT:
To be cleared, in CISC architectures, the only contraint you have to be sure not to have ambiguities is to avoid having an instruction that is a prefix of another. Let's assume a finite alphabet composed by letters a-z instead that hex numbers (just for practical purposes).
If the program counter points to
you can have that
abb
is a whole instruction. In that caseab
wouldn't be a valid instruction, otherwise the CPU couldn't know where to stop, at the same timeabbc
can't be an instruction too or it may create problems. Keeping it on you can have for example thatca
is the next instruction,c
couldn't andcbc
neither.You can extend this argumentation to the whole string. You will see that, if the CPU finds itself in a state in which the next byte of the binary points to the FIRST byte of an instruction, and there are no instruction that are prefixes of other instruction, then in the next state the program counter will point to the FIRST byte of the next, correct, instruction.
如果您在十六进制编辑器中打开二进制文件,复制部分数据并粘贴到反汇编程序中,您很可能不会复制完整的指令。让我们用你的例子..在Windows XP 32位SP3英语中,如果我汇编
8B EC 56 8B F4 68 00 70 40 00 FF 15 BC 82 40
我会得到:正如你所看到的与下面其他人的答案完全不同...
现在让我们假设不是组装
8B EC 56 8B F4 68 00 70 40 00 FF 15 BC 82 40
您在开头添加了
C0
操作码C0 8B EC 56 8B F4 68 00 70 40 00 FF 15 BC 82 40
现在检查一下单个字节在我们的指令中的作用
:你可以看到它被完全修改了!
处理器知道从哪里开始以及操作码指令要采用的指令参数。在第一个示例中,我们的第一个操作码是 8B,因此它知道它后面可以跟另一个字节。如果这个字节是
EC
,那么指令到这里结束,它的意思是mov ebp, esp
。在我们的第二个示例中,它以 C0 开头,后面可以跟另一个字节,表示另一条指令。那么
C08B
是指令,EC568BF4 68
是参数。想象一下,处理器内部有一个巨大的(但纳米级的)电路,其行为就像一系列 ifs(条件),根据十六进制代码(或操作码)的值,它知道“去哪里”或“如何表现” 。
If you open a binary in a hex editor, copy a portion of data and paste in a disassembler, you will very probably not copy a complete instruction. Let's use your example.. in a Windows XP 32bits SP3 English, if I assembly
8B EC 56 8B F4 68 00 70 40 00 FF 15 BC 82 40
I'll get:As you can see it assembled completely different then the answer of the other guys below...
Now let's pretend that instead of assembling
8B EC 56 8B F4 68 00 70 40 00 FF 15 BC 82 40
you added
C0
opcode at the beginningC0 8B EC 56 8B F4 68 00 70 40 00 FF 15 BC 82 40
Now check below what a single byte did in our instructions:
As you can see it was completely modified!
The processor know where to start and what arguments to the instruction to take by the opcode instruction. In the first example our first opcode was
8B
so it knows that it can be followed by another byte. If this byte isEC
so the instruction ends here, and it meansmov ebp, esp
.In our second example it start with
C0
and it can be followed by another byte, meaning another instruction. ThenC08B
is the instruction, andEC568BF4 68
is the argument.Imagine that inside the processor has a huge (but nano) circuit, that behave like a chain of ifs (conditions), that depending on the value of the hexcode (or opcode) it knows "where to go" or "how to behave".
您列出的序列恰好显示 1 个数字。在二进制中,它是
100010111110110001010110100010111111010001101000000000000111000001000000000000001111111100010101101111001000001001000000.十进制表示为
726522768938664460674442126658667072
。这些只是书写完全相同的值的不同方式。特定架构的 ISA 会将位划分为字段并赋予它们含义。大多数处理器都有很容易获得的手册,其中描述了分配给这些字段中每个位的含义。The sequence you listed shows exactly 1 number. In binary, it's
. In decimal, it's100010111110110001010110100010111111010001101000000000000111000001000000000000001111111100010101101111001000001001000000
726522768938664460674442126658667072
. These are all just different ways of writing exactly the same value. A particular architecture's ISA will divide the bits into fields and assign them meaning. Most processors have easy to get manuals that describe the meaning assigned to each of the bits in those fields.不要将线性尝试反汇编与代码的执行顺序混淆。二进制文件按照执行顺序进行解码,从已知位置开始。除了出于各种原因的故意黑客攻击之外,没有任何含糊之处。
尝试为可变字长指令集编写反汇编程序。最终,它必须按执行顺序完成,即使在那里,您也只能反汇编部分程序,因为某些分支可以基于运行时计算的地址。现代编译器生成的代码比旧的手工汇编的代码要好得多。例如,在老式的街机游戏中,有条件分支,前面有一条指令,保证只满足其中一个条件(为什么会在那里?我们永远不会知道),并且条件分支后面的数据类似于以下操作码这样你就会遇到与其他操作码的冲突。
试图击败反汇编程序的旧dos程序将具有自修改代码,一条指令会提前一到两条指令修改另一条指令,如果单步执行,则会发生修改,但如果全速运行,则指令已经在管道中获取,并且内存中修改/损坏的一个未被使用。非常巧妙的技巧。
无论如何,您的问题的答案是不要按线性顺序查看字节,而是从向量表中的复位和其他向量定义的地址开始按执行顺序查看它们。
Dont confuse linearly trying to disassemble with execution order of code. The binary is decoded in execution order, starting with a known location. Other than intentional hacks for various reasons there is no ambiguity.
Try writing a disassembler for a variable word length instruction set. At the end of the day it has to be done in execution order, and even there you can only disassemble some of the program as some branches can be based on addresses computed at run time. Modern compiler generated code is much better than older hand assembled code. In an old standup arcade game for example there are conditional branches preceeded by an instruction that guarantees that only one of the conditions is met (why was that in there? we will never know) and the data that follows the conditional branch resembles an opcode in such a way that you run into a collision with other opcodes.
Old dos programs trying to defeat disassemblers would have self modifying code, one instruction would modify another instruction one or two instructions ahead, if single stepped that modification would happen, but if run at full speed the instruction was already fetched in the pipeline, and the modified/broken one in memory was not used. pretty neat trick.
Anyway, the answer to your question is do not look at the bytes in linear order, look at them in execution order starting at the addresses defined by the reset and other vectors in the vector table.
听起来你的问题的答案是有点轻率的“知道你的起点”,但也许你想要更详细的东西。给定您的字符串:
AND 一个起点(假设 8B 是您的起点),对字节只有一种可能的解释。
因此,假设一个操作是 8B EC 56 8B (取决于您的操作长度等),那么下一个操作是 F4 68...在这种情况下,机器不可能尝试解释操作 56 8B F4 68 因为此时没有任何操作结束。
现在,如果您的起点是 56,那么您可以获得该组,但无法获得前面提到的任何一个。
内存的布局非常具体,起始点/跳转点是精确且无情的——它们和代码本身一样是必需的。
It sounds like the answer to your question is the somewhat flippant "Know your starting point", but maybe you want something a little more verbose. Given your string:
AND a starting point (Let's say the 8B is your starting point) there is only one possible interpretation of the bytes.
So let's say one operation is 8B EC 56 8B (Depending on your operation length, etc), then the NEXT operation is F4 68... In this case, it's impossible for the machine to try to interpret an operation 56 8B F4 68 because no operation ended at just that point.
Now, if your start point was the 56, then you can get that group but cannot get either of the ones mentioned previously.
The layout of your memory is very specific and start points/jump points are exact and unforgiving--they are required as surely as the code itself.
其他地方也可能有一些关于什么是有效起始地址的线索。总是有一个复位向量地址,并且通常有中断向量地址,它们都必须是代码块的有效起始点。更有用的是,如果您在其他地方遇到引用您尝试解码的块中的地址的跳转或调用指令,那么这将为您提供另一个起始地址。
我想我明白你的担忧,据我所知它是正确的 - 如果程序计数器被打乱 1 并导致执行无效指令或意外指令,程序可能会崩溃。确实如此,如果您遇到数据块并尝试执行它也是如此。至少可以通过使用哈佛架构来避免后者,其中代码和数据位于单独的存储空间中并且可以具有不同的位宽度。
There might also be some clues elsewhere about what is a valid starting address. There is always a reset vector address, and there are usually interrupt vector addresses, which all must be valid start points for blocks of code. More usefully, if you come across a jump or call instruction elsewhere which references an address in the block you are trying to decode, then that gives you another start address.
I think I see your worry, and as far as I know its correct - if the program counter gets upset by one and that causes invalid instructions or unintended instructions to be executed, the program probably crashes. True, and also if you run into a data block and try to execute that. At least the latter can be avoided by using a Harvard architecture, where code and data are in seperate memory spaces and may be different bit widths.
也许您会发现思考另一个方向很有趣:您将如何设计您的代码以便于其他人进行分段?您可能要求开始序列的字节的最高有效位为零,而序列中间的字节为 1,就像 UTF-8 那样。然后,如果您从随机位置开始 - 假设您知道字节在哪里 - 很容易找到下一个序列。更进一步,您将如何编码纯位流,以便轻松找到序列的开头。这样的编码浪费了多少比特?
既然你问的是数学,我认为相关的主题是“编码理论”、“变长码”或“前缀码”。
如何在碱基对序列中找到基因?
Maybe you find it interesting to think about the other direction: How would you have to design your code to be easy to segment for others? You could require the most significant bit of the byte starting a sequence to be zero, and those in the middle of a sequence to be one, like UTF-8 does it. Then if you start from a random position – assuming you know where the bytes are – it is easy to find the next sequence. Going one step further, how would you code a pure bit stream such that the start of a sequence is easy to find. How many bits were wasted by such a coding?
Since you asked about the maths, I think the relevant topics are “Coding Theory”, “Variable-length codes” or “Prefix codes”.
How do you find a gene in a sequence of base pairs?