如何将十六进制序列毫无歧义地转换为汇编语言？

发布于 2024-09-26 19:59:57 字数 215 浏览 9 评论 0原文

8B EC 56 8B F4 68 00 70 40 00 FF 15 BC 82 40

像上面这样的序列可以通过多种方式进行分段，每个段都可以翻译为相应的汇编指令，但是每个二进制可执行文件都有其唯一的确定的汇编，避免歧义的数学原理是什么？

更新

得票最多的答案实际上根本没有回答我的问题。

原文

8B EC 56 8B F4 68 00 70 40 00 FF 15 BC 82 40

A senquence like above can be segmented in various ways,each segment can be translated to corresponding assembly instruction, but each binary executable has its only DEFINITE assembly ,what's the mathematical principle that avoids ambiguity?

UPDATE

The answer with most votes doesn't actually answer my question at all.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

浅蓝的眸勾画不出的柔情 2024-10-03 19:59:57

知道你的出发点。

换句话说，给定指令的特定起始字节，指令结束的位置是明确的，从而为您提供下一条指令的起始字节并允许您继续。给定一个任意的内存块，如果不知道第一条指令从哪里开始，就不可能将其分解为单独的指令。

从更数学的角度来看，不存在其字节是另一有效指令的前缀的有效指令。因此，如果 ab 有效，那么您就知道 ab cd 不可能有效，因此 ab 必须是一条指令，而 cd > 是下一条指令的开始。

回复收藏 0 原文

顾忌 2024-10-03 19:59:57

如果我正确理解您的问题，您正在尝试理解为什么

8B EC 56 8B F4 68 00 70 40 40 00 FF 15 BC 82 40

可以将EG分为EG，例如

8BEC 568BF4 68007040 00FF 15BC 8240

RATHERN，而不是

8B EC568B F4 680070704040404040 00FF 15BC 8240

这完全由您的架构的 ISA 指定。该文档准确地描述了如何从一系列字节唯一构造指令。

为了使 ISA 格式良好，单个字节系列最多可以对应于单个解码指令系列（如果存在无效指令，则可能会更少）。

为了更具体一点，让我们以 x86 为例：如果您想知道每个字节对应什么，请查看

您会看到，例如，以 00 开头的指令是一条添加指令（附加参数位于下一个字节中，具有特定的编码）。

您还会看到，某些值实际上是修改以下指令的前缀（0F - 扩展操作码空间的前缀，26、2E、36、3E、64、65、66、67、F0、F2、F3），以及其中一些根据下面的确切说明具有不同的含义。这些不是操作码，但它们可以改变操作码参数的编码，或者引入全新的操作码空间（例如SSE使用0F）。

总的来说，x86 编码非常复杂，感谢反汇编程序。

回复收藏 0 原文

清风疏影 2024-10-03 19:59:57

首先，您必须区分RISC和CISC架构。

在 RISC 架构中，通常具有相同大小的指令，因此不会出现歧义。例如，您的 CPU 将为每条指令获取 4 个字节，并且由于它必须从某个地方开始（您的 CPU 不只是像您提供的序列那样，它肯定会有一个起点）一旦它有正确对齐不会出现任何问题。

CISC 指令集发生的情况本质上是相同的：从程序的入口点开始，它将根据您的操作码获取指令。它不需要知道如何在数学上区分歧义，因为它不会发生这样的情况：它只是不知道下一条指令有多长或上一条指令在哪里完成。

因此，询问如何分隔每条指令就像询问如何分隔中的每个单词一样

桌子上的笔

没有数学证明，但你知道哪些字母组合在一起是正确的，哪些字母没有意义。前面的句子包含“son”，但你知道它是从“is on”获得的。如果没有有意义的短语，你就无法这么说，但你的CPU只执行有意义的程序，那有什么意义呢？

因此，如果 CPU 可以处理前一个句子，它将找到第一个有意义的指令“the”，然后是“pen”、“is”、“on”，而“son”无论如何都不会被识别。

编辑：

需要澄清的是，在 CISC 架构中，您必须确保不存在歧义的唯一限制是避免指令是另一个指令的前缀。让我们假设一个由字母 az 而不是十六进制数字组成的有限字母表（仅出于实用目的）。

如果程序计数器指向

abbcbcaabdeffabd

你可以知道abb是一条完整的指令。在这种情况下，ab就不是一条有效的指令，否则CPU不知道在哪里停止，同时abbc也不能是一条指令，或者它可能会产生问题。保持它，例如，您可以知道 ca 是下一条指令，c 不能，cbc 都不是。

您可以将此论证扩展到整个字符串。你会看到，如果CPU发现自己处于二进制的下一个字节指向指令的第一个字节的状态，并且没有指令是其他指令的前缀，那么在下一个状态下，程序计数器将指向下一条正确指令的第一个字节。

First of all you have to distinguish between RISC and CISC architectures.

In a RISC architecture you usually have instructions of the same size, so ambiguity cannot be presented. Your CPU will fetch for example 4 bytes for every instruction, and since it will have to start from somewhere (your CPU doesn't have just a sequence like the one you presented, it will have a starting point for sure) once that it has the right alignment no problem can occur.

What happens with a CISC instruction set is essentially the same: starting from the entry point of the program it will fetch instructions accordingly to your opcodes. It doesn't need to know how to matematically distinguish ambiguities since it won't happen that it just doesn't know how long is the next instruction or where the last one finished.

So asking how to separate every instruction is like asking how to separate every word in

thepenisonthetable

There's not mathematical proof but you know which letters are correct together and which ones are not meaningful. The previous sentence contains "son" but you know that it is obtained from "is on". You wouldn't be able to say so without having a meaningful phrase, but your CPU only executes meaningful programs so what's the point?

So if the CPU could work on the previous sentence it will find the first senseful instruction "the", then "pen", "is", "on" and the "son" couldn't never be recognized anyway.

EDIT:

To be cleared, in CISC architectures, the only contraint you have to be sure not to have ambiguities is to avoid having an instruction that is a prefix of another. Let's assume a finite alphabet composed by letters a-z instead that hex numbers (just for practical purposes).

If the program counter points to

abbcbcaabdeffabd

you can have that abb is a whole instruction. In that case ab wouldn't be a valid instruction, otherwise the CPU couldn't know where to stop, at the same time abbc can't be an instruction too or it may create problems. Keeping it on you can have for example that ca is the next instruction, c couldn't and cbc neither.

You can extend this argumentation to the whole string. You will see that, if the CPU finds itself in a state in which the next byte of the binary points to the FIRST byte of an instruction, and there are no instruction that are prefixes of other instruction, then in the next state the program counter will point to the FIRST byte of the next, correct, instruction.

回复收藏 0 原文

哆兒滾 2024-10-03 19:59:57

如果您在十六进制编辑器中打开二进制文件，复制部分数据并粘贴到反汇编程序中，您很可能不会复制完整的指令。让我们用你的例子..在Windows XP 32位SP3英语中，如果我汇编 8B EC 56 8B F4 68 00 70 40 00 FF 15 BC 82 40 我会得到：

Hex dump          Command
8BEC              mov ebp,esp
56                push esi
8BF4              mov esi,esp
68 00704000       push 407000
FF15 BC824000     call dword ptr ds:[4082bc]

正如你所看到的与下面其他人的答案完全不同...

现在让我们假设不是组装 8B EC 56 8B F4 68 00 70 40 00 FF 15 BC 82 40
您在开头添加了 C0 操作码 C0 8B EC 56 8B F4 68 00 70 40 00 FF 15 BC 82 40

现在检查一下单个字节在我们的指令中的作用

Hex dump             Command
C08B EC568BF4 68     ror byte ptr ds:[ebx+f48b56ec],68
0070 40              add byte ptr ds:[eax+40],dh
00FF                 add bh,bh
15 BC824000          adc eax,4082bc

：你可以看到它被完全修改了！

处理器知道从哪里开始以及操作码指令要采用的指令参数。在第一个示例中，我们的第一个操作码是 8B，因此它知道它后面可以跟另一个字节。如果这个字节是EC，那么指令到这里结束，它的意思是mov ebp, esp。

在我们的第二个示例中，它以 C0 开头，后面可以跟另一个字节，表示另一条指令。那么C08B是指令，EC568BF4 68是参数。

想象一下，处理器内部有一个巨大的（但纳米级的）电路，其行为就像一系列 ifs（条件），根据十六进制代码（或操作码）的值，它知道“去哪里”或“如何表现” 。

If you open a binary in a hex editor, copy a portion of data and paste in a disassembler, you will very probably not copy a complete instruction. Let's use your example.. in a Windows XP 32bits SP3 English, if I assembly 8B EC 56 8B F4 68 00 70 40 00 FF 15 BC 82 40 I'll get:

Hex dump          Command
8BEC              mov ebp,esp
56                push esi
8BF4              mov esi,esp
68 00704000       push 407000
FF15 BC824000     call dword ptr ds:[4082bc]

As you can see it assembled completely different then the answer of the other guys below...

Now let's pretend that instead of assembling 8B EC 56 8B F4 68 00 70 40 00 FF 15 BC 82 40
you added C0 opcode at the beginning C0 8B EC 56 8B F4 68 00 70 40 00 FF 15 BC 82 40

Now check below what a single byte did in our instructions:

Hex dump             Command
C08B EC568BF4 68     ror byte ptr ds:[ebx+f48b56ec],68
0070 40              add byte ptr ds:[eax+40],dh
00FF                 add bh,bh
15 BC824000          adc eax,4082bc

As you can see it was completely modified!

The processor know where to start and what arguments to the instruction to take by the opcode instruction. In the first example our first opcode was 8B so it knows that it can be followed by another byte. If this byte is EC so the instruction ends here, and it means mov ebp, esp.

In our second example it start with C0 and it can be followed by another byte, meaning another instruction. Then C08B is the instruction, and EC568BF4 68 is the argument.

Imagine that inside the processor has a huge (but nano) circuit, that behave like a chain of ifs (conditions), that depending on the value of the hexcode (or opcode) it knows "where to go" or "how to behave".

回复收藏 0 原文

小霸王臭丫头 2024-10-03 19:59:57

您列出的序列恰好显示 1 个数字。在二进制中，它是 100010111110110001010110100010111111010001101000000000000111000001000000000000001111111100010101101111001000001001000000.十进制表示为 726522768938664460674442126658667072。这些只是书写完全相同的值的不同方式。特定架构的 ISA 会将位划分为字段并赋予它们含义。大多数处理器都有很容易获得的手册，其中描述了分配给这些字段中每个位的含义。

回复收藏 0 原文

寒冷纷飞旳雪 2024-10-03 19:59:57

不要将线性尝试反汇编与代码的执行顺序混淆。二进制文件按照执行顺序进行解码，从已知位置开始。除了出于各种原因的故意黑客攻击之外，没有任何含糊之处。

尝试为可变字长指令集编写反汇编程序。最终，它必须按执行顺序完成，即使在那里，您也只能反汇编部分程序，因为某些分支可以基于运行时计算的地址。现代编译器生成的代码比旧的手工汇编的代码要好得多。例如，在老式的街机游戏中，有条件分支，前面有一条指令，保证只满足其中一个条件（为什么会在那里？我们永远不会知道），并且条件分支后面的数据类似于以下操作码这样你就会遇到与其他操作码的冲突。

试图击败反汇编程序的旧dos程序将具有自修改代码，一条指令会提前一到两条指令修改另一条指令，如果单步执行，则会发生修改，但如果全速运行，则指令已经在管道中获取，并且内存中修改/损坏的一个未被使用。非常巧妙的技巧。

无论如何，您的问题的答案是不要按线性顺序查看字节，而是从向量表中的复位和其他向量定义的地址开始按执行顺序查看它们。

回复收藏 0 原文

紫南 2024-10-03 19:59:57

听起来你的问题的答案是有点轻率的“知道你的起点”，但也许你想要更详细的东西。给定您的字符串：

8B EC 56 8B F4 68 00 70 40 00 FF 15 BC 82 40

AND 一个起点（假设 8B 是您的起点），对字节只有一种可能的解释。

因此，假设一个操作是 8B EC 56 8B （取决于您的操作长度等），那么下一个操作是 F4 68...在这种情况下，机器不可能尝试解释操作 56 8B F4 68 因为此时没有任何操作结束。

现在，如果您的起点是 56，那么您可以获得该组，但无法获得前面提到的任何一个。

内存的布局非常具体，起始点/跳转点是精确且无情的——它们和代码本身一样是必需的。

It sounds like the answer to your question is the somewhat flippant "Know your starting point", but maybe you want something a little more verbose. Given your string:

8B EC 56 8B F4 68 00 70 40 00 FF 15 BC 82 40

AND a starting point (Let's say the 8B is your starting point) there is only one possible interpretation of the bytes.

So let's say one operation is 8B EC 56 8B (Depending on your operation length, etc), then the NEXT operation is F4 68... In this case, it's impossible for the machine to try to interpret an operation 56 8B F4 68 because no operation ended at just that point.

Now, if your start point was the 56, then you can get that group but cannot get either of the ones mentioned previously.

The layout of your memory is very specific and start points/jump points are exact and unforgiving--they are required as surely as the code itself.

回复收藏 0 原文

自演自醉 2024-10-03 19:59:57

其他地方也可能有一些关于什么是有效起始地址的线索。总是有一个复位向量地址，并且通常有中断向量地址，它们都必须是代码块的有效起始点。更有用的是，如果您在其他地方遇到引用您尝试解码的块中的地址的跳转或调用指令，那么这将为您提供另一个起始地址。

我想我明白你的担忧，据我所知它是正确的 - 如果程序计数器被打乱 1 并导致执行无效指令或意外指令，程序可能会崩溃。确实如此，如果您遇到数据块并尝试执行它也是如此。至少可以通过使用哈佛架构来避免后者，其中代码和数据位于单独的存储空间中并且可以具有不同的位宽度。

回复收藏 0 原文

过去的过去 2024-10-03 19:59:57

也许您会发现思考另一个方向很有趣：您将如何设计您的代码以便于其他人进行分段？您可能要求开始序列的字节的最高有效位为零，而序列中间的字节为 1，就像 UTF-8 那样。然后，如果您从随机位置开始 - 假设您知道字节在哪里 - 很容易找到下一个序列。更进一步，您将如何编码纯位流，以便轻松找到序列的开头。这样的编码浪费了多少比特？

既然你问的是数学，我认为相关的主题是“编码理论”、“变长码”或“前缀码”。

如何在碱基对序列中找到基因？

回复收藏 0 原文

~没有更多了~