Assembly-level function fingerprinting
I would like to determine, whether two functions in two executables were compiled from the same (C) source code, and would like to do so even if they were compiled by different compiler versions or with different compilation options. Currently, I'm considering implementing some kind of assembler-level function fingerprinting. The fingerprint of a function should have the properties that:
- two functions compiled from the same source under different circumstances are likely to have the same fingerprint (or similar one),
- two functions compiled from different C source are likely to have different fingerprints,
- (bonus) if the two source functions were similar, the fingerprints are also similar (for some reasonable definition of similar).
What I'm looking for right now is a set of properties of compiled functions that individually satisfy (1.) and taken together hopefully also (2.).
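To make property (3.) concrete: a fingerprint can be treated as a bag of counted features, with similarity scored as feature overlap. A minimal sketch, where the feature names and counts are invented purely for illustration:

```python
from collections import Counter

def similarity(fp_a: Counter, fp_b: Counter) -> float:
    """Weighted Jaccard similarity between two feature-count fingerprints."""
    keys = set(fp_a) | set(fp_b)
    if not keys:
        return 1.0
    inter = sum(min(fp_a[k], fp_b[k]) for k in keys)
    union = sum(max(fp_a[k], fp_b[k]) for k in keys)
    return inter / union

# Same source at different optimization levels: counts shift, overlap remains.
fp_o0 = Counter({"call": 3, "fp_insn": 2, "heap_access": 1, "loops": 2})
fp_o3 = Counter({"call": 3, "fp_insn": 2, "heap_access": 1, "loops": 1})
fp_other = Counter({"call": 1, "memory_barrier": 4})

assert similarity(fp_o0, fp_o3) > similarity(fp_o0, fp_other)
```

With such a score, "same fingerprint" becomes "similarity above a threshold", which tolerates small compilation-induced differences.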
Assumptions
Of course, this is generally impossible, but there might exist something that will work in most cases. Here are some assumptions that could make it easier:
- Linux ELF binaries (without debugging information available, though),
- not obfuscated in any way,
- compiled by gcc,
- on x86 Linux (an approach that could also be implemented on other architectures would be nice).
Ideas
Unfortunately, I have little to no experience with assembly. Here are some ideas for the abovementioned properties:
- types of instructions contained in the function (e.g. floating-point instructions, memory barriers)
- memory accesses from the function (does it read from / write to the heap? the stack?)
- library functions called (their names should be available in the ELF; also their order shouldn't usually change)
- shape of the control flow graph (I guess this will be highly dependent on the compiler)
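The first of these ideas can be prototyped without a full disassembler: given the mnemonics of a function (e.g. parsed out of `objdump -d` output), bucket them into coarse instruction classes. A sketch, with buckets that are illustrative and far from exhaustive:

```python
from collections import Counter

# Illustrative instruction-class buckets (not exhaustive); the goal is a
# fingerprint that survives register renaming and instruction scheduling.
CLASSES = {
    "float":   {"addss", "mulss", "addsd", "mulsd", "divsd", "fld", "fstp"},
    "barrier": {"mfence", "lfence", "sfence"},
    "call":    {"call", "callq"},
    "branch":  {"jmp", "je", "jne", "jg", "jle", "ja", "jb"},
    "stack":   {"push", "pop", "pushq", "popq"},
}

def instruction_class_fingerprint(mnemonics):
    """Map a sequence of mnemonics to counts of coarse instruction classes."""
    fp = Counter()
    for m in mnemonics:
        for cls, members in CLASSES.items():
            if m in members:
                fp[cls] += 1
    return fp

# Hypothetical function body: two stack ops, one FP multiply, one call.
fp = instruction_class_fingerprint(["push", "mulsd", "call", "pop", "mov"])
# fp == Counter({"stack": 2, "float": 1, "call": 1})
```

Counting classes rather than exact opcodes is what gives tolerance to different compiler versions and options.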
Existing work
I was able to find only tangentially related work:
- Automated approach which can identify crypto algorithms in compiled code: http://www.emma.rub.de/research/publications/automated-identification-cryptographic-primitives/
- Fast Library Identification and Recognition Technology in IDA disassembler; identifies concrete instruction sequences, but still contains some possibly useful ideas: http://www.hex-rays.com/idapro/flirt.htm
Do you have any suggestions regarding the function properties? Or a different idea which also accomplishes my goal? Or was something similar already implemented and I completely missed it?
FLIRT uses byte-level pattern matching, so it breaks down with any change in the instruction encodings (e.g. different register allocation or reordered instructions).
For graph matching, see BinDiff. While it's closed source, Halvar has described some of the approaches on his blog. They have even open-sourced some of the algorithms they use to generate fingerprints, in the form of the BinCrowd plugin.
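One cheap way to compare CFG shapes while ignoring basic-block names (in the spirit of, though far simpler than, BinDiff's matching) is a degree-sequence signature. A toy sketch, with hand-made example graphs:

```python
from collections import Counter

def cfg_signature(edges):
    """Crude CFG-shape signature: sorted multiset of (in-degree, out-degree)
    pairs. Invariant under block relabeling, but not structural changes."""
    indeg, outdeg = Counter(), Counter()
    nodes = set()
    for a, b in edges:
        outdeg[a] += 1
        indeg[b] += 1
        nodes |= {a, b}
    return sorted((indeg[n], outdeg[n]) for n in nodes)

# A diamond (if/else) keeps the same signature however its blocks are named.
cfg1 = [("entry", "then"), ("entry", "else"), ("then", "exit"), ("else", "exit")]
cfg2 = [(0, 1), (0, 2), (1, 3), (2, 3)]
assert cfg_signature(cfg1) == cfg_signature(cfg2)
```

Real tools match graphs much more carefully, but even a signature this coarse can serve as a fast pre-filter before expensive comparisons.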
In my opinion, the easiest way to do something like this would be to decompose the function's assembly back into some higher-level form where constructs (like for, while, function calls, etc.) exist, then match the structure of these higher-level constructs. This would prevent instruction reordering, loop hoisting, loop unrolling and any other optimizations from messing with the comparison; you can even (de)optimize these higher-level structures to their maximum on both ends to ensure they are at the same point, so that comparisons between unoptimized debug code and -O3 won't fail due to missing temporaries, lack of register spills, etc. You could use something like Boomerang as a basis for the decompilation (except you wouldn't spit out C code).
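As a toy illustration of why matching on normalized higher-level structure is robust, collapsing adjacent repeats makes an unrolled loop body compare equal to the rolled one. The construct encoding below is invented for the example:

```python
def normalize(constructs):
    """Collapse adjacent repeated constructs so that, e.g., an unrolled loop
    body matches the rolled original. `constructs` is a recovered sequence
    of high-level elements such as ["loop", "call:f", "end"]."""
    out = []
    for c in constructs:
        if not out or out[-1] != c:
            out.append(c)
    return out

# A 4x-unrolled body and the original loop normalize to the same skeleton.
rolled   = ["loop", "call:f", "end"]
unrolled = ["loop", "call:f", "call:f", "call:f", "call:f", "end"]
assert normalize(rolled) == normalize(unrolled)
```

A real implementation would normalize recovered ASTs rather than flat sequences, but the principle of canonicalizing both sides before comparing is the same.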
I suggest you approach this problem from the standpoint of the language the code was written in and what constraints that code puts on compiler optimization.
I'm not very familiar with the C standard, but C++ has the concept of "observable" behavior. The standard carefully defines this, and compilers are given great latitude in optimizing as long as the result gives the same observable behavior. My recommendation for trying to determine if two functions are the same would be to try to determine what their observable behavior is (what I/O they do, and how they interact with other areas of memory and in what order).
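One approximation of observable behavior that is recoverable from a stripped ELF is the ordered sequence of library calls a function makes; two such traces can be aligned with a longest-common-subsequence score. A sketch, with hypothetical call traces:

```python
def lcs_len(a, b):
    """Classic LCS length, used to score how well two call traces align."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1],
                                                               dp[i + 1][j])
    return dp[-1][-1]

# Proxy for observable behavior: the ordered library calls a function makes.
trace_a = ["malloc", "memcpy", "printf", "free"]
trace_b = ["malloc", "memcpy", "fprintf", "free"]  # hypothetical variant build
score = lcs_len(trace_a, trace_b) / max(len(trace_a), len(trace_b))
assert score == 0.75
```

Using a subsequence rather than exact equality tolerates inserted or substituted calls while still rewarding a preserved order.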
If the problem set can be reduced to a small set of known C or C++ source code functions being compiled by n different compilers, each with m[n] different sets of compiler options, then a straightforward, if tedious, solution would be to compile the code with every combination of compiler and options and catalog the resulting instruction bytes, or more efficiently, their hash signatures in a database. The set of likely compiler options is potentially large, but in actual practice engineers typically use a pretty standard and small set of options, usually just minimally optimized for debugging and fully optimized for release. Researching many project configurations might reveal there are only two or three more in any engineering culture, relating to prejudice or superstition about how compilers work, whether accurate or not.
I suspect this approach is closest to what you actually want: a way of investigating suspected misappropriated source code. All the suggested techniques of reconstructing the compiler's parse tree might bear fruit, but have great potential for overlooked symmetric solutions or ambiguous unsolvable cases.
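The catalog described above can be as simple as a hash-to-build map. A sketch, with placeholder byte strings standing in for extracted function bodies:

```python
import hashlib

def catalog(builds):
    """Map sha256(function bytes) -> (compiler, options) for known builds."""
    db = {}
    for (compiler, opts), func_bytes in builds.items():
        db[hashlib.sha256(func_bytes).hexdigest()] = (compiler, opts)
    return db

def lookup(db, func_bytes):
    """Return the (compiler, options) that produced these bytes, or None."""
    return db.get(hashlib.sha256(func_bytes).hexdigest())

# Placeholder blobs; real entries would be the exact bytes of each compiled
# function, extracted from the reference binaries.
db = catalog({
    ("gcc-4.6", "-O0"): b"\x55\x89\xe5...",
    ("gcc-4.6", "-O2"): b"\x55\x31\xc0...",
})
assert lookup(db, b"\x55\x31\xc0...") == ("gcc-4.6", "-O2")
```

Exact-byte hashing only matches builds actually in the catalog, which is precisely the trade-off this answer accepts: no fuzziness, but no false positives either.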