Assembly-level function fingerprinting
I would like to determine, whether two functions in two executables were compiled from the same (C) source code, and would like to do so even if they were compiled by different compiler versions or with different compilation options. Currently, I'm considering implementing some kind of assembler-level function fingerprinting. The fingerprint of a function should have the properties that:
- two functions compiled from the same source under different circumstances are likely to have the same fingerprint (or similar one),
- two functions compiled from different C source are likely to have different fingerprints,
- (bonus) if the two source functions were similar, the fingerprints are also similar (for some reasonable definition of similar).
What I'm looking for right now is a set of properties of compiled functions that individually satisfy (1.) and taken together hopefully also (2.).
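To make property (3.) concrete: a fingerprint can be treated as a bag of counted features, with similarity scored as feature overlap. A minimal sketch, where the feature names and counts are invented purely for illustration:

```python
from collections import Counter

def similarity(fp_a: Counter, fp_b: Counter) -> float:
    """Weighted Jaccard similarity between two feature-count fingerprints."""
    keys = set(fp_a) | set(fp_b)
    if not keys:
        return 1.0
    inter = sum(min(fp_a[k], fp_b[k]) for k in keys)
    union = sum(max(fp_a[k], fp_b[k]) for k in keys)
    return inter / union

# Same source at different optimization levels: counts shift, overlap remains.
fp_o0 = Counter({"call": 3, "fp_insn": 2, "heap_access": 1, "loops": 2})
fp_o3 = Counter({"call": 3, "fp_insn": 2, "heap_access": 1, "loops": 1})
fp_other = Counter({"call": 1, "memory_barrier": 4})

assert similarity(fp_o0, fp_o3) > similarity(fp_o0, fp_other)
```

With such a score, "same fingerprint" becomes "similarity above a threshold", which tolerates small compilation-induced differences.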
Assumptions
Of course, this is generally impossible, but there might exist something that will work in most cases. Here are some assumptions that could make it easier:
- Linux ELF binaries (without debugging information available, though),
- not obfuscated in any way,
- compiled by gcc,
- on x86 Linux (an approach that could also be implemented on other architectures would be nice).
Ideas
Unfortunately, I have little to no experience with assembly. Here are some ideas for the abovementioned properties:
- types of instructions contained in the function (e.g. floating-point instructions, memory barriers)
- memory accesses from the function (does it read from / write to the heap? the stack?)
- library functions called (their names should be available in the ELF; also their order shouldn't usually change)
- shape of the control flow graph (I guess this will be highly dependent on the compiler)
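The first of these ideas can be prototyped without a full disassembler: given the mnemonics of a function (e.g. parsed out of `objdump -d` output), bucket them into coarse instruction classes. A sketch, with buckets that are illustrative and far from exhaustive:

```python
from collections import Counter

# Illustrative instruction-class buckets (not exhaustive); the goal is a
# fingerprint that survives register renaming and instruction scheduling.
CLASSES = {
    "float":   {"addss", "mulss", "addsd", "mulsd", "divsd", "fld", "fstp"},
    "barrier": {"mfence", "lfence", "sfence"},
    "call":    {"call", "callq"},
    "branch":  {"jmp", "je", "jne", "jg", "jle", "ja", "jb"},
    "stack":   {"push", "pop", "pushq", "popq"},
}

def instruction_class_fingerprint(mnemonics):
    """Map a sequence of mnemonics to counts of coarse instruction classes."""
    fp = Counter()
    for m in mnemonics:
        for cls, members in CLASSES.items():
            if m in members:
                fp[cls] += 1
    return fp

# Hypothetical function body: two stack ops, one FP multiply, one call.
fp = instruction_class_fingerprint(["push", "mulsd", "call", "pop", "mov"])
# fp == Counter({"stack": 2, "float": 1, "call": 1})
```

Counting classes rather than exact opcodes is what gives tolerance to different compiler versions and options.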
Existing work
I was able to find only tangentially related work:
- Automated approach which can identify crypto algorithms in compiled code: http://www.emma.rub.de/research/publications/automated-identification-cryptographic-primitives/
- Fast Library Identification and Recognition Technology in IDA disassembler; identifies concrete instruction sequences, but still contains some possibly useful ideas: http://www.hex-rays.com/idapro/flirt.htm
Do you have any suggestions regarding the function properties? Or a different idea which also accomplishes my goal? Or was something similar already implemented and I completely missed it?
FLIRT uses byte-level pattern matching, so it breaks down with any change in the instruction encodings (e.g. different register allocation or reordered instructions).
For graph matching, see BinDiff. While it's closed source, Halvar has described some of the approaches on his blog. They have even open-sourced some of the algorithms they use to generate fingerprints, in the form of the BinCrowd plugin.
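One cheap way to compare CFG shapes while ignoring basic-block names (in the spirit of, though far simpler than, BinDiff's matching) is a degree-sequence signature. A toy sketch, with hand-made example graphs:

```python
from collections import Counter

def cfg_signature(edges):
    """Crude CFG-shape signature: sorted multiset of (in-degree, out-degree)
    pairs. Invariant under block relabeling, but not structural changes."""
    indeg, outdeg = Counter(), Counter()
    nodes = set()
    for a, b in edges:
        outdeg[a] += 1
        indeg[b] += 1
        nodes |= {a, b}
    return sorted((indeg[n], outdeg[n]) for n in nodes)

# A diamond (if/else) keeps the same signature however its blocks are named.
cfg1 = [("entry", "then"), ("entry", "else"), ("then", "exit"), ("else", "exit")]
cfg2 = [(0, 1), (0, 2), (1, 3), (2, 3)]
assert cfg_signature(cfg1) == cfg_signature(cfg2)
```

Real tools match graphs much more carefully, but even a signature this coarse can serve as a fast pre-filter before expensive comparisons.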
In my opinion, the easiest way to do something like this would be to decompose the function's assembly back into some higher-level form where constructs (like for, while, function calls, etc.) exist, then match the structure of these higher-level constructs. This would prevent instruction reordering, loop hoisting, loop unrolling and any other optimizations from messing with the comparison; you can even (de)optimize these higher-level structures to their maximum on both ends to ensure they are at the same point, so that comparisons between unoptimized debug code and -O3 won't fail due to missing temporaries, lack of register spills, etc. You could use something like Boomerang as a basis for the decompilation (except you wouldn't spit out C code).
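As a toy illustration of why matching on normalized higher-level structure is robust, collapsing adjacent repeats makes an unrolled loop body compare equal to the rolled one. The construct encoding below is invented for the example:

```python
def normalize(constructs):
    """Collapse adjacent repeated constructs so that, e.g., an unrolled loop
    body matches the rolled original. `constructs` is a recovered sequence
    of high-level elements such as ["loop", "call:f", "end"]."""
    out = []
    for c in constructs:
        if not out or out[-1] != c:
            out.append(c)
    return out

# A 4x-unrolled body and the original loop normalize to the same skeleton.
rolled   = ["loop", "call:f", "end"]
unrolled = ["loop", "call:f", "call:f", "call:f", "call:f", "end"]
assert normalize(rolled) == normalize(unrolled)
```

A real implementation would normalize recovered ASTs rather than flat sequences, but the principle of canonicalizing both sides before comparing is the same.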
I suggest you approach this problem from the standpoint of the language the code was written in and what constraints that code puts on compiler optimization.
I'm not very familiar with the C standard, but C++ has the concept of "observable" behavior. The standard carefully defines this, and compilers are given great latitude in optimizing as long as the result gives the same observable behavior. My recommendation for trying to determine if two functions are the same would be to try to determine what their observable behavior is (what I/O they do, and how they interact with other areas of memory and in what order).
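One approximation of observable behavior that is recoverable from a stripped ELF is the ordered sequence of library calls a function makes; two such traces can be aligned with a longest-common-subsequence score. A sketch, with hypothetical call traces:

```python
def lcs_len(a, b):
    """Classic LCS length, used to score how well two call traces align."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1],
                                                               dp[i + 1][j])
    return dp[-1][-1]

# Proxy for observable behavior: the ordered library calls a function makes.
trace_a = ["malloc", "memcpy", "printf", "free"]
trace_b = ["malloc", "memcpy", "fprintf", "free"]  # hypothetical variant build
score = lcs_len(trace_a, trace_b) / max(len(trace_a), len(trace_b))
assert score == 0.75
```

Using a subsequence rather than exact equality tolerates inserted or substituted calls while still rewarding a preserved order.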
If the problem set can be reduced to a small set of known C or C++ source code functions being compiled by n different compilers, each with m[n] different sets of compiler options, then a straightforward, if tedious, solution would be to compile the code with every combination of compiler and options and catalog the resulting instruction bytes, or more efficiently, their hash signatures in a database. The set of likely compiler options is potentially large, but in actual practice engineers typically use a pretty standard and small set of options, usually just minimally optimized for debugging and fully optimized for release. Researching many project configurations might reveal there are only two or three more in any engineering culture, relating to prejudice or superstition about how compilers work, whether accurate or not.
I suspect this approach is closest to what you actually want: a way of investigating suspected misappropriated source code. All the suggested techniques of reconstructing the compiler's parse tree might bear fruit, but have great potential for overlooked symmetric solutions or ambiguous unsolvable cases.
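The catalog described above can be as simple as a hash-to-build map. A sketch, with placeholder byte strings standing in for extracted function bodies:

```python
import hashlib

def catalog(builds):
    """Map sha256(function bytes) -> (compiler, options) for known builds."""
    db = {}
    for (compiler, opts), func_bytes in builds.items():
        db[hashlib.sha256(func_bytes).hexdigest()] = (compiler, opts)
    return db

def lookup(db, func_bytes):
    """Return the (compiler, options) that produced these bytes, or None."""
    return db.get(hashlib.sha256(func_bytes).hexdigest())

# Placeholder blobs; real entries would be the exact bytes of each compiled
# function, extracted from the reference binaries.
db = catalog({
    ("gcc-4.6", "-O0"): b"\x55\x89\xe5...",
    ("gcc-4.6", "-O2"): b"\x55\x31\xc0...",
})
assert lookup(db, b"\x55\x31\xc0...") == ("gcc-4.6", "-O2")
```

Exact-byte hashing only matches builds actually in the catalog, which is precisely the trade-off this answer accepts: no fuzziness, but no false positives either.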