Symbolicating a stripped binary with symbols from an old debug build (inexact graph matching)
I have binary A, a debug build with accompanying symbols, built many years ago. I also have binary B, a much more recent release build without accompanying symbols. I am looking for the most efficient method of matching the symbols from binary A to potential candidates in binary B.
Given that the debug build is considerably bigger (does more input validation, prints more things to stderr, etc.) and that functions invariably change over time, I figure that trying to fingerprint the individual functions will be wasted time.
Therefore, I've decided -- quite out of thin air, so I could be barking up the wrong tree -- that the best way to fingerprint the functions is to create call graphs of both binaries and try to match the vertices (i.e. the functions).
I have already done some preprocessing, so I have the following data structures:
# binary A
[[60, 60, 8734], # function 0 is called by functions 60 (twice) and 8734
[193, 441, 505], # function 1 is called by functions 193, 441 and 505
[193, 742],
[23],
[21],
[21],
[26],
[26, 1508, 1509, 1573],
[24],
[25],
...] # (~10k functions)
# binary B
[[8999], # function 0 is called by function 8999
[9016], # function 1 is called by function 9016
[1126],
[7904, 7904, 7913],
[182, 336, 396, 396],
[9010],
[407],
[182, 632],
[20],
[24],
...] # (~10k functions)
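For concreteness, this is roughly how such a called-by representation can be inverted into a forward call graph. A minimal Python sketch, using a small, hypothetical caller list in the same format as above:

```python
from collections import defaultdict

# Hypothetical caller lists: callers[i] lists the functions that call
# function i, with multiplicity (same format as the data above).
callers = [
    [60, 60, 8734],   # function 0 is called twice by 60, once by 8734
    [193, 441, 505],  # function 1 is called by 193, 441 and 505
]

# Invert into a forward adjacency map: calls[f] lists the functions f calls.
calls = defaultdict(list)
for callee, caller_list in enumerate(callers):
    for caller in caller_list:
        calls[caller].append(callee)

print(dict(calls))  # {60: [0, 0], 8734: [0], 193: [1], 441: [1], 505: [1]}
```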
An important note to take away is that there is no correspondence between function "0" in binary A and function "0" in binary B. These are arbitrary IDs I have assigned to each function in each binary.
The next step is that which confounds me. My algorithm-fu is very weak and I cannot think of a clever way to proceed. My (very limited) understanding is that to solve this, I would want to employ some form of inexact graph matching. In other words, which set of mappings Ai -> Bi would maximise the similarity of the two call-graphs?
Given that there are additional debugging functions in binary A and the obvious fact that programs evolve over time, there is likely no exact match. Ideally, I would want output of the form:
[[(37, 0.998), (8432, 0.912), (442, 0.75)], # matching-ness of function "0" in binary A with function "37" in binary B is 0.998, second most likely candidate is function "8432" in binary B with score 0.912, etc.
[(42, 0.973), (7751, 0.788)], # matching-ness of function "1" in binary A with function "42" in binary B is 0.973, second most likely candidate is function "7751" in binary B with score 0.788, etc.
[(4579, 0.996), (123, 0.934)],
...] # around ~10k mappings
In reality, I would be happy to make do with only one candidate and no ranking, but one can dream.
Any SO-goers have an idea of where I should start?
2 Answers
Certainly an interesting problem, though I suspect it will be hard to solve. It appears to be an instance of approximate graph isomorphism on directed graphs. I didn't find much googling for this, but here's some software for solving it on undirected general graphs, a more general case which is NP-hard.
Aligning Executable Code
I think the most practical thing you may be able to do is to forget about the runtime information and simply take the executable code sections of each version and use a global alignment algorithm (e.g. Needleman-Wunsch, although there do exist much faster but less accurate algorithms) on them, rewarding matches between CALL instructions, and possibly other "reliable" sequences of instructions.

Assuming the order in which the functions appear in the executables hasn't changed too much (which it won't have, unless the optimised version used some optimisation that gets the linker to place functions that call each other near each other), this should get you a nice first approximation.
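A minimal sketch of that alignment idea, assuming the code sections have already been disassembled into token sequences (the mnemonics and scoring weights here are illustrative, not tuned):

```python
def needleman_wunsch(a, b, gap=-1):
    """Global alignment score between two instruction-token sequences.

    Matching CALL tokens is rewarded more heavily than other matches,
    since CALL sites are the "reliable" anchors we care about.
    """
    def score(x, y):
        if x != y:
            return -1          # mismatch penalty
        return 2 if x == "CALL" else 1

    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + gap
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score(a[i - 1], b[j - 1]),  # (mis)match
                dp[i - 1][j] + gap,                            # gap in b
                dp[i][j - 1] + gap,                            # gap in a
            )
    return dp[n][m]

# The debug build has extra instructions around the same two CALLs.
debug   = ["PUSH", "CALL", "TEST", "CALL", "RET"]
release = ["CALL", "CALL", "RET"]
print(needleman_wunsch(debug, release))  # 3
```

The quadratic table is fine for a sketch; for whole code sections you would want one of the faster approximate aligners mentioned above.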
The Assignment Problem (Bipartite Maximum Matching)
Alternatively, if you can find a way (and my intuition suggests it would need to be an iterative approach, along the lines of how PageRank decides the value of a webpage) to "score" the likelihood that a function f in the debug version corresponds to a function g in the optimised version, then yes, you could use a graph matching technique. In this case the vertices of the graph would be all functions in both versions, and there would be a weighted edge between every function from the debug version and every function from the optimised version, with the weight determined by your scoring system.

This graph would be bipartite, because there will never be an edge between two functions in the same version. That means it is an instance of the Assignment Problem, for which quite good (and not too complicated) algorithms exist.
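Once you have a score matrix, solving the assignment is the easy part. A toy sketch with a hypothetical 3x3 score matrix, brute-forcing all permutations (at ~10k functions you would instead use the Hungarian algorithm, e.g. scipy.optimize.linear_sum_assignment):

```python
from itertools import permutations

# Hypothetical scores[i][j]: likelihood that debug function i matches
# release function j, produced by whatever scoring system you devise.
scores = [
    [0.9, 0.1, 0.3],
    [0.2, 0.8, 0.1],
    [0.1, 0.4, 0.7],
]

def best_assignment(scores):
    """Brute-force the one-to-one assignment maximising total score.

    Only viable for toy sizes; a real solver (Hungarian algorithm)
    handles this in polynomial time.
    """
    n = len(scores)
    best = max(permutations(range(n)),
               key=lambda p: sum(scores[i][p[i]] for i in range(n)))
    return list(enumerate(best))

print(best_assignment(scores))  # [(0, 0), (1, 1), (2, 2)]
```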
However, the missing piece here is the means of deciding the weights for each pairing. One approximate way of doing this would be to build a vector counting the number of immediate children, grandchildren, great-grandchildren and so on for each function. You could then compare these vectors using any distance measure you like. But I expect doing this here will not work well, because we already expect the debug version to contain far more calls to functions than the optimised version.
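That descendant-vector idea can be sketched as follows, assuming a forward call graph (the toy graph here is hypothetical), with cosine similarity as the distance measure:

```python
import math

# Hypothetical forward call graph: calls[f] = set of functions f calls.
calls = {0: {1, 2}, 1: {3}, 2: {3, 4}, 3: set(), 4: set()}

def descendant_vector(f, depth=3):
    """Count distinct (children, grandchildren, ...) at each depth."""
    frontier, vec = {f}, []
    for _ in range(depth):
        frontier = set().union(*(calls[g] for g in frontier)) if frontier else set()
        vec.append(len(frontier))
    return vec

def cosine(u, v):
    """Cosine similarity between two count vectors (1.0 = identical shape)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

va, vb = descendant_vector(0), descendant_vector(2)
print(va, vb, round(cosine(va, vb), 3))  # [2, 2, 0] [2, 0, 0] 0.707
```

As noted above, the systematic size difference between the builds means raw counts like these would likely need normalising before they are comparable.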
Using the Full Call Tree
If you have access to the entire call tree for both, this will give you more information: the sequence of calls made within a function, as well as knowledge of the exact hierarchy of calls. You might then be able to build a "signature" for each function from the sequence of calls it makes.
Now, Levenshtein distance can be used to compare 2 signatures. For more accuracy at the expense of more computation, you could use a variation in which up to k distinct functions in the debug version are allowed to be deleted, for some small k (e.g. k = 3), and the best Levenshtein distance is taken over all such "slimmed down" versions, with a small penalty attached that is proportional to the number of functions deleted.
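The base comparison is ordinary Levenshtein distance over call sequences; a minimal sketch (the signatures below are hypothetical, and the k-deletion variant described above is left out for brevity):

```python
def levenshtein(a, b):
    """Edit distance between two call signatures (sequences of callee IDs)."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete from a
                           dp[i][j - 1] + 1,         # insert from b
                           dp[i - 1][j - 1] + cost)  # substitute
    return dp[n][m]

# Hypothetical signatures for one function in each binary; the debug
# version makes one extra logging call.
debug_sig   = ["open", "log", "read", "close"]
release_sig = ["open", "read", "close"]
print(levenshtein(debug_sig, release_sig))  # 1
```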
If you can afford it, IDA Pro + BinDiff.