Subroutine inference
Is there any paper describing an algorithm or technique to infer subroutines from a compiled program? In other words: is there an algorithm to find blocks of code that appear more than once in the program? These blocks could have their instructions reordered (without changing program behavior, of course) so that a match is more likely to be found.
This process can be seen as the opposite of the subroutine inlining that compilers do to avoid calls at the cost of a larger binary.
It seems to me that this is a very hard theoretical problem.
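For concreteness, here is the naive starting point I have in mind (a toy sketch of my own, not taken from any paper): hash fixed-length windows of an already-disassembled instruction list, canonicalizing register names so that blocks differing only in register allocation still match. The instruction reordering mentioned above is exactly what this kind of approach does not handle.

```python
# Toy sketch, not from any specific paper: find fixed-length blocks of
# already-disassembled instructions that occur more than once. Instructions
# are plain strings here; register operands are canonicalized per block so
# that two blocks differing only in register names hash to the same key.
# Reordering tolerance is NOT handled; that is the genuinely hard part.
from collections import defaultdict
import re

def canonicalize(block):
    """Rename registers in order of first appearance: r7, r2, r7 -> R0, R1, R0."""
    mapping = {}
    out = []
    for insn in block:
        def rename(m):
            reg = m.group(0)
            mapping.setdefault(reg, f"R{len(mapping)}")
            return mapping[reg]
        out.append(re.sub(r"\br\d+\b", rename, insn))
    return tuple(out)

def repeated_blocks(instructions, window=3):
    """Return {canonical block: [start offsets]} for blocks seen at least twice."""
    seen = defaultdict(list)
    for i in range(len(instructions) - window + 1):
        seen[canonicalize(instructions[i:i + window])].append(i)
    return {blk: offs for blk, offs in seen.items() if len(offs) > 1}

if __name__ == "__main__":
    code = [
        "load r1, [r9]", "add r1, r1, 4", "store [r9], r1",   # same pattern,
        "mul r3, r3, 2",
        "load r5, [r8]", "add r5, r5, 4", "store [r8], r5",   # different registers
    ]
    for block, offsets in repeated_blocks(code).items():
        print(offsets, block)
```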
3 Answers
Well, it's an interesting problem. People have actually worked on this. A quick search returns these two:
Keith D. Cooper, Nathaniel McIntosh: Enhanced Code Compression for Embedded RISC Processors, PLDI 1999.
Christopher W. Fraser, Eugene W. Myers, Alan L. Wendt: Analyzing and Compressing Assembly Code, SIGPLAN Notices, June 1984.
But there are probably many more. You could use Google Scholar to find more recent papers that reference these old ones.
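Both papers belong to the code-compression literature; the transformation they build on is usually called procedural abstraction: once a repeated block has been found, every extra occurrence is replaced by a call to a single outlined copy. A toy sketch of that rewrite step (my own illustration, not the algorithm from either paper) on a made-up textual instruction list:

```python
# Toy illustration of the procedural-abstraction rewrite (not the actual
# algorithm from the papers above): every occurrence of a repeated block is
# replaced by a CALL to one outlined copy, which is appended with a RET.
def abstract_block(program, block, name="sub_0"):
    """Replace each occurrence of `block` with a call and append one copy + ret."""
    out, i, n = [], 0, len(block)
    while i < len(program):
        if program[i:i + n] == block:
            out.append(f"call {name}")
            i += n
        else:
            out.append(program[i])
            i += 1
    out += [f"{name}:"] + block + ["ret"]
    return out

if __name__ == "__main__":
    prog = ["push r1", "load r1, [sp]", "add r1, r1, 1",
            "mul r2, r2, 3",
            "push r1", "load r1, [sp]", "add r1, r1, 1"]
    block = ["push r1", "load r1, [sp]", "add r1, r1, 1"]
    print("\n".join(abstract_block(prog, block)))
```

In a real tool the rewrite only pays off when the block is long enough, and repeated often enough, to outweigh the extra call and ret instructions.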
What you are looking for is called a "clone detector". You can do this on source code, or object code. The key idea is to decide what points of variability you want to accept.
You can read about our CloneDR clone detector, which finds duplicated code by comparing the syntax trees of the source files, finding exact and near-miss matches. It works across many files rather than just within one source file. This is kind of like "common subexpression" detection, but it works on declarations as well as executable code. When the match isn't exact, it can determine parameters for the "subroutine" (abstraction).
See my paper on Clone Detection Using Abstract Syntax Trees for an algorithmic description.
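As a rough illustration of the general subtree-hashing idea behind AST-based clone detection (just a minimal sketch, not the actual CloneDR algorithm), here is the flavor using Python's own ast module: bucket every sufficiently large subtree by a structural hash that ignores identifier and constant values, then report buckets that contain more than one subtree.

```python
# Minimal sketch of the subtree-hashing idea behind AST-based clone detection
# (not the actual CloneDR algorithm): bucket subtrees of a Python module by a
# structural key that ignores names and constants, then report buckets with
# more than one sufficiently large subtree.
import ast

def structural_key(node):
    """Hashable shape of a subtree: node types only, names/constants abstracted."""
    return (type(node).__name__,
            tuple(structural_key(c) for c in ast.iter_child_nodes(node)))

def find_clones(source, min_nodes=8):
    tree = ast.parse(source)
    buckets = {}
    for node in ast.walk(tree):
        size = sum(1 for _ in ast.walk(node))
        if size >= min_nodes:
            buckets.setdefault(structural_key(node), []).append(node)
    return [grp for grp in buckets.values() if len(grp) > 1]

if __name__ == "__main__":
    src = """
def f(a):
    total = 0
    for x in a:
        total += x * 2
    return total

def g(items):
    acc = 0
    for it in items:
        acc += it * 2
    return acc
"""
    for group in find_clones(src):
        print([f"line {n.lineno}: {type(n).__name__}" for n in group])
```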
CloneDR does this for many languages, using language-precise front end parsers.
The site describes how CloneDR works, and compares CloneDR with a number of other clone detection tools.
CloneDR doesn't handle "instruction reordering". Less scalable methods that find duplicates by comparing PDGs can do this. These come pretty close to comparing data flow graphs, which might be good for finding machine instruction code matches.
Maybe this is dumb... but consider "diff". It basically does a restricted version of this.
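For instance, Python's difflib exposes diff's core as a library: given two instruction sequences, SequenceMatcher reports the maximal blocks the two have in common, which is the pairwise version of the repeated-block search asked about above (the instruction strings here are made up).

```python
# "diff" restricted to two instruction sequences: SequenceMatcher reports the
# maximal blocks the two sequences share, i.e. the pairwise core of finding
# repeated code blocks.
import difflib

a = ["push r1", "load r1, [sp]", "add r1, r1, 1", "pop r1", "jmp l1"]
b = ["mov r2, 0", "push r1", "load r1, [sp]", "add r1, r1, 1", "pop r1"]

for m in difflib.SequenceMatcher(a=a, b=b).get_matching_blocks():
    if m.size:
        print(f"a[{m.a}:{m.a + m.size}] == b[{m.b}:{m.b + m.size}]: {a[m.a:m.a + m.size]}")
```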