Instrumenting C/C++ code using LLVM

Posted on 2024-12-06 09:13:44

I just read about the LLVM project and learned that it can be used for static analysis of C/C++ code through Clang, the front end of LLVM. I would like to know whether LLVM can be used to extract all memory accesses (to variables, local as well as global) in the source code.

Is there a built-in library in LLVM that I could use to extract this information?
If not, please suggest how to write functions that do the same (existing source code, references, tutorials, examples...).
My idea is to first convert the source code into LLVM bitcode and then instrument it to do the analysis, but I don't know exactly how to do that.


I tried to figure out which IR I should use for my purpose (Clang's Abstract Syntax Tree (AST) or LLVM's SSA Intermediate Representation (IR)), but couldn't decide which one fits.
Here is what I am trying to do.
Given any C/C++ program (like the one below), I want to insert calls to some function before and after every instruction that reads from or writes to memory. For example, consider the following C++ program (Account.cpp):

#include <stdio.h>

class Account {
  int balance;

public:
  Account(int b) {
    balance = b;
  }

  int read() {
    int r;
    r = balance;
    return r;
  }

  void deposit(int n) {
    balance = balance + n;
  }

  void withdraw(int n) {
    int r = read();
    balance = r - n;
  }
};

int main () {
  Account* a = new Account(10);
  a->deposit(1);
  a->withdraw(2);
  delete a;
}

So after the instrumentation my program should look like:

#include <stdio.h>

class Account {
  int balance;

public:
  Account(int b) {
    balance = b;
  }

  int read() {
    int r;
    foo();
    r = balance;
    foo();
    return r;
  }

  void deposit(int n) {
    foo();
    balance = balance + n;
    foo();
  }

  void withdraw(int n) {
    foo();
    int r = read();
    foo();
    foo();
    balance = r - n;
    foo();
  }
};

int main () {
  Account* a = new Account(10);
  a->deposit(1);
  a->withdraw(2);
  delete a;
}

where foo() may be any function, for example one that gets the current system time or increments a counter. I understand that to insert calls like the ones above I first have to get the IR and then run an instrumentation pass on it that inserts such calls, but I don't really know how to achieve that. Please suggest, with examples, how to go about it.
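For concreteness, here is a minimal sketch of the instrumented program, assuming foo() does nothing but increment a global counter. The names g_probe_calls and run_account_demo are made up for this illustration and are not part of any library.

```cpp
// Hypothetical probe: here it only counts how many times it was called.
static long g_probe_calls = 0;
void foo() { ++g_probe_calls; }

class Account {
  int balance;

public:
  explicit Account(int b) : balance(b) {}

  int read() {
    int r;
    foo();
    r = balance;
    foo();
    return r;
  }

  void deposit(int n) {
    foo();
    balance = balance + n;
    foo();
  }

  void withdraw(int n) {
    foo();
    int r = read();
    foo();
    foo();
    balance = r - n;
    foo();
  }
};

// Same call sequence as main() above; returns the number of probe calls.
long run_account_demo() {
  g_probe_calls = 0;
  Account* a = new Account(10);
  a->deposit(1);
  a->withdraw(2);
  delete a;
  return g_probe_calls;
}
```

With the probes placed as in the listing above, run_account_demo() returns 8: two calls from deposit(), four from withdraw() itself, and two from the nested read().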

I also understand that once the program is compiled to IR, it is really difficult to get a 1:1 mapping between my original program and the instrumented IR. So, is it possible to reflect the changes made in the IR (because of the instrumentation) back into the original program?

To get started with LLVM passes and learn how to write one of my own, I looked at an example of a pass that adds run-time checks to LLVM IR loads and stores: SAFECode's load/store instrumentation pass (http://llvm.org/viewvc/llvm-project/safecode/trunk/include/safecode/LoadStoreChecks.h?view=markup and http://llvm.org/viewvc/llvm-project/safecode/trunk/lib/InsertPoolChecks/LoadStoreChecks.cpp?view=markup). But I couldn't figure out how to run this pass. Please give me the steps to run it on a program such as the Account.cpp above.


Comments (2)

心碎无痕… 2024-12-13 09:13:44

First off, you have to decide whether you want to work with clang or LLVM. They both operate on very different data structures which have advantages and disadvantages.

From your sparse description of your problem, I'll recommend going for optimization passes in LLVM. Working with the IR will make it much easier to sanitize, analyze and inject code because that's what it was designed to do. The downside is that your project will be dependent on LLVM which may or may not be a problem for you. You could output the result using the C backend but that won't be usable by a human.

Another important downside when working with optimization passes is that you also lose all symbols from the original source code. Even if the Value class (more on that later) has a getName method, you should never rely on it to contain anything meaningful. It's meant to help you debug your passes and nothing else.

You will also have to have a basic understanding of compilers. For example, it's a bit of a requirement to know about basic blocks and static single assignment form. Fortunately they're not very difficult concepts to learn or understand (the Wikipedia articles should be adequate).

Before you can start coding, you first have to do some reading, so here are a few links to get you started:

  • Architecture Overview: A quick architectural overview of LLVM. Will give you a good idea of what you're working with and whether LLVM is the right tool for you.

  • Documentation Head: Where you can find all the links below and more. Refer to this if I missed anything.

  • LLVM's IR reference: This is the full description of the LLVM IR which is what you'll be manipulating. The language is relatively simple so there isn't too much to learn.

  • Programmer's manual: A quick overview of basic stuff you'll need to know when working with LLVM.

  • Writing Passes: Everything you need to know to write transformation or analysis passes.

  • LLVM Passes: A comprehensive list of all the passes provided by LLVM that you can and should use. These can really help clean up the code and make it easier to analyze. For example, when working with loops, the lcssa, loop-simplify and indvars passes will save your life.

  • Value Inheritance Tree: This is the doxygen page for the Value class. The important bit here is the inheritance tree that you can follow to get the documentation for all the instructions defined in the IR reference page. Just ignore the ungodly monstrosity that they call the collaboration diagram.

  • Type Inheritance Tree: Same as above but for types.

Once you understand all that then it's cake. To find memory accesses? Search for store and load instructions. To instrument? Just create what you need using the proper subclass of the Value class and insert it before or after the store and load instruction. Because your question is a bit too broad, I can't really help you more than this. (See correction below)
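The search-and-insert idea can be sketched on a toy instruction list. This is not the LLVM API: Op, Inst, accessesMemory, instrument and depositBody are hypothetical names invented for the illustration, and the instruction text is only loosely modeled on LLVM IR. In a real pass the loop would walk the instructions of a function and test each one against the LoadInst and StoreInst classes instead.

```cpp
#include <string>
#include <vector>

// Toy stand-in for an IR instruction; NOT the LLVM API.
enum class Op { Load, Store, Add, Call, Ret };

struct Inst {
  Op op;
  std::string text;  // printable form, for inspection only
};

// True for the opcodes that touch memory in this toy model.
bool accessesMemory(const Inst& i) {
  return i.op == Op::Load || i.op == Op::Store;
}

// Returns a copy of `body` with a call to `probe` placed before and
// after every load and store. Other instructions are copied unchanged.
std::vector<Inst> instrument(const std::vector<Inst>& body,
                             const std::string& probe) {
  std::vector<Inst> out;
  const Inst call{Op::Call, "call void @" + probe + "()"};
  for (const Inst& i : body) {
    if (accessesMemory(i)) {
      out.push_back(call);
      out.push_back(i);
      out.push_back(call);
    } else {
      out.push_back(i);
    }
  }
  return out;
}

// Toy body for Account::deposit: load balance, add n, store balance.
std::vector<Inst> depositBody() {
  return {
      {Op::Load, "%0 = load i32, ptr %balance"},
      {Op::Add, "%1 = add i32 %0, %n"},
      {Op::Store, "store i32 %1, ptr %balance"},
      {Op::Ret, "ret void"},
  };
}
```

Calling instrument(depositBody(), "foo") yields eight instructions. Note that the single source statement balance = balance + n becomes one load and one store at this level, so it receives four probe calls rather than the two shown in the question's source-level listing.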

By the way, I had to do something similar a few weeks ago. In about 2-3 weeks I was able to learn all I needed about LLVM, create an analysis pass to find memory accesses (and more) within a loop, and instrument them with a transformation pass I created. There were no fancy algorithms involved (except the ones provided by LLVM) and everything was pretty straightforward. The moral of the story is that LLVM is easy to learn and work with.


Correction: I made an error when I said that all you have to do is search for load and store instructions.

The load and store instructions only give you accesses that are made through pointers, such as heap accesses. To get all memory accesses you also have to look at the values that can represent a memory location on the stack. Whether such a value ends up on the stack or in a register is decided during register allocation, which happens in the backend. That makes it platform dependent, so it shouldn't be relied on.

Now unless you provide more information about what kind of memory accesses you're looking for, and in what context and how you intend to instrument them, I can't help you much more than this.

赠佳期 2024-12-13 09:13:44

Since there is no answer to your question after two days, I will offer this one, which is slightly but not completely off-topic.

As an alternative to LLVM, for static analysis of C programs, you may consider writing a Frama-C plug-in.

The existing plug-in that computes a list of inputs for a C function needs to visit every lvalue in the function's body. This is implemented in file src/inout/inputs.ml. The implementation is short (the complexity is in other plug-ins that provide their results to this one, e.g. resolving pointers) and can be used as a skeleton for your own plug-in.

A visitor for the Abstract Syntax Tree is provided by the framework. In order to do something special for lvalues, you simply define the corresponding method. The heart of the inputs plug-in is the method definition:

method vlval lv = ...

Here is an example of what the inputs plug-in does:

int a, b, c, d, *p;

main(){
  p = &a;
  b = c + *p;
}

The inputs of main() are computed thus:

$ frama-c -input t.c
...
[inout] Inputs for function main:
          a; c; p;
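To make the visitor idea concrete outside of Frama-C, here is a toy C++ sketch that reproduces the result above. It is an assumption-laden illustration, not Frama-C's implementation: Expr, Assign, visitReads, inputsOf and exampleBody are hypothetical names, and the pointer resolution is reduced to remembering assignments of the form p = &x.

```cpp
#include <map>
#include <memory>
#include <set>
#include <string>
#include <vector>

// Toy expression AST; unrelated to Frama-C's real data types.
struct Expr {
  enum Kind { Var, AddrOf, Deref, Add } kind;
  std::string name;                // used by Var and AddrOf
  std::shared_ptr<Expr> lhs, rhs;  // Deref uses lhs; Add uses both
};
using ExprPtr = std::shared_ptr<Expr>;
using PointsTo = std::map<std::string, std::set<std::string>>;

ExprPtr var(const std::string& n) {
  return std::make_shared<Expr>(Expr{Expr::Var, n, nullptr, nullptr});
}
ExprPtr addrOf(const std::string& n) {
  return std::make_shared<Expr>(Expr{Expr::AddrOf, n, nullptr, nullptr});
}
ExprPtr deref(ExprPtr p) {
  return std::make_shared<Expr>(Expr{Expr::Deref, "", p, nullptr});
}
ExprPtr add(ExprPtr a, ExprPtr b) {
  return std::make_shared<Expr>(Expr{Expr::Add, "", a, b});
}

// Visits every lvalue that is read in `e` and records it in `inputs`.
void visitReads(const Expr& e, const PointsTo& pts,
                std::set<std::string>& inputs) {
  switch (e.kind) {
    case Expr::Var:
      inputs.insert(e.name);
      break;
    case Expr::AddrOf:
      break;  // taking an address reads nothing
    case Expr::Deref:
      visitReads(*e.lhs, pts, inputs);  // the pointer itself is read
      if (e.lhs->kind == Expr::Var) {
        auto it = pts.find(e.lhs->name);
        if (it != pts.end())
          inputs.insert(it->second.begin(), it->second.end());
      }
      break;
    case Expr::Add:
      visitReads(*e.lhs, pts, inputs);
      visitReads(*e.rhs, pts, inputs);
      break;
  }
}

struct Assign {
  std::string target;
  ExprPtr value;
};

// Union of all locations read by a straight-line list of assignments.
std::set<std::string> inputsOf(const std::vector<Assign>& body) {
  PointsTo pts;
  std::set<std::string> inputs;
  for (const Assign& s : body) {
    visitReads(*s.value, pts, inputs);
    if (s.value->kind == Expr::AddrOf) pts[s.target] = {s.value->name};
  }
  return inputs;
}

// The body of main() from the example: p = &a; b = c + *p;
std::vector<Assign> exampleBody() {
  return {
      {"p", addrOf("a")},
      {"b", add(var("c"), deref(var("p")))},
  };
}
```

For exampleBody(), inputsOf returns the set {a, c, p}, matching the listing above. The real plug-in relies on other plug-ins for pointer resolution, which this sketch replaces with the small PointsTo map.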

More information about writing Frama-C plug-ins in general can be found here.
