Emitting LLVM bytecode from clang: 'byval' attribute for passing objects with a nontrivial destructor into a function

Posted 2024-11-18 05:30:05

I have C++ source code that I parse with clang, producing LLVM bytecode. From this point on I want to process the file myself...
However, I encountered a problem. Consider the following scenario:
- I create a class with a nontrivial destructor or copy constructor.
- I define a function, where an object of this class is passed as a parameter, by value (no reference or pointer).

In the produced bytecode, I get a pointer instead. For classes without a destructor, the parameter is annotated as 'byval', but that is not the case here.
As a result, I cannot tell whether the parameter is passed by value or really by pointer.
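For context, a minimal sketch of why the two cases differ. On x86_64 Linux clang follows the Itanium C++ ABI, under which a class with a non-trivial copy constructor, move constructor, or destructor is passed by an invisible reference: the caller creates a temporary and passes its address, so the IR shows a plain pointer. The helper name `passedByHiddenReference` and the struct `Plain` are made up for this illustration, and the check only approximates the ABI rule (it ignores deleted special members and attributes such as `[[clang::trivial_abi]]`).

```cpp
#include <type_traits>

// Same class as in the question: the user-provided destructor makes it
// non-trivial for the purposes of calls.
class C {
public:
  int x;
  ~C() {}
};

// A plain aggregate whose special member functions are all trivial.
struct Plain {
  int x;
};

// Hypothetical helper: true when some copy/move constructor or the
// destructor is non-trivial, which is roughly when the Itanium C++ ABI
// passes the object by invisible reference.
template <typename T>
constexpr bool passedByHiddenReference() {
  return !(std::is_trivially_copy_constructible<T>::value &&
           std::is_trivially_move_constructible<T>::value &&
           std::is_trivially_destructible<T>::value);
}

static_assert(passedByHiddenReference<C>(),
              "C has a user-provided destructor");
static_assert(!passedByHiddenReference<Plain>(),
              "Plain has only trivial special members");
```

This runs at the source level only; it does not recover the information from a .bc/.ll file, which is the actual problem described here.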

Consider the following example:

Input file - cpass.cpp:

class C {
  public:
  int x;
  ~C() {}
};

void set(C val, int x) {val.x=x;};

void set(C *ptr, int x) {ptr->x=x;}

Compilation command line:

clang++ -c cpass.cpp -emit-llvm -o cpass.bc; llvm-dis cpass.bc

Produced output file (cpass.ll):

; ModuleID = 'cpass.bc'
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64"
target triple = "x86_64-unknown-linux-gnu"

%class.C = type { i32 }

define void @_Z3set1Ci(%class.C* %val, i32 %x) nounwind {
  %1 = alloca i32, align 4
  store i32 %x, i32* %1, align 4
  %2 = load i32* %1, align 4
  %3 = getelementptr inbounds %class.C* %val, i32 0, i32 0
  store i32 %2, i32* %3, align 4
  ret void
}

define void @_Z3setP1Ci(%class.C* %ptr, i32 %x) nounwind {
  %1 = alloca %class.C*, align 8
  %2 = alloca i32, align 4
  store %class.C* %ptr, %class.C** %1, align 8
  store i32 %x, i32* %2, align 4
  %3 = load i32* %2, align 4
  %4 = load %class.C** %1, align 8
  %5 = getelementptr inbounds %class.C* %4, i32 0, i32 0
  store i32 %3, i32* %5, align 4
  ret void
}

As you can see, the parameters of both set functions look exactly the same. So how can I tell that the first function was meant to take the parameter by value, instead of a pointer?

One solution could be to somehow parse the mangled function name, but that may not always be viable. What if somebody puts extern "C" before the function?
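To illustrate the name-parsing idea, here is a minimal sketch that uses `abi::__cxa_demangle` from `<cxxabi.h>` (available with libstdc++ and libc++abi). The wrapper name `demangle` is made up for this example, and the sketch assumes a toolchain that uses Itanium-style mangling.

```cpp
#include <cassert>
#include <cstdlib>
#include <cxxabi.h>
#include <string>

// Hypothetical helper: demangles an Itanium-ABI symbol name and returns
// an empty string when demangling fails.
std::string demangle(const char *mangled) {
  assert(mangled != nullptr);
  int status = 0;
  // __cxa_demangle allocates the result with malloc; the caller frees it.
  char *text = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
  if (status != 0 || text == nullptr)
    return std::string();
  std::string result(text);
  std::free(text);
  return result;
}
```

With the two symbols from cpass.ll, `demangle("_Z3set1Ci")` gives `set(C, int)` and `demangle("_Z3setP1Ci")` gives `set(C*, int)`, so the by-value parameter can be told apart. An extern "C" function has no encoded parameter types at all, so this approach gives nothing there.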

Is there a way to tell clang to keep the byval annotation, or to produce an extra annotation for each function parameter passed by value?

Anton Korobeynikov suggests that I should dig into clang's LLVM IR emission. Unfortunately, I know almost nothing about clang internals, and the documentation is rather sparse. The clang Internals Manual does not cover IR emission. So I don't really know where to start or where to look to get the problem solved, hopefully without going through the entire clang source code. Any pointers? Hints? Further reading?

In response to Anton Korobeynikov:

I know more or less what the C++ ABI looks like with respect to parameter passing. I found some good reading here: http://agner.org./optimize/calling_conventions.pdf. But this is very platform dependent! This approach might not be feasible on different architectures or in some special circumstances.

In my case, for example, the function is going to be run on a different device than where it is being called from. The two devices don't share memory, so they don't even share the stack. Unless the user is passing a pointer (in which case we assume he knows what he is doing), an object should always be passed within the function-parameters message. If it has a nontrivial copy constructor, it should be executed by the caller, but the object should be created in the parameter area as well.
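The intended scheme can be sketched roughly as follows. Everything here is hypothetical: `Tracked` stands in for a class with a nontrivial copy constructor, `ParamMessage` for the function-parameters message, and `packArguments` / `runSet` for the caller and callee sides. The actual transfer between the two devices is left out, so both sides run in one process.

```cpp
#include <cassert>
#include <cstdint>
#include <new>

// Hypothetical argument type with a non-trivial copy constructor.
struct Tracked {
  int x;
  static int copies;  // counts copy-constructor calls
  explicit Tracked(int v) : x(v) {}
  Tracked(const Tracked &other) : x(other.x) { ++copies; }
  ~Tracked() {}
};
int Tracked::copies = 0;

// Hypothetical stand-in for the function-parameters message.
struct ParamMessage {
  alignas(Tracked) unsigned char storage[sizeof(Tracked)];
  int x;
};

// Caller side: run the copy constructor, building the copy directly
// inside the parameter area.
void packArguments(ParamMessage &msg, const Tracked &val, int x) {
  assert(reinterpret_cast<std::uintptr_t>(msg.storage) % alignof(Tracked) == 0);
  ::new (static_cast<void *>(msg.storage)) Tracked(val);
  msg.x = x;
}

// Callee side: work on the object that lives in the parameter area,
// then destroy it. Returns the stored value for demonstration only.
int runSet(ParamMessage &msg) {
  Tracked *val = reinterpret_cast<Tracked *>(msg.storage);
  val->x = msg.x;  // corresponds to the body of set(C val, int x)
  int result = val->x;
  val->~Tracked();
  return result;
}
```

The copy constructor runs exactly once, on the caller side, and the caller's original object is left untouched.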

So, what I would like to do is somehow override the ABI in clang without intruding too much into its source code. Or maybe add some additional annotation that would be ignored in a normal compilation pipeline but that I could detect when parsing the .bc/.ll file. Or reconstruct the function signature in some other way.
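One possible shape for such an annotation, as a sketch only: clang has a generic `annotate` attribute, and to my understanding an annotated parameter shows up in the IR as a call to the `llvm.var.annotation` intrinsic inside the function body, which is worth confirming in the emitted .ll. The macro name `BY_VALUE` and the tag string "pass_by_value" are made up for this example, and the approach requires marking parameters in the source rather than happening automatically.

```cpp
// Hypothetical marker macro; "pass_by_value" is an arbitrary tag string.
// Other compilers do not know the attribute, so it expands to nothing there.
#if defined(__clang__)
#define BY_VALUE __attribute__((annotate("pass_by_value")))
#else
#define BY_VALUE
#endif

class C {
public:
  int x;
  ~C() {}
};

// By-value overload with the marker; the return value exists only so
// the effect on the local copy can be observed.
int set(C val BY_VALUE, int x) {
  val.x = x;
  return val.x;
}

// Pointer overload, left unmarked.
void set(C *ptr, int x) { ptr->x = x; }
```

A tool reading the .ll file could then look for the tag string next to the first `set` and treat the pointer parameter as a by-value object.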
