有什么办法可以实现 C/C++编译器可以内联 C 回调函数吗？

发布于 2024-10-21 21:00:04 字数 743 浏览 7 评论 0原文

给定一个采用 C-Functionpointer 作为回调的典型函数，例如 C-Stdlib qsort()，任何编译器都可以使用内联来优化代码吗？我认为不可以，这样正确吗？

int cmp(void* pa, void* pb) { /*...*/ }
int func() {
  int vec[1000];
  qsort(vec, 1000, sizeof(int), &cmp);
}

好的，qsort() 是来自外部库的函数，但我什至不认为 LTO 会有所帮助，对吗？

但是，如果我在同一个编译单元中定义了 my_qsort() ，那么编译器可以内联吗？

int cmp(void* pa, void* pb) { /*...*/ }
void my_qsort(int* vec, int n, int sz, (void*)(void*,void*)) { /* ... */ }
int func() {
  int vec[1000];
  my_qsort(vec, 1000, sizeof(int), &cmp);
}

这有什么区别吗？我认为使用C函数指针作为回调是阻止编译器内联的因素。正确的？

（我只是想确保我理解为什么我应该在 C++ 中使用函子）

原文

Given a typical function that takes a C-Functionpointer as a callback like C-Stdlib qsort(), can any compiler optimize the code using inlining? I think it can not, is this correct?

int cmp(void* pa, void* pb) { /*...*/ }
int func() {
  int vec[1000];
  qsort(vec, 1000, sizeof(int), &cmp);
}

Ok, qsort() is a function from an external library, but I don't think even LTO would help here, right?

But what if I have my_qsort() defined in the same compilation unit, would then inlining be possible for the compiler?

int cmp(void* pa, void* pb) { /*...*/ }
void my_qsort(int* vec, int n, int sz, (void*)(void*,void*)) { /* ... */ }
int func() {
  int vec[1000];
  my_qsort(vec, 1000, sizeof(int), &cmp);
}

Does that make any difference? I think the use of the C-function pointer as a callback is the factor that prevents the compiler from inlining. Correct?

(I just want to make sure I understand why I should use Functors in C++)

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

棒棒糖 2024-10-28 21:00:04

不，这是不可能的，至少对于传统工具链的工作方式来说是不可能的。传统的操作顺序是先完成所有编译，然后完成链接。

要生成内联比较函数，编译器首先必须为 qsort 本身内联生成代码（因为 qsort 的每个实例通常会使用不同的比较函数）。然而，对于像 qsort 这样的东西，它通常在您开始考虑编写代码之前就已经编译并放置在标准库中。编译代码时，qsort 只能作为目标文件使用。

因此，为了有机会做这样的事情，您需要将内联功能构建到链接器而不是编译器中。至少在理论上这是可能的，但它绝对不是微不足道的——至少在我看来，它几乎肯定比使用源代码更困难。它还需要在链接器中复制相当多的类似编译器的功能，并且可能需要在目标文件中添加相当多的额外信息，以便为链接器提供足够的信息来处理它甚至可以尝试做这项工作。

编辑：也许我应该更详细地讨论，以免评论链变成一场关于措辞的全面争论。

传统上，链接器本质上是一种相当简单的野兽。它从一个目标文件开始，该文件可分为四个主要部分：

要从目标文件复制（未更改，除非有明确指示）到正在生成的可执行文件的位的集合。
目标文件包含的符号列表。
目标文件未提供的符号列表。
需要写入地址的修复列表。

然后，链接器开始匹配在一个文件中导出并在另一个文件中使用的符号。然后，它在库中的目标文件中查找以解析更多符号。每当它添加文件时，它也会添加所需符号的列表，并递归搜索可以满足这些要求的其他目标文件。

当它找到提供所有符号的目标文件时，它将每个符号的位集合复制到输出文件中，并且在修复记录告诉它的地方，它写入分配给特定符号的相对地址（例如，您在哪里当调用 printf 时，它会找出可执行文件中复制构成 printf 的位的位置，并用该地址填充您的调用）。在最近的情况下，它不是从库中复制位，而是可以将对共享对象/DLL 的引用嵌入到可执行文件中，并将其留给加载程序在运行时实际查找/加载该文件，以提供实际代码一个符号。

然而，特别是，链接器传统上不知道它正在复制的位块的实际内容。例如，您可以相当合理地使用完全相同的链接器来处理多个不同处理器中任何一个的代码。只要它们都使用相同的对象和可执行文件格式，就可以了。

链接时间优化确实至少在某种程度上改变了这一点。显然，为了优化代码，我们需要某种额外的智能，这种智能在传统上被认为是链接时发生。有（至少）两种方法可以做到这一点：

在链接器中构建相当多的额外智能，
将智能保留在编译器中，并让链接器调用它来进行优化。

两者都有例子——LLVM（一个明显的例子）几乎采用了前者。前端编译器发出 LLVM 代码，LLVM 投入大量智能/工作来将其转换为优化的可执行文件。 gcc with GIMPLE 采用后一种方式：GIMPLE 记录基本上为链接器提供了足够的信息，使其可以将多个目标文件中的位反馈给编译器，让编译器优化它们，然后将结果反馈给链接器实际上复制到可执行文件中。

我想你可能会提出某种哲学观点，说这两者基本上是等价的——但我有点怀疑任何实现了这两者的人都会同意。

现在，确实（可能是，无论如何）其中任何一个都足以实现手头的优化。就我个人而言，我怀疑是否有人为了优化而实施这种优化。当您认真思考时，qsort 和 bsearch 几乎是它通常会应用的唯一两个相当常见的函数。对于大多数实际目的，这意味着您将专门为了 qsort 而实施优化。

另一方面，如果所涉及的工具包括生成内联函数和链接时间优化的能力，那么我认为至少有合理的机会最终会发生这种特定类型的优化，或多或少是偶然的——两者结合在一起的效果。

至少在理论上，这意味着它可能会发生。不过，还有一个问题需要考虑：完全独立于当前的优化，许多编译器不会为递归函数生成内联代码。即使尝试这样做，编译器也必须首先将递归函数转换为迭代形式。这在尾递归的情况下相当常见——但快速排序不是尾递归。几乎唯一的替代方案是 qsort 的实现，它一开始就不是递归的。这当然是可能的，但同样肯定是相当不寻常的。

因此，即使工具链可以支持回调的内联生成，在qsort的情况下也可能不会（我承认，这是我亲自测试过的唯一案例）。然而，无论好坏，qsort 几乎是此类函数中唯一一个足够常见且非常重要的函数。

No, it's not possible, at least with the way a traditional tool-chain works. The traditional order of operations is that all compilation is done, then linking is done.

To generate your comparison function inline, the compiler would first have to generate the code for qsort itself inline (since each instance of qsort will usually use a different comparison function). In the case of something like qsort, however, it's typically been compiled and placed in the standard library before you ever even start to think about writing your code. When you compile your code, qsort is only available as an object file.

As such, to even stand a chance of doing anything like this, you need to build the inlining capability into the linker rather than the compiler. At least in theory that's possible, but it's decidedly non-trivial -- at least in my estimation, it's almost certainly more difficult than when working with source code. It also requires duplicating quite a bit of compiler-like functionality in the linker, and probably requires adding a fair amount of extra information into the object file to give the linker enough information to work with that it can even try to do the job.

Edit: perhaps I should go into more detail, lest the comment chain turn into a full-fledged argument over little more than wording.

Traditionally, a linker is fundamentally a fairly simple sort of beast. It starts from an object file that can be divided into four primary things:

A collection of bits that are to be copied (unchanged, except as specifically directed) from the object file to the executable being produced.
A list of symbols that the object file contains.
A list of symbols used by not supplied by the object file.
A list of fixups where addresses need to be written.

the linker then starts matching up the symbols exported in one file and used in another. It then looks in the object files in the library (or libraries) to resolve more symbols. Any time it adds in a file, it also adds its list of needed symbols, and searches recursively for other object files that can satisfy those.

When it has found object files that supply all the symbols, it copies the collection of bits part of each into the output file, and where the fixup records tell it to, it writes the relative addresses assigned to specific symbols (e.g., where you've called printf, it figures out where in the executable file it copied the bits that make up printf, and fills in your call with that address). In reasonably recent cases, rather than copying bits from the library it can embed a reference to a shared object/DLL into the executable, and leave it to the loader to actually find/load that file at run-time to supply the actual code for a symbol.

In particular, however, the linker is traditionally oblivious to the actual content of the blocks of bits it's copying. You can (for example) quite reasonably use exactly the same linker to deal with code for any of a number of different processors. As long as they all use the same object and executable file formats, it's fine.

Link time optimization does change that to at least some degree. Clearly, to optimize the code, we need some sort of extra intelligence that happens at what was traditionally considered link time. There are (at least) two ways to do that:

build quite a bit of extra intelligence into the linker
keep the intelligence in the compiler, and have the linker invoke it to do the optimization.

There are examples of both of those -- LLVM (for one obvious example) takes pretty much the former. The front-end compiler emits LLVM codes, and LLVM puts a lot of intelligence/work into translating that to an optimized executable. gcc with GIMPLE takes the latter route: the GIMPLE records basically give the linker enough information that it can feed the bits in a number of object files back to the compiler, have the compiler optimize them, and then feed the result back to the linker to actually copy into the executable.

I suppose you can probably come up with some sort of philosophical viewpoint that says these two are basically equivalent -- but I kind of doubt that anybody who'd implemented both would agree.

Now, it is true (probably, anyway) that either of those would suffice to implement the optimization at hand. Personally, I doubt that anybody implements this optimization for its own sake though. When you get down to it, qsort and bsearch are almost the only two reasonably common functions to which it would/will normally apply. For most practical purposes, that means you'd be implementing the optimization exclusively for the sake of qsort.

On the other hand, if the tools involved include the ability to produce inline functions and link time optimization, then I suppose there's at least a reasonable chance that you could end up with this particular type of optimization happening as a more or less accidental side-effect of the two coming together.

At least in theory, that means it could happen. There's one more wrinkle to take into account though: completely independent of the optimization at hand, many compilers will not generate inline code for a recursive function. To even attempt to, the compiler has to first convert the recursive function to an iterative form. That's fairly common in the case of tail recursion -- but Quick sort is not tail recursive. Nearly the only alternative is an implementation of qsort that isn't recursive to start with. That's certainly possible, but just as certainly rather unusual.

As such, even when/if the toolchain could support inline generation of a callback, it probably won't in the case of qsort (which, I'll admit, is the only case I've personally tested). For better or worse, however, qsort is nearly the only function of this sort that's common enough for it to matter much either.

回复收藏 0 原文