How much overhead is there in calling a function in C++?
A lot of literature talks about using inline functions to "avoid the overhead of a function call". However, I haven't seen quantifiable data. What is the actual overhead of a function call, i.e. what sort of performance increase do we achieve by inlining functions?
Depending on how you structure your code, its division into units such as modules and libraries can matter profoundly in some cases.
That is why qsort from the C standard library can be about an order of magnitude (10 times) slower than STL code when the comparison operation is as simple as an integer comparison: qsort calls the comparator through a function pointer, which generally cannot be inlined.
The same penalty will most likely affect C++ virtual functions, as well as other functions whose code is defined in separate modules.
The good news is that whole-program optimization might resolve the issue for dependencies between static libraries and modules.
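As a rough illustration of the qsort versus STL point, the sketch below sorts the same data both ways. The function names (compare_ints, sort_with_qsort, sort_with_std_sort) are hypothetical and chosen only for this example; it shows where the indirect call sits, not how large the difference is on any given machine.

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// Comparator for qsort. It is passed as a function pointer, so a
// precompiled qsort generally cannot inline it.
int compare_ints(const void* a, const void* b) {
    const int lhs = *static_cast<const int*>(a);
    const int rhs = *static_cast<const int*>(b);
    return (lhs > rhs) - (lhs < rhs);
}

// Sorts a copy of the input with the C library routine.
std::vector<int> sort_with_qsort(std::vector<int> v) {
    if (!v.empty()) {
        std::qsort(v.data(), v.size(), sizeof(int), compare_ints);
    }
    return v;
}

// Sorts a copy of the input with std::sort. The comparison is part of the
// template instantiation, so the optimizer can see and inline it.
std::vector<int> sort_with_std_sort(std::vector<int> v) {
    std::sort(v.begin(), v.end());
    return v;
}
```

Both functions return the same sorted vector; any speed difference would have to be measured on the target compiler and hardware.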
On most architectures, the cost consists of saving all (or some, or none) of the registers to the stack, pushing the function arguments to the stack (or putting them in registers), adjusting the stack pointer and jumping to the beginning of the new code. Then, when the function is done, you have to restore the registers from the stack. This webpage has a description of what's involved in the various calling conventions.
Most C++ compilers are now smart enough to inline functions for you. The inline keyword is just a hint to the compiler. Some will even inline across translation units where they decide it's helpful.
I made a simple benchmark against a simple increment function:
inc.c:
main.c
Running it with a billion iterations on my Intel(R) Core(TM) i5 CPU M 430 @ 2.27GHz gave me:
(It appears to fluctuate by up to 0.2, but I'm too lazy to calculate proper standard deviations, nor do I care for them.)
This suggests that the overhead of a function call on this computer is about 3 nanoseconds.
The fastest operation I measured took about 0.3 ns, so that would suggest a function call costs about 9 primitive ops, to put it very simplistically.
This overhead increases by about another 2 ns per call (total call time about 6 ns) for functions called through a PLT (functions in a shared library).
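A benchmark of this general shape can be sketched in a single C++ file. This is not the author's inc.c/main.c, only an assumed stand-in: the names inc_call, Timing, time_with_calls and time_inlined are hypothetical, and the noinline attribute is GCC/Clang-specific, used here in place of a separate source file. The accumulators are volatile so that the optimizer does not collapse either loop into a single addition.

```cpp
#include <chrono>
#include <cstdint>

// GCC/Clang-specific attribute (an assumption about the toolchain): prevents
// inlining, standing in for a function defined in a separate file.
__attribute__((noinline)) std::uint64_t inc_call(std::uint64_t x) {
    return x + 1;
}

struct Timing {
    std::uint64_t result;  // final counter value, to check both loops agree
    double seconds;        // wall-clock time for the whole loop
};

// Increments through a real function call on every iteration.
Timing time_with_calls(std::uint64_t iterations) {
    volatile std::uint64_t x = 0;
    const auto start = std::chrono::steady_clock::now();
    for (std::uint64_t i = 0; i < iterations; ++i) {
        x = inc_call(x);
    }
    const auto stop = std::chrono::steady_clock::now();
    const std::uint64_t result = x;
    return Timing{result, std::chrono::duration<double>(stop - start).count()};
}

// Same loop with the increment written directly in the loop body.
Timing time_inlined(std::uint64_t iterations) {
    volatile std::uint64_t x = 0;
    const auto start = std::chrono::steady_clock::now();
    for (std::uint64_t i = 0; i < iterations; ++i) {
        x = x + 1;
    }
    const auto stop = std::chrono::steady_clock::now();
    const std::uint64_t result = x;
    return Timing{result, std::chrono::duration<double>(stop - start).count()};
}
```

Dividing the difference between the two seconds values by the iteration count gives a per-call estimate. The figure will vary with compiler flags, CPU and system load, so it should be read as indicative only.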
There's the technical answer and the practical answer. The practical answer is that it will almost never matter, and in the very rare case it does, the only way you'll know is through actual profiled tests.
The technical answer, which your literature refers to, is generally not relevant due to compiler optimizations. But if you're still interested, it is well described by Josh.
As far as a "percentage" goes, you'd have to know how expensive the function itself is. Outside of the cost of the called function there is no percentage, because you are comparing to a zero-cost operation. For inlined code there is no call cost; the processor just moves to the next instruction. The downside to inlining is a larger code size, which manifests its costs in a different way than the stack construction/teardown costs.
Your question is one of those that has no answer one could call the "absolute truth". The overhead of a normal function call depends on three factors:
The CPU. The overhead of x86, PPC, and ARM CPUs varies a lot and even if you just stay with one architecture, the overhead also varies quite a bit between an Intel Pentium 4, Intel Core 2 Duo and an Intel Core i7. The overhead might even vary noticeably between an Intel and an AMD CPU, even if both run at the same clock speed, since factors like cache sizes, caching algorithms, memory access patterns and the actual hardware implementation of the call opcode itself can have a huge influence on the overhead.
The ABI (Application Binary Interface). Even with the same CPU, there often exist different ABIs that specify how function calls pass parameters (via registers, via stack, or via a combination of both) and where and how stack frame initialization and clean-up takes place. All this has an influence on the overhead. Different operating systems may use different ABIs for the same CPU; e.g. Linux, Windows and Solaris may all three use a different ABI for the same CPU.
The Compiler. Strictly following the ABI is only important if functions are called between independent code units, e.g. if an application calls a function of a system library or a user library calls a function of another user library. As long as functions are "private", not visible outside a certain library or binary, the compiler may "cheat". It may not strictly follow the ABI but instead use shortcuts that lead to faster function calls. E.g. it may pass parameters in registers instead of using the stack, or it may skip stack frame setup and clean-up completely if not really necessary.
If you want to know the overhead for a specific combination of the three factors above, e.g. for an Intel Core i5 on Linux using GCC, your only way to get this information is to benchmark the difference between two implementations, one using function calls and one where you copy the code directly into the caller; this way you force inlining for sure, since the inline keyword is only a hint and does not always lead to inlining.
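Besides copying code by hand, most compilers offer non-standard ways to force or forbid inlining, which can make such a comparison easier to set up. The sketch below assumes GCC/Clang attributes or their MSVC counterparts; the macro names FORCE_INLINE and NO_INLINE and the functions are hypothetical.

```cpp
// Compiler-specific spellings (assumptions about the toolchain in use).
#if defined(_MSC_VER)
#  define FORCE_INLINE __forceinline
#  define NO_INLINE __declspec(noinline)
#else
#  define FORCE_INLINE inline __attribute__((always_inline))
#  define NO_INLINE __attribute__((noinline))
#endif

// Same body twice: one version asks to always be inlined,
// the other asks to always be a real call.
FORCE_INLINE int square_inlined(int x) { return x * x; }
NO_INLINE int square_called(int x) { return x * x; }

// Sum of squares 1..n using the inlined helper.
long long sum_squares_inlined(int n) {
    long long total = 0;
    for (int i = 1; i <= n; ++i) total += square_inlined(i);
    return total;
}

// Sum of squares 1..n using the out-of-line helper.
long long sum_squares_called(int n) {
    long long total = 0;
    for (int i = 1; i <= n; ++i) total += square_called(i);
    return total;
}
```

Timing the two sum functions against each other for a large n gives one data point for that specific CPU, ABI and compiler combination.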
However, the real question here is: does the exact overhead really matter? One thing is for sure: a function call that is not inlined always has an overhead. It may be small, it may be big, but it certainly exists. And no matter how small it is, if a function is called often enough in a performance-critical section, the overhead will matter to some degree. Inlining rarely makes your code slower, unless you terribly overdo it; it will make the code bigger, though. Today's compilers are pretty good at deciding for themselves when to inline and when not to, so you hardly ever have to rack your brain about it.
Personally, I ignore inlining completely during development until I have a more or less usable product that I can profile. Only if profiling tells me that a certain function is called really often, and within a performance-critical section of the application, will I consider "force-inlining" that function.
So far my answer is very generic; it applies to C as much as it applies to C++ and Objective-C. As a closing word, let me say something about C++ in particular: virtual methods are double indirect function calls (the vtable pointer is loaded from the object, then the function address is loaded from the vtable), which means they have a higher call overhead than normal function calls, and they generally cannot be inlined unless the compiler can prove the object's dynamic type. Non-virtual methods might be inlined by the compiler or not, but even if they are not inlined, they are still typically faster than virtual ones, so you should not make methods virtual unless you really plan to override them or have them overridden.
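To make the virtual versus non-virtual distinction concrete, here is a minimal sketch with hypothetical types (Counter, DoublingCounter) and helper functions. It shows where the dynamic dispatch happens and makes no claim about how large the cost is.

```cpp
struct Counter {
    int value = 0;
    virtual ~Counter() = default;

    // Non-virtual: the call target is known at compile time.
    void add_direct(int n) { value += n; }

    // Virtual: the call target is looked up through the vtable at run time.
    virtual void add_virtual(int n) { value += n; }
};

struct DoublingCounter : Counter {
    void add_virtual(int n) override { value += 2 * n; }
};

// Always runs Counter::add_direct, whatever the dynamic type of c is.
int run_direct(Counter& c, int iterations) {
    for (int i = 0; i < iterations; ++i) c.add_direct(1);
    return c.value;
}

// Runs whichever override belongs to the dynamic type of c.
int run_virtual(Counter& c, int iterations) {
    for (int i = 0; i < iterations; ++i) c.add_virtual(1);
    return c.value;
}
```

Passing a DoublingCounter to run_virtual adds 2 per iteration, while run_direct always adds 1, because only the virtual call depends on the object's dynamic type.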
The amount of overhead will depend on the compiler, CPU, etc. The percentage overhead will depend on the code you're inlining. The only way to know is to take your code and profile it both ways - that's why there's no definitive answer.
For very small functions inlining makes sense, because the (small) cost of the function call is significant relative to the (very small) cost of the function body. For most functions over a few lines it's not a big win.
It's worth pointing out that an inlined function increases the size of the calling function, and anything that increases the size of a function may have a negative effect on caching. If you're right at a boundary, "just one more wafer thin mint" of inlined code might have a dramatically negative effect on performance.
If you're reading literature that's warning about "the cost of a function call," I'd suggest it may be older material that doesn't reflect modern processors. Unless you're in the embedded world, the era in which C is a "portable assembly language" has essentially passed. A large amount of the ingenuity of the chip designers in the past decade (say) has gone into all sorts of low-level complexities that can differ radically from the way things worked "back in the day."
There is a great concept called 'register shadowing', which allows values (up to 6?) to be passed through registers (on the CPU) instead of the stack (memory). Also, depending on the function and the variables used within it, the compiler may just decide that frame management code is not required!
Also, even a C++ compiler may do a 'tail recursion optimization' (more generally, a tail call optimization), i.e. if A() calls B(), and after calling B(), A just returns, the compiler can reuse the stack frame!
Of course, this can all be done only if the program sticks to the semantics of the standard (see pointer aliasing and its effect on optimizations).
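The A()/B() situation can be sketched as follows. The functions are hypothetical, and whether the compiler actually reuses the frame depends on the optimization level; the C++ standard does not guarantee it.

```cpp
// B: does some work and returns a value.
int double_it(int x) { return x * 2; }

// A: the call to double_it is the last thing it does, so its own stack
// frame is no longer needed once the call starts (a tail call).
int add_one_then_double(int x) { return double_it(x + 1); }

// Tail-recursive sum of 1..n with an accumulator. The recursive call is in
// tail position, so an optimizer may turn it into a loop.
long long sum_to(long long n, long long acc) {
    if (n <= 0) return acc;
    return sum_to(n - 1, acc + n);
}
```

With optimizations off, sum_to uses one stack frame per level, so very large n could overflow the stack; with optimizations on, GCC and Clang will usually compile it as a loop.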
Modern CPUs are very fast (obviously!). Almost every operation involved in calls and argument passing is a full-speed instruction (indirect calls might be slightly more expensive, mostly the first time through a loop).
Function call overhead is so small that only loops that call functions can make call overhead relevant.
Therefore, when we talk about (and measure) function call overhead today, we are usually really talking about the overhead of not being able to hoist common subexpressions out of loops. If a function has to do a bunch of (identical) work every time it is called, the compiler would be able to "hoist" that work out of the loop and do it once if the function were inlined. When it is not inlined, the code will probably just go ahead and repeat the work, as you told it to!
Inlined functions seem impossibly fast not because of call and argument overhead, but because of common subexpressions that can be hoisted out of the function.
Example:
An optimizer can see through this foolishness and do:
It seems as if call overhead is impossibly reduced, because the optimizer really has hoisted a big chunk of the function out of the loop (the CalculatePi_1000_digits call). The compiler would need to be able to prove that CalculatePi_1000_digits always returns the same result, but good optimizers can do that.
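The kind of transformation described here can be sketched roughly as below. The function expensive_constant is a hypothetical stand-in for the CalculatePi_1000_digits call, and scale_all_hoisted is written out by hand to show what an optimizer might produce after inlining. It is not actual compiler output.

```cpp
#include <vector>

// Hypothetical stand-in for an expensive call that always returns the same
// value: a 1000-term Leibniz series approximating pi.
double expensive_constant() {
    double sum = 0.0;
    for (int k = 0; k < 1000; ++k) {
        sum += (k % 2 == 0 ? 4.0 : -4.0) / (2.0 * k + 1.0);
    }
    return sum;
}

// Does the expensive work on every call.
double scale(double x) { return x * expensive_constant(); }

// As written: the loop-invariant work is repeated for every element unless
// the optimizer inlines scale() and hoists the call.
void scale_all_naive(std::vector<double>& values) {
    for (double& x : values) x = scale(x);
}

// Hand-hoisted equivalent: the invariant value is computed once.
void scale_all_hoisted(std::vector<double>& values) {
    const double factor = expensive_constant();
    for (double& x : values) x *= factor;
}
```

Both versions produce the same values; the second one calls expensive_constant once instead of once per element, which is where most of the apparent "call overhead" goes.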
There are a few issues here.
If you have a smart enough compiler, it will do some automatic inlining for you even if you did not specify inline. On the other hand, there are many things that cannot be inlined.
If the function is virtual, then of course you are going to pay the price that it usually cannot be inlined, because the target is determined at runtime. Similarly, in Java, where methods are virtual by default, you might be paying this price unless you indicate that the method is final.
Depending on how your code is organized in memory, you may be paying a cost in cache misses and even page misses as the code is located elsewhere. That can end up having a huge impact in some applications.
There is not much overhead at all, especially with small (inline-able) functions or even classes.
The following example has three different tests that are each run many, many times and timed. The results always differ by only a couple of thousandths of a unit of time.
The output for running 10,000,000 iterations (of each type: simple, six function calls, three object calls) with this semi-convoluted work payload:
was as follows:
Using a simple work payload of
gives the same results, except a couple of orders of magnitude faster in each case.
As others have said, you really don't have to worry too much about overhead, unless you're going for ultimate performance or something akin. When you make a function the compiler has to write code to:
etc...
However, you have to account for the reduced readability of your code, as well as how it will affect your testing strategies, maintenance plans, and the overall size of your source files.
Each function call requires a new stack frame to be created. But the overhead of this would only be noticeable if you are calling a function on every iteration of a loop over a very large number of iterations.
For most functions, there is no additional overhead for calling them in C++ vs C (unless you count the "this" pointer as an unnecessary argument to every member function, but you have to pass state to a function somehow)...
For virtual functions, there is an additional level of indirection (equivalent to calling a function through a pointer in C)... But really, on today's hardware this is trivial.
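The comparison with a C function pointer can be sketched like this. All type and function names (CShape, square_area, Shape, Square) are hypothetical; the C-style struct carries its own function pointer, while the C++ class leaves the equivalent lookup to the compiler-generated vtable.

```cpp
// C style: the object carries a pointer to its own "method".
struct CShape {
    double (*area)(const CShape*);
    double side;
};

double square_area(const CShape* s) { return s->side * s->side; }

// C++ style: the compiler stores the function address in a vtable.
struct Shape {
    virtual ~Shape() = default;
    virtual double area() const = 0;
};

struct Square : Shape {
    double side;
    explicit Square(double s) : side(s) {}
    double area() const override { return side * side; }
};

// Indirect call through an explicit function pointer.
double area_via_pointer(const CShape& s) { return s.area(&s); }

// Indirect call through the vtable.
double area_via_virtual(const Shape& s) { return s.area(); }
```

In both cases the call target is loaded from memory before the call, which is the extra level of indirection the answer refers to.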