Why is inlining considered faster than a function call?
Now, I know it's because there's no function-call overhead, but is the overhead of calling a function really that heavy (and worth the bloat of having it inlined)?
From what I can remember, when a function is called, say f(x,y), x and y are pushed onto the stack, and the stack pointer jumps to an empty block, and begins execution. I know this is a bit of an oversimplification, but am I missing something? A few pushes and a jump to call a function, is there really that much overhead?
Let me know if I'm forgetting something, thanks!
Aside from the fact that there's no call (and therefore no associated expenses, like parameter preparation before the call and cleanup after the call), there's another significant advantage of inlining. When the function body is inlined, its body can be re-interpreted in the specific context of the caller. This might immediately allow the compiler to further reduce and optimize the code.
For one simple example, a function whose body branches on one of its parameters will require actual branching if called as a non-inlined function, even when each call site passes a constant argument.
However, if those calls are inlined, the compiler will immediately be able to eliminate the branching. Essentially, in this case inlining allows the compiler to interpret the function argument as a compile-time constant (if the argument is a compile-time constant), something that is generally not possible with non-inlined functions.
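A minimal sketch of the kind of function described here. The names `adjust`, `adjust_doubled` and `adjust_bumped` are hypothetical and not from the original answer, and whether the branch is actually removed depends on the compiler and optimization level:

```cpp
#include <cassert>

// Hypothetical function, used only for illustration.
inline int adjust(int value, bool doubled) {
    if (doubled) {
        return value * 2;
    }
    return value + 1;
}

// Each caller passes a literal, so after inlining the condition is a
// known constant at that call site and one arm of the branch is dead.
int adjust_doubled(int v) { return adjust(v, true); }   // can reduce to v * 2
int adjust_bumped(int v)  { return adjust(v, false); }  // can reduce to v + 1

// Behaviour is identical either way; only the generated code differs.
void check_adjust() {
    assert(adjust_doubled(5) == 10);
    assert(adjust_bumped(5) == 6);
}
```

Comparing the assembly of the two callers with and without optimization is one way to see whether a given compiler folded the branch.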
However, it is not even remotely limited to that. In general, the optimization opportunities enabled by inlining are significantly more far-reaching. For another example, when the function body is inlined into the specific caller's context, the compiler will in general be able to propagate the aliasing relationships known in the calling code into the inlined function code, making it possible to optimize the function's code better.
Again, the possible examples are numerous, all of them stemming from the basic fact that inlined calls are immersed in the specific caller's context, enabling various inter-context optimizations that would not be possible with non-inlined calls. With inlining you basically get many individual versions of your original function, each one tailored and optimized for its specific caller context. The price of that is, obviously, the potential danger of code bloat, but if used correctly, it can provide noticeable performance benefits.
"A few pushes and a jump to call a function, is there really that much overhead?"
It depends on the function.
If the body of the function is just one machine code instruction, the call and return overhead can be many hundred percent. Say the call takes 6 times as long as the body alone: that is 500% overhead. Then if your program consists of nothing but a gazillion calls to that function, with no inlining you've increased the running time by 500%.
However, in the other direction inlining can have a detrimental effect, e.g. because code that would fit in one page of memory without inlining no longer does.
So the answer, as always when it comes to optimization, is first of all: MEASURE.
There is no calling and stack activity, which certainly saves a few CPU cycles. In modern CPUs, code locality also matters: doing a call can flush the instruction pipeline and force the CPU to wait for memory to be fetched. This matters a lot in tight loops, since main memory is quite a lot slower than modern CPUs.
However, don't worry about inlining if your code is only being called a few times in your application. Worry, a lot, if it's being called millions of times while the user waits for answers!
The classic candidate for inlining is an accessor, like std::vector<T>::size(). With inlining enabled this is just the fetching of a variable from memory, likely a single instruction on any architecture. The "few pushes and a jump" (plus the return) is easily multiple times as much.
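To make the accessor case concrete, here is a small sketch. `IntBuffer` and `count_nonempty` are hypothetical names invented for illustration; this is not how any particular standard library implements `size()`:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical class for illustration; not the real std::vector.
class IntBuffer {
public:
    explicit IntBuffer(std::size_t n) : size_(n) {}

    // Defined inside the class body, so it is implicitly inline.
    std::size_t size() const { return size_; }

private:
    std::size_t size_;  // state hidden behind the accessor
};

// The loop calls size() once per element. If the call is inlined, each
// iteration only needs to read the size_ member.
std::size_t count_nonempty(const IntBuffer* bufs, std::size_t n) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (bufs[i].size() != 0) {
            ++count;
        }
    }
    return count;
}

void check_count_nonempty() {
    const IntBuffer bufs[3] = {IntBuffer(0), IntBuffer(2), IntBuffer(5)};
    assert(count_nonempty(bufs, 3) == 2);
}
```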
Add to that the fact that the more code is visible at once to an optimizer, the better it can do its work. With lots of inlining, it sees lots of code at once. That means that it might be able to keep the value in a CPU register, and completely spare the costly trip to memory. Now we might talk about a difference of several orders of magnitude.
And then there's template meta-programming. Sometimes this results in calling many small functions recursively, just to fetch a single value at the end of the recursion. (Think of fetching the value of the first entry of a specific type in a tuple with dozens of objects.) With inlining enabled, the optimizer can directly access that value (which, remember, might be in a register), collapsing dozens of function calls into accessing a single value in a CPU register. This can turn a terrible performance hog into a nice and speedy program.
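A rough sketch of that tuple lookup, assuming C++17 (for `if constexpr`). The helper `first_of` is hypothetical and written here only to show the recursion; `std::get<T>` in the standard library is not necessarily implemented this way:

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>
#include <type_traits>

// Hypothetical helper: returns the first element of type T by walking the
// tuple one index at a time. Each step is its own tiny function instance.
template <typename T, std::size_t I = 0, typename... Ts>
const T& first_of(const std::tuple<Ts...>& t) {
    static_assert(I < sizeof...(Ts), "type not found in tuple");
    using Elem = typename std::tuple_element<I, std::tuple<Ts...>>::type;
    if constexpr (std::is_same<Elem, T>::value) {
        return std::get<I>(t);         // end of the recursion
    } else {
        return first_of<T, I + 1>(t);  // one more nested call
    }
}

void check_first_of() {
    const std::tuple<int, double, char> t{1, 2.5, 'x'};
    assert(first_of<int>(t) == 1);
    assert(first_of<double>(t) == 2.5);
    assert(first_of<char>(t) == 'x');  // three nested calls at source level
}
```

Without inlining, `first_of<char>` goes through three real calls. With inlining, an optimizer can reduce the chain to a single access to the third element.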
Hiding state as private data in objects (encapsulation) has its costs. Inlining was part of C++ from the very beginning in order to minimize these costs of abstraction. Back then, compilers were significantly worse at detecting good candidates for inlining (and rejecting bad ones) than they are today, so manual inlining resulted in considerable speed gains.
Nowadays compilers are reputed to be much more clever than we are about inline. Compilers are able to inline functions automatically, or to not inline functions that users marked as inline even though they could. Some say that inlining should be left to the compiler completely and we shouldn't even bother marking functions as inline. However, I have yet to see a comprehensive study showing whether manually doing so is still worth it or not. So for the time being, I'll keep doing it myself, and let the compiler override that if it thinks it can do better.
A call to an inlined function is equal to writing the function's body directly at the call site.
No jump, no overhead.
Consider a simple function and the code it is converted to (MSVC++ v6, debug build).
There are just 4 instructions for the function body but 15 instructions of function overhead, not including another 3 for calling the function itself. If all instructions took the same time (they don't), then about 80% of this code is function overhead.
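As a quick check on that figure, taking the quoted instruction counts at face value and assuming (unrealistically, as noted) that every instruction costs the same:

```latex
\[
\frac{\text{overhead}}{\text{total}}
  = \frac{15 + 3}{4 + 15 + 3}
  = \frac{18}{22}
  \approx 0.82
\]
```

So roughly four out of every five instructions executed per call are overhead in this debug build.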
For a trivial function like this there is a good chance that the function overhead code will take just as long to run as the main function body itself. When you have trivial functions that are called in a deep loop body millions/billions of times then the function call overhead begins to become large.
As always, the key is profiling/measuring to determine whether or not inlining a specific function yields any net performance gains. For more "complex" functions that are not called "often" the gain from inlining may be immeasurably small.
There are multiple reasons for inlining to be faster, only one of which is obvious:
The cache utilization can also work against you - if inlining makes the code larger, there's more possibility of cache misses. That's a much less likely case though.
A typical example of where it makes a big difference is std::sort, which calls its comparison function O(N log N) times.
Try creating a vector of a large size and call std::sort first with an inline function and second with a non-inlined function and measure the performance.
This, by the way, is where sort in C++ is faster than qsort in C, which requires a function pointer.
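The two styles can be put side by side as below. This is only a sketch with hypothetical helper names (`compare_ints`, `sort_with_qsort`, `sort_with_std_sort`). It shows the call shapes, not a benchmark, so timing it on a large vector is left to the reader:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <vector>

// Comparison callback for qsort. It is reached through a function pointer,
// which typically prevents it from being inlined into the sort loop.
int compare_ints(const void* a, const void* b) {
    const int x = *static_cast<const int*>(a);
    const int y = *static_cast<const int*>(b);
    return (x > y) - (x < y);
}

std::vector<int> sort_with_qsort(std::vector<int> v) {
    std::qsort(v.data(), v.size(), sizeof(int), compare_ints);
    return v;
}

// The lambda's type is part of the std::sort instantiation, so the
// compiler sees its body and is free to inline the comparison.
std::vector<int> sort_with_std_sort(std::vector<int> v) {
    std::sort(v.begin(), v.end(), [](int a, int b) { return a < b; });
    return v;
}

void check_sorts() {
    const std::vector<int> input{5, 3, 9, 1, 3};
    const std::vector<int> expected{1, 3, 3, 5, 9};
    assert(sort_with_qsort(input) == expected);
    assert(sort_with_std_sort(input) == expected);
}
```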
One other potential side effect of the jump is that you might trigger a page fault, either to load the code into memory the first time, or if it's used infrequently enough to get paged out of memory later.
Andrey's answer already gives you a very comprehensive explanation. But just to add one point that he missed, inlining can also be extremely valuable on very short functions.
If a function body consists of just a few instructions, then the prologue/epilogue code (the push/pop/call instructions, basically) might actually be more expensive than the function body itself. If you call such a function often (say, from a tight loop), then unless the function is inlined, you can end up spending the majority of your CPU time on the function call, rather than the actual contents of the function.
What matters isn't really the cost of a function call in absolute terms (where it might take just 5 clock cycles or something like that), but how long it takes relative to how often the function is called. If the function is so short that it can be called every 10 clock cycles, then spending 5 cycles for every call on "unnecessary" push/pop instructions is pretty bad.
It is not always the case that in-lining results in larger code. For example, a simple data access function (a getter that just returns a member) will take significantly more instruction cycles as a function call than as an in-line, and such functions are best suited to in-lining.
If the function body contains a significant amount of code the function call overhead will indeed be insignificant, and if it is called from a number of locations, it may indeed result in code bloat - although your compiler is as likely to simply ignore the inline directive in such cases.
You should also consider the frequency of calling; even for a large-ish code body, if the function is called frequently from one location, the saving may in some cases be worthwhile. It comes down to the ratio of call-overhead to code body size, and the frequency of use.
Of course you could just leave it up to your compiler to decide. I only ever explicitly in-line functions that consist of a single statement not involving a further function call, and that is more for speed of development of class methods than for performance.
Because there's no call. The function's code is just copied into the caller.
Inlining a function is a suggestion to the compiler to replace the function call with the function's definition. If it's replaced, then there will be no function-call stack operations [push, pop]. But it's not always guaranteed. :)
--Cheers
Optimizing compilers apply a set of heuristics to determine whether or not inlining will be beneficial.
Sometimes the gain from avoiding the function call will outweigh the potential cost of the extra code, sometimes not.
Inlining makes a big difference when a function is called many times.
Because no jump is performed.