Function inlining: what are some examples where it hurts performance?
It's conventional wisdom that function inlining doesn't always benefit, and can even hurt performance:
- The Linux kernel style guide warns against excessive inlining
- Google also recommends programmers be careful with inlining
- The C++ FAQ lite says more of the same
I understand why inlining is supposed to help—it eliminates function call overhead by including the called function in its caller.
I also understand why people claim it can hurt performance—inlining functions can in some cases increase code size, which can eventually increase cache misses or even trigger extra page faults. This all makes sense.
I'm having trouble, though, finding specific examples where inlining actually hurts performance. Surely, if it's enough of a problem to be worth warning about, someone somewhere must have come across a case where inlining caused a problem. So, I ask…
What is a good, concrete example of code where performance is actually hurt by function inlining?
On some platforms, with large inlined functions, performance can be reduced by causing a "far" jump rather than a relative jump. Inlining may also cause a page fault, where the OS needs to pull more code into memory rather than executing code which may already be resident (as a subroutine).
Some platforms have optimized jump instructions for "near" code. This type of jump uses a signed offset from the current position, and the signed offset may be restricted, for example to 127 bytes. A long jump requires a bigger instruction, because it must include the absolute address, and longer instructions take more time to execute.
Long inlined functions may expand the length of the executable so that the OS needs to pull a new "page" into memory, called a page swap. Page swapping slows down the execution of an application.
These are possible reasons why inlined code could slow performance. The real truth is obtained by profiling.
I had this case in our project in C (gcc). My colleague abused inlines in his library; forcing `-fno-inline` reduced the CPU time by 10% (on a SUN V890 with UltraSPARC IV+ processors).
Something not mentioned yet is that inlining big functions into other big functions can cause excessive register spilling, hurting not only the quality of the compiled code but also adding more overhead than the inlining eliminated (it may even throw off global and local optimization heuristics; IIRC MSDN has a warning about this under `__forceinline`). Other 'constructs', such as non-naked inline asm placed inside inlined functions, may produce unneeded stack frames, as may inlines with special alignment requirements, or even ones that simply push the stack allocation into the range where the compiler inserts stack-checking allocation (`_chkstk` under MSVC).
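As a hedged sketch of the situation being described (the names and arithmetic are invented, and whether the compiler actually spills depends on the target and optimization level), consider a register-hungry caller forced to absorb a large helper:

```cpp
// MSVC spelling shown; GCC/Clang would use __attribute__((always_inline)).
#ifdef _MSC_VER
#  define FORCE_INLINE __forceinline
#else
#  define FORCE_INLINE __attribute__((always_inline)) inline
#endif

// A "big" helper that needs many temporaries of its own.
FORCE_INLINE double big_helper(double a, double b, double c, double d) {
    double t1 = a * b + c, t2 = b * c - d, t3 = c * d + a, t4 = d * a - b;
    double t5 = t1 * t2 + t3, t6 = t2 * t3 - t4, t7 = t3 * t4 + t1;
    return t1 + t2 + t3 + t4 + t5 + t6 + t7;
}

// A caller that is already register-hungry. Force-inlining big_helper here
// means its temporaries compete with the caller's live values for registers,
// which can push some of them out to the stack (spilling).
double big_caller(const double* x, int n) {
    double s0 = 0, s1 = 1, s2 = 2, s3 = 3, s4 = 4, s5 = 5, s6 = 6, s7 = 7;
    for (int i = 0; i + 3 < n; i += 4) {
        s0 += big_helper(x[i], x[i + 1], x[i + 2], x[i + 3]);
        s1 *= x[i];     s2 -= x[i + 1];
        s3 += x[i + 2]; s4 *= x[i + 3];
        s5 += s0 * s1;  s6 -= s2 * s3;  s7 += s4;
    }
    return s0 + s1 + s2 + s3 + s4 + s5 + s6 + s7;
}
```

Inspecting the generated assembly (or a profiler's stall counts) for the force-inlined versus out-of-line build is the only way to confirm whether spilling actually occurs.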
I don't think inlining hurts performance, other than indirectly through the code being larger, which I think you described.
In general, inlining improves performance by eliminating the call and return.
[In reference to inline functions]
[Quote snagged from SO user 'Fire Lancer' so credit him]
I have no hard data to back this up, but in the case of the Linux kernel anyway (since the "The Linux kernel style guide" was cited in the question), code size could impact performance because the kernel code occupies physical memory regardless of instruction caching (kernel pages are never paged out).
Memory pages that are used by the kernel are permanently unavailable for user virtual memory. So if you're spending memory pages on copies of inlined code that have dubious benefit (the call overhead is generally small for functions that are large), you're having a negative impact on the system for no real benefit.
Why do you need concrete examples of where inlining hurts performance? It's such a context-sensitive issue. It depends on a number of hardware factors, including the speed of RAM, the CPU model, the compiler version, and much else. It would be possible to create such an example on my computer, yet it might still be faster than the non-inlined version on yours. And inlining, in turn, may enable dozens of other compiler optimizations that would not otherwise be performed. So even in a case where the code bloat causes a performance hit, some compilers may be able to perform enough other optimizations to compensate for it.
So you're not going to get a more meaningful answer than the theory of why it may produce slower code.
If you need a specific example of where performance can be hurt by inlining, then go ahead and write it. It's not that difficult once you know the theory.
You want a function that is big enough to pollute the cache if inlined, and you want to call it from several different, but closely related, places (if you call it from two completely separate modules, the two instantiations of the function won't compete for cache space anyway, but if you alternate quickly between several different call sites, each instantiation may force the previous one out of the cache).
And of course, the function must be written so that little of it can get eliminated when it is inlined. If, upon inlining, the compiler is able to eliminate 80% of the code, then that'll mitigate the performance hit you might otherwise take.
And finally, you'll likely need to force it to be inlined. At best, compilers tend to treat the `inline` keyword as a hint (sometimes not even that), so you'll likely have to look up compiler-specific ways to force a function to be inlined. You may also want to disable other optimizations, as the compiler might otherwise be able to optimize the inlined version.
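A rough sketch of such a construction, under GCC/Clang assumptions (the helper body, call sites, and sizes are invented, and whether it actually measures slower depends entirely on the machine and compiler):

```cpp
// Sketch only: a bulky helper forced inline into several distinct call sites,
// which the driver alternates between rapidly so the copies compete for
// instruction cache. In a serious attempt, the sites would be far more
// numerous and larger, spread across the whole program.
__attribute__((always_inline)) inline unsigned mix(unsigned x) {
    for (int i = 0; i < 128; ++i)                 // long, barely reducible body
        x = (x ^ (x << 7)) * 2654435761u + static_cast<unsigned>(i);
    return x;
}

unsigned site_a(unsigned x) { return mix(x) + 1; }   // each site gets its own copy
unsigned site_b(unsigned x) { return mix(x) + 2; }
unsigned site_c(unsigned x) { return mix(x) + 3; }
unsigned site_d(unsigned x) { return mix(x) + 4; }

unsigned driver(unsigned seed, long iters) {
    unsigned acc = seed;
    for (long i = 0; i < iters; ++i) {
        switch (i & 3) {                          // hop between call sites
            case 0:  acc = site_a(acc); break;
            case 1:  acc = site_b(acc); break;
            case 2:  acc = site_c(acc); break;
            default: acc = site_d(acc); break;
        }
    }
    return acc;
}
```

Compare this against a build with `always_inline` replaced by `noinline`, and keep in mind that at this small scale the compiler and cache may well hide any difference.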
So it's pretty straightforward to produce slower code through inlining, once you know what to do. But it's quite a lot of work to do so, especially if you want anything near predictable or deterministic results. And despite your efforts, next year's compilers or next year's CPUs may again be able to outsmart you and produce faster code from your intentionally "over-inlined" code.
So I just don't see why you'd need to do this. Accept that excessive inlining can hurt in some cases, and understand why it can hurt. Beyond that, why bother?
A final point is that those warnings are often misguided, because there's very little to warn about. The compiler typically chooses by itself what to inline and, at best, treats the `inline` keyword as a hint, so it generally doesn't matter whether or not you try to inline everything. So while it is true that excessive inlining can hurt performance, excessive use of the `inline` keyword usually doesn't. The `inline` keyword has other effects, which should guide its usage: use it when you need to exempt a function from the One Definition Rule, to prevent linker errors when it is defined in multiple translation units.
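A minimal illustration of that last point (the file names are hypothetical):

```cpp
// util.h (hypothetical). Because the definition lives in a header, every .cpp
// that includes it gets its own definition; `inline` is what keeps those
// multiple identical definitions from being an ODR violation at link time.
#pragma once

inline int clamp01(int x) {
    return x < 0 ? 0 : (x > 1 ? 1 : x);
}

// a.cpp and b.cpp can both `#include "util.h"` and link cleanly; remove the
// `inline` and the linker reports a duplicate definition of clamp01.
```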