Why not mark every function definition inline?

Published 2024-09-28 15:38:37


First off, I am not looking for a way to force the compiler to inline the implementation of every function.

To reduce the level of misguided answers, make sure you understand what the inline keyword actually means. Here is a good description: inline vs static vs extern.

So my question: why not mark every function definition inline? I.e., ideally the only compilation unit would be main.cpp, or possibly a few more for the functions that cannot be defined in a header file (pimpl idiom, etc.).

The theory behind this odd request is it would give the optimizer maximum information to work with. It could inline function implementations of course, but it could also do "cross-module" optimization as there is only one module. Are there other advantages?

Has anyone tried this with a real application? Did the performance increase? Decrease?!?

What are the disadvantages of marking all function definitions inline?

  • Compilation might be slower and will consume much more memory.
  • Iterative builds are broken, the entire application will need to be rebuilt after every change.
  • Link times might be astronomical.

All of these disadvantages only affect the developer. What are the runtime disadvantages?


Comments (11)

娇柔作态 2024-10-05 15:38:37


Did you really mean #include everything? That would give you only a single module and let the optimizer see the entire program at once.

Actually, Microsoft's Visual C++ does exactly this when you use the /GL (Whole Program Optimization) switch; it doesn't actually compile anything until the linker runs and has access to all the code. Other compilers have similar options.

初相遇 2024-10-05 15:38:37


sqlite uses this idea. During development it uses a traditional source structure, but for actual use there is one huge C file (112k lines). They do this for maximum optimization, claiming about a 5-10% performance improvement.

http://www.sqlite.org/amalgamation.html

ι不睡觉的鱼゛ 2024-10-05 15:38:37


We (and some other game companies) did try it, by making one uber-.CPP that #included all the others; it's a known technique. In our case, it didn't seem to affect runtime much, but the compile-time disadvantages you mention turned out to be utterly crippling. With a half-hour compile after every single change, it becomes impossible to iterate effectively. (And this is with the app divvied up into over a dozen different libraries.)

We tried making a different configuration such that we would have multiple .objs while debugging and then have the uber-CPP only in release-opt builds, but then ran into the problem of the compiler simply running out of memory. For a sufficiently large app, the tools simply are not up to compiling a multimillion line cpp file.

We tried LTCG as well, and that provided a small but nice runtime boost, in the rare cases where it didn't simply crash during the link phase.

汐鸠 2024-10-05 15:38:37


Interesting question! You are certainly right that all of the listed disadvantages are specific to the developer. I would suggest, however, that a disadvantaged developer is far less likely to produce a quality product. There may be no runtime disadvantages, but imagine how reluctant a developer will be to make small changes if each compile takes hours (or even days) to complete.

I would look at this from a "premature optimization" angle: modular code in multiple files makes life easier for the programmer, so there is an obvious benefit to doing things this way. Only if a specific application turns out to run too slow, and it can be shown that inlining everything makes a measured improvement, would I even consider inconveniencing the developers. Even then, it would be after a majority of the development has been done (so that it can be measured) and would probably only be done for production builds.

一影成城 2024-10-05 15:38:37


This is semi-related, but note that Visual C++ does have the ability to do cross-module optimization, including inline across modules. See http://msdn.microsoft.com/en-us/library/0zza0de8%28VS.80%29.aspx for info.

To add an answer to your original question, I don't think there would be a downside at run time, assuming the optimizer was smart enough (hence why it was added as an optimization option in Visual Studio). Just use a compiler smart enough to do it automatically, without creating all the problems you mention. :)

一场信仰旅途 2024-10-05 15:38:37


That's pretty much the philosophy behind Whole Program Optimization and Link Time Code Generation (LTCG): optimization opportunities are best with global knowledge.

From a practical point of view it's sort of a pain because now every single change you make will require a recompilation of your entire source tree. Generally speaking you need an optimized build less frequently than you need to make arbitrary changes.

I tried this in the Metrowerks era (it's pretty easy to set up with a "Unity"-style build) and the compilation never finished. I mention it only to point out that it's a workflow setup that's likely to tax the toolchain in ways its designers weren't anticipating.

Oo萌小芽oO 2024-10-05 15:38:37


Little benefit
On a good compiler for a modern platform, inline will affect only a very few functions. It is just a hint to the compiler; modern compilers are fairly good at making this decision themselves, and the overhead of a function call has become rather small (often, the main benefit of inlining is not reducing call overhead but opening up further optimizations).

Compile time
However, since inline also changes semantics, you will have to #include everything into one huge compile unit. This usually increases compile time significantly, which is a killer on large projects.

Code size
If you move away from current desktop platforms and their high-performance compilers, things change a lot. In this case, the increased code size generated by a less clever compiler will be a problem, so much so that it makes the code significantly slower. On embedded platforms, code size is usually the first restriction.

Still, some projects can and do profit from "inline everything". It gives you the same effect as link-time optimization, at least if your compiler doesn't blindly follow the inline keyword.

猫七 2024-10-05 15:38:37


It is done already in some cases. It is very similar to the idea of unity builds, and the advantages and disadvantages are not far from what you describe:

  • more potential for the compiler to optimize
  • link time basically goes away (if everything is in a single translation unit, there is nothing to link, really)
  • compile time goes, well, one way or the other. Incremental builds become impossible, as you mentioned. On the other hand, a complete build is going to be faster than it would be otherwise (as every line of code is compiled exactly once. In a regular build, code in headers ends up being compiled in every translation unit where the header is included)

But in cases where you already have a lot of header-only code (for example if you use a lot of Boost), it might be a very worthwhile optimization, both in terms of build time and executable performance.

As always though, when performance is involved, it depends. It's not a bad idea, but it's not universally applicable either.

As far as build time goes, you have basically two ways to optimize it:

  • minimize the number of translation units (so your headers are included in fewer places), or
  • minimize the amount of code in headers (so that the cost of including a header in multiple translation units decreases).

C code typically takes the second option, pretty much to its extreme: almost nothing apart from forward declarations and macros is kept in headers.
C++ often lies around the middle, which is where you get the worst possible total build time (though PCHs and/or incremental builds may shave some time off it again); going further in the other direction, minimizing the number of translation units, can really do wonders for the total build time.

○闲身 2024-10-05 15:38:37


The assumption here is that the compiler cannot optimize across functions. That is a limitation of specific compilers and not a general problem. Using this as a general solution for a specific problem might be bad. The compiler may very well just bloat your program with what could have been reusable functions at the same memory address (getting to use the cache) being compiled elsewhere (and losing performance because of the cache).

Big functions in general cost at optimization time; there is a balance between the overhead of local variables and the amount of code in the function. Keeping the number of variables in the function (passed in, local, and global) within the number of registers the platform has available means most everything can stay in registers rather than being evicted to RAM; also, a stack frame is not required (depends on the target), so function-calling overhead is noticeably reduced. That is hard to do all the time in real-world applications, but the alternative, a small number of big functions with lots of local variables, means the code is going to spend a significant amount of time evicting and loading registers with variables to/from RAM (depends on the target).

Try LLVM; it can optimize across the entire program, not just function by function. Release 2.7 had caught up to gcc's optimizer, at least for a test or two (I didn't do exhaustive performance testing), and 2.8 is out now, so I assume it is better. Even with a few files, the number of tuning-knob combinations is too large to mess with. I find it best not to optimize at all until you have the whole program in one file, then perform your optimization, giving the optimizer the whole program to work with; that is basically what you are trying to do with inlining, but without the baggage.

泡沫很甜 2024-10-05 15:38:37


Suppose foo() and bar() both call some helper(). If everything is in one compilation unit, the compiler might choose not to inline helper(), in order to reduce total instruction size. This causes foo() to make a non-inlined function call to helper().

The compiler doesn't know that a nanosecond improvement to the running time of foo() adds $100/day to your bottom line in expectation. It doesn't know that a performance improvement or degradation of anything outside of foo() has no impact on your bottom line.

Only you as the programmer know these things (after careful profiling and analysis of course). The decision not to inline bar() is a way of telling the compiler what you know.

み零 2024-10-05 15:38:37


The problem with inlining is that you want high performance functions to fit in cache. You might think function call overhead is the big performance hit, but in many architectures a cache miss will blow the couple pushes and pops out of the water. For example, if you have a large (maybe deep) function that needs to be called very rarely from your main high performance path, it could cause your main high performance loop to grow to the point where it doesn't fit in L1 icache. That will slow your code down way, way more than the occasional function call.
