c++估计函数内联好处的启发式
在 C++ 中,用于估计内联函数的计算时间优势的良好启发式是什么,特别是当该函数被非常频繁地调用并且占程序执行时间的 >= 10% 时(例如,暴力计算的评估函数)或随机优化过程)。尽管内联最终可能超出我的控制范围,但我仍然很好奇。
In c++, what is a good heuristic for estimating the compute time benefits of inlining a function, particularly when the function is called very frequently and accounts for >= 10% of the program's execution time (eg. the evaluation function of a brute force or stochastic optimization process). Even though inlining may be ultimately beyond my control, I am still curious.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(6)
没有一般性的答案。这取决于硬件、数量和
其参数的类型以及函数中执行的操作。以及多久一次
它叫什么名字,在哪里。例如,在 Sparc 上,参数(以及
返回值)在寄存器中传递,每个函数获得 16 个新值
寄存器:如果功能足够复杂,这些新寄存器可能
避免函数内联时发生的溢出,并且
非内联版本最终可能比内联版本更快。在英特尔上,
它是寄存器贫乏的,并在寄存器中传递参数,只是
对于同一程序中的相同功能,情况可能恰恰相反。更多的
一般来说,内联可能会增加程序大小,减少局部性。或者
对于非常简单的功能,可以减少程序大小;但又是这样
取决于架构。唯一可能知道的方法就是尝试
两者,测量时间。即便如此,你也只会知道这一点
特定硬件上的特定程序。
There is no general answer. It depends on the hardware, the number and
type of its arguments, and what is done in the function. And how often
it is called, and where. On a Sparc, for example, arguments (and the
return value) are passed in registers, and each function gets 16 new
registers: if the function is complex enough, those new registers may
avoid spilling that would occur if the function were inlined, and the
non-inline version may end up faster than the inlined one. On an Intel,
which is register poor, and passes arguments in registers, just the
opposite might be true, for the same function in the same program. More
generally, inlining may increase program size, reducing locality. Or
for very simple functions, it may reduce program size; but that again
depends on the architecture. The only possible way to know is to try
both, measuring the time. And even then you'll only know for that
particular program, on that particular hardware.
某些架构上的函数调用和返回只需一条指令(尽管它们通常不是类似 RISC 的单周期指令。)一般来说,您可以将其与函数体表示的周期数进行比较。一个简单的属性访问可能只是一条指令,因此将其放入非内联函数中将使执行它的指令数量增加三倍——显然是内联的一个很好的候选者。另一方面,格式化字符串以进行打印的函数可能代表数百条指令,因此多两条指令根本不会产生任何影响。
A function call and return on some architectures take as few as one instruction each (although they're generally not RISC-like single-cycle instructions.) In general, you can compare that to the number of cycles represented by the body of the function. A simple property access might be only a single instruction, and so putting it into a non-inlined function will triple the number of instructions to execute it -- obviously a great candidate for inlining. On the other hand, a function that formats a string for printing might represent hundreds of instructions, so two more isn't going to make any difference at all.
如果您的瓶颈在于递归函数,并且假设递归级别不是最小的(即平均递归不仅仅是几个级别),那么您最好在函数中使用算法而不是使用内联。
如果可能的话,尝试将递归转换为循环或尾递归(可以由编译器隐式转换为循环),或者尝试确定函数中的何处花费了成本。尝试最小化内部操作的影响(也许您正在动态分配可能具有自动存储持续时间的内存,或者您可以将要在包装器中的函数外部执行的常见操作考虑在内并作为额外参数传入)。 ..)
*在评论后编辑,递归不是有意的,而是迭代*
如果编译器有权访问函数的定义,那么在大多数情况下它将为您做出正确的决定。如果它无权访问定义,只需移动代码以便它能够看到它。也许使函数成为静态函数以提供额外的提示,表明它不会在其他任何地方使用,甚至将其标记为内联(知道这不会强制内联) ),但避免使用强制内联的特殊属性,因为编译器可能比无需查看代码即可生成的任何简单启发式方法做得更好。
If your bottleneck is in a recursive function, and assuming that the level of recursion is not minimal (i.e. average recursion is not just a few levels), you are better off in working with the algorithm in the function rather than with inlining.
Try, if possible, to transform the recursion into a loop or tail-recursion (that can be implicitly transformed into a loop by the compiler), or try to determine where in the function the cost is being spent. Try to minimize the impact of the internal operations (maybe you are dynamically allocating memory that could have auto storage duration, or maybe you can factor a common operation to be performed external to the function in a wrapper and passed in as an extra argument,...)
*EDIT after the comment that recursion was not intended, but rather iteration*
If the compiler has access to the definition of the function, it will make the right decision for you in most cases. If it does not have access to the definition, just move the code around so that it does see it. Maybe make the function a
static
function to provide an extra hint that it won't be used anywhere else, or even mark it asinline
(knowing that this will not force inlining), but avoid using special attributes that will force inlining, as the compiler probably does it better than any simple heuristic that can be produced without looking at the code.所有内联为您节省的是函数的进入/退出成本,因此只有当函数几乎不执行任何操作时才值得考虑。
当然,如果函数本身包含函数调用,则可能不值得考虑。
即使该函数只执行很少的操作,但在该函数的任何加速都会被注意到之前,它必须被调用得太多,以至于它在相当大比例的时间里拥有程序计数器。
All inlining saves you is the entry/exit cost of the function, so it's only worth considering if the function does almost nothing.
Certainly if the function itself contains a function call, it's probably not worth considering.
Even if the function does very little, it has to be called so much that it owns the program counter a significant percent of the time, before any speedup of the function would be noticeable.
这里的行为在某种程度上依赖于编译器。显然,对于递归函数,内联行为理论上可以是无限的。 'inline' 关键字只是对编译器的一个提示,如果它不能用它做任何事情,它可以选择忽略它。有些编译器会将递归函数内联到一定深度。
至于“这会加快多少速度”——不幸的是,我们无法为这个问题提供任何形式的答案,因为“这取决于”——函数做了多少工作与函数调用机制本身的开销相比。你为什么不设置一个测试来看看?
The behaviour here is somewhat compiler dependant. With a recursive function obviously inlining behaviour can in theory be infinite. The 'inline' keyword is only a hint to the compiler, it can choose it ignore if it can't do anything with it. Some compilers will inline the recursive function to a certain depth.
As for the 'how much will this speed things up' - unfortunately we can't provide any sort of answer to that question as 'it depends' - how much work is the function doing vs the overhead of the function call mechanism itself. Why don't you set up a test and see?
根据我们 20 多年编写计算密集型 C++ 的经验,内联并不是灵丹妙药。您确实需要分析代码以查看内联是否会提高性能。对于我们来说,除了低水平的 2D 和 3D 点和矢量操作外,内联是浪费时间。与尝试对时钟滴答声进行微观管理相比,制定更好的算法要好得多。
Our experience, 20+ years of writing computationally intensive C++, is that inlining is no silver bullet. You really do need to profile your code to see whether inlining will increase performance. For us except for low level 2D and 3D point and vector manipulations inlining is a waste of time. You are far better off working out a better algorithm than trying to micromanage clock ticks.