Inlining algorithms
Does anyone know of any papers discussing inlining algorithms? And, closely related, the relationship of the parent-child graph to the call graph?
Background: I have a compiler written in OCaml which aggressively inlines functions; primarily as a result of this and some other optimisations, it generates faster code for my programming language than most others in many circumstances (including even C).
Problem #1: The algorithm has trouble with recursion. My rule here is only to inline children into parents, which prevents infinite recursion, but it also precludes sibling functions being inlined once into each other.
Problem #2: I do not know of a simple way to optimise inlining operations. My algorithm is imperative, with a mutable representation of function bodies, because it does not seem even remotely possible to make an efficient purely functional inlining algorithm. If the call graph is a tree, it is clear that bottom-up inlining is optimal.
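For the tree case, a minimal sketch (the type and names are mine, not the compiler's): a post-order traversal of the call tree visits callees before callers, which is exactly the bottom-up order.

```ocaml
(* Sketch: when the call graph is a tree, post-order traversal yields the
   bottom-up inlining order (callees before callers), so each function's
   body is fully inlined before it is itself inlined into its caller. *)
type call_tree = Node of string * call_tree list

(* Collect function names in post-order: children first, then the parent. *)
let rec bottom_up_order (Node (name, callees)) =
  List.concat_map bottom_up_order callees @ [name]

let example =
  Node ("main", [Node ("f", [Node ("h", [])]); Node ("g", [])])

(* bottom_up_order example = ["h"; "f"; "g"; "main"] *)
```

Once the graph has cycles, no such total order exists, which is where the ordering problem below comes from.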
Technical information: Inlining consists of a number of inlining steps. The problem is the ordering of the steps.
Each step works as follows:
- We make a copy of the function to be inlined and beta-reduce it by replacing both type parameters and value parameters with arguments.
- We then replace each return statement with an assignment to a new variable, followed by a jump to the end of the function body.
- The original call to the function is then replaced by this body.
- However, we're not finished: we must also clone all the children of the function, beta-reducing them as well, and reparent the clones to the calling function.
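The steps above can be sketched on a toy instruction IR; the type and names here are hypothetical, the beta-reduction is assumed to have already been applied to the body, and the cloning/reparenting of children is elided:

```ocaml
(* Hypothetical mini-IR to illustrate one inlining step; the compiler's
   real representation is of course much richer. *)
type instr =
  | Call of string * string        (* result variable, callee name *)
  | Return of string               (* returned variable *)
  | Assign of string * string      (* destination, source *)
  | Label of string
  | Goto of string
  | Other of string

(* Rewrite the callee's body: every Return becomes an assignment to
   [result] followed by a jump to a fresh end label. *)
let splice_body ~result ~end_label body =
  List.concat_map
    (function
      | Return v -> [Assign (result, v); Goto end_label]
      | i -> [i])
    body

(* One inlining step: replace each call to [callee] by its rewritten
   body, terminated by the end label the jumps target. *)
let inline_step ~callee ~body instrs =
  let end_label = "end_" ^ callee in
  List.concat_map
    (function
      | Call (res, f) when f = callee ->
          splice_body ~result:res ~end_label body @ [Label end_label]
      | i -> [i])
    instrs
```

For example, inlining a body `[Other "work"; Return "x"]` at a call `Call ("r", "f")` yields `[Other "work"; Assign ("r", "x"); Goto "end_f"; Label "end_f"]` in place of the call.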
The cloning operation makes it extremely hard to inline recursive functions. The usual trick of keeping a list of what is already in progress, and just checking whether we're already processing this call, does not work in its naive form: the recursive call has now moved into the beta-reduced code being stuffed into the calling function, and the recursion target may have changed to a cloned child. However that child, in calling the parent, is still calling the original parent, which calls its child, and now the unrolling of the recursion will not stop. As mentioned, I broke this regress by only allowing inlining of a recursive call to a child, which prevents sibling recursions being inlined.
The cost of inlining is further complicated by the need to garbage collect unused functions. Since inlining is potentially exponential, this is essential. If all the calls to a function are inlined, we should get rid of the function, provided it has not itself been inlined into yet; otherwise we'll waste time inlining into a function which is no longer used. Actually keeping track of who calls what is extremely difficult, because when inlining we're not working with an actual function representation, but an "unravelled" one: for example, the list of instructions is processed sequentially while a new list is built up, and at any one point in time there may not be a coherent instruction list.
In his ML compiler Steven Weeks chose to use a number of small optimisations applied repeatedly, since this made the optimisations easy to write and easy to control, but unfortunately this misses a lot of optimisation opportunities compared to a recursive algorithm.
Problem #3: when is it safe to inline a function call?
To explain this problem generically: in a lazy functional language, arguments are wrapped in closures and then we can inline an application; this is the standard model for Haskell. However, it also explains why Haskell is so slow. The closures are not required if the argument is known: the parameter can then be replaced directly with its argument wherever it occurs (this is normal-order beta-reduction).
However, if the argument's evaluation is known to terminate, eager evaluation can be used instead: the parameter is assigned the value of the expression once, and then reused. A hybrid of these two techniques is to use a closure but cache the result inside the closure object. Still, GHC hasn't succeeded in producing very efficient code: it is clearly very difficult, especially with separate compilation.
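The three strategies can be sketched in OCaml (a toy model, not GHC's implementation), using a counter to observe how many times the argument expression is evaluated:

```ocaml
(* Toy model of the three argument-passing strategies: a plain closure
   (re-evaluated at each use), eager evaluation (evaluated once, up
   front), and a caching closure (evaluated at most once, on demand). *)
let count = ref 0
let expensive () = incr count; 21   (* observable "argument expression" *)

let double get = get () + get ()    (* a function using its argument twice *)

(* Eager: evaluate the argument once, then pass the bound value. *)
let eager f arg = let v = arg () in f (fun () -> v)

(* Hybrid: a closure that memoises its result inside the closure object. *)
let cached arg =
  let l = lazy (arg ()) in
  fun () -> Lazy.force l

let a = double (fun () -> expensive ())          (* plain closure *)
let n1 = !count                                  (* evaluated twice *)
let b = (count := 0; eager double expensive)     (* eager *)
let n2 = !count                                  (* evaluated once *)
let c = (count := 0; double (cached expensive))  (* caching closure *)
let n3 = !count                                  (* evaluated once *)
```

All three produce the same value here; they only differ when the argument diverges or has effects, which is exactly why the choice of strategy affects semantics.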
In Felix, I took the opposite approach. Instead of demanding correctness and gradually improving efficiency by proving that optimisations preserve semantics, I mandate that the optimisation defines the semantics. This guarantees correct operation of the optimiser at the expense of some uncertainty about how certain code will behave. The idea is to provide ways for the programmer to force the optimiser to conform to the intended semantics when the default optimisation strategy is too aggressive.
For example, the default parameter-passing mode allows the compiler to choose whether to wrap the argument in a closure, replace the parameter with the argument, or assign the argument to the parameter. If the programmer wants to force a closure, they can just pass in a closure. If the programmer wants to force eager evaluation, they mark the parameter var.
The complexity here is much greater than in a functional programming language: Felix is a procedural language with variables and pointers. It also has Haskell-style typeclasses. This makes the inlining routine extremely complex; for example, typeclass instances replace abstract functions whenever possible (due to type specialisation when calling a polymorphic function, it may become possible to find an instance whilst inlining, so now we have a new function we can inline).
Just to be clear I have to add some more notes.
Inlining and several other optimisations such as user defined term reductions, typeclass instantiations, linear data flow checks for variable elimination, tail rec optimisation, are done all at once on a given function.
The ordering problem isn't the order to apply different optimisations, the problem is to order the functions.
I use a brain-dead algorithm to detect recursion: I build up a list of everything used directly by each function, take the transitive closure, and then check whether the function is in its own result. Note that the usage set is rebuilt many times during optimisation, and this is a serious bottleneck.
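A minimal sketch of that test (the names are illustrative): compute the transitive closure of the direct-use relation and check for self-membership:

```ocaml
(* Recursion test via transitive closure of the "directly uses" relation. *)
module S = Set.Make (String)

(* uses : name -> names used directly by that function *)
let is_recursive uses f =
  let rec reach seen = function
    | [] -> seen
    | g :: rest ->
        let fresh =
          List.filter (fun h -> not (S.mem h seen)) (uses g)
        in
        reach
          (List.fold_left (fun s h -> S.add h s) seen fresh)
          (fresh @ rest)
  in
  (* f is recursive iff f is reachable from its own direct uses *)
  S.mem f (reach S.empty (uses f))

(* f and g call each other (mutually recursive); h only calls f. *)
let uses = function
  | "f" -> ["g"] | "g" -> ["f"] | "h" -> ["f"] | _ -> []
```

Here `is_recursive uses "f"` holds while `is_recursive uses "h"` does not. Doing this from scratch after every change is exactly the rebuild-the-usage-set bottleneck mentioned above.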
Whether a function is recursive or not can change unfortunately. A recursive function might become non-recursive after tail rec optimisation. But there is a much harder case: instantiating a typeclass "virtual" function can make what appeared to be non-recursive recursive.
As to sibling calls, the problem is that given f and g, where f calls g and g calls f, I actually want to inline f into g, and g into f .. once. My infinite-regress stopping rule only allows inlining of f into g, when they're mutually recursive, if f is a child of g, which excludes inlining siblings.
Basically I want to "flatten out" all code "as much as possible".
Comments (2)
I realize you probably already know all this, but it seems important to still write it in full, at least for further reference.
In the functional community, there is some literature, mostly from the GHC people. Note that they consider inlining as a transformation in the source language, while you seem to work at a lower level. Working in the source language -- or an intermediate language of reasonably similar semantics -- is, I believe, a big help for simplicity and correctness.
For the question of ordering compiler passes, this is quite arcane. Still in a Haskell setting, there is the Compilation by Transformation in a Non-strict Functional Language PhD thesis, which discusses the ordering of different compiler passes (and also inlining).
There is also the quite recent paper on Compilation by Equality Saturation, which proposes a novel approach to ordering optimisation passes. I'm not sure it has yet demonstrated applicability at a large scale, but it's certainly an interesting direction to explore.
As for the recursion case, you could use Tarjan's algorithm on your call graph to detect circular dependency clusters, and exclude them from inlining. It won't affect sibling calls.
http://en.wikipedia.org/wiki/Tarjan%27s_strongly_connected_components_algorithm
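A sketch of that suggestion (illustrative, not the poster's code): Tarjan's algorithm returns the strongly connected components of the call graph; any component with more than one function (or a self-loop) is a recursive cluster.

```ocaml
(* Tarjan's strongly connected components over a call graph given as a
   successor function; each returned list is one SCC. *)
let tarjan_sccs (nodes : string list) (succs : string -> string list) =
  let index = Hashtbl.create 16      (* discovery index per node *)
  and lowlink = Hashtbl.create 16    (* lowest index reachable *)
  and on_stack = Hashtbl.create 16
  and stack = ref []
  and counter = ref 0
  and sccs = ref [] in
  let rec strongconnect v =
    Hashtbl.replace index v !counter;
    Hashtbl.replace lowlink v !counter;
    incr counter;
    stack := v :: !stack;
    Hashtbl.replace on_stack v true;
    List.iter
      (fun w ->
        if not (Hashtbl.mem index w) then begin
          strongconnect w;
          Hashtbl.replace lowlink v
            (min (Hashtbl.find lowlink v) (Hashtbl.find lowlink w))
        end
        else if Hashtbl.mem on_stack w && Hashtbl.find on_stack w then
          Hashtbl.replace lowlink v
            (min (Hashtbl.find lowlink v) (Hashtbl.find index w)))
      (succs v);
    if Hashtbl.find lowlink v = Hashtbl.find index v then begin
      (* v is the root of an SCC: pop the stack down to v *)
      let rec pop acc =
        match !stack with
        | w :: rest ->
            stack := rest;
            Hashtbl.replace on_stack w false;
            if w = v then w :: acc else pop (w :: acc)
        | [] -> acc
      in
      sccs := pop [] :: !sccs
    end
  in
  List.iter (fun v -> if not (Hashtbl.mem index v) then strongconnect v) nodes;
  !sccs

(* f and g call each other; h only calls f. *)
let succs = function "f" -> ["g"] | "g" -> ["f"] | _ -> ["f"]
(* tarjan_sccs ["f"; "g"; "h"] succs yields the cluster {f, g} and {h} *)
```

Excluding the multi-node clusters from inlining sidesteps the infinite regress, at the cost of never inlining mutually recursive siblings into each other, which is exactly the case the question wants to handle once.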