Register allocation and spilling, the easy way?
I'm looking for a way to allocate local variables to registers. I'm aware of a couple of serious methods for doing it (namely, those mentioned on Wikipedia), but I'm stuck on how "spilling" is accomplished. Also, the relevant literature is quite intimidating. I'm hoping there's something simpler that will satisfy my priorities:
- Correctness -- an algorithm that will generate correct code regardless of how many local variables there are.
- Simplicity -- something I can understand without having to read too much literature.
- Efficiency -- it needs to be better than the current method, which is:
Translate an operation x = y # z to:
movl y, %eax
movl z, %ebx
op %ebx, %eax
movl %eax, x
As I'm targeting Intel 386, some relevant constraints are:
- Binary operations take two arguments, one of which is a source and destination. Unary operations take a single argument.
- Operations can only access one memory location; binary operations therefore need at least one argument in a register.
- There is a maximum of six registers available: %eax, %ebx, %ecx, %edx, %esi, %edi. (%ebp could also be included as a last resort.)
- There are special cases such as for integer division and return registers, but I can ignore them for now.
There are three steps the compiler gets through at the moment:
- i386ification: all operations are converted to a form a = a # b (or a = #a for unary operations).
- Liveness analysis: the sets of live variables before and after each operation are determined.
- Register allocation: an interference graph is built and coloured.
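For readers who want to see the last two steps concretely, here is a minimal sketch in Python. It is not the compiler described above: it assumes a single basic block, represents each operation as a hypothetical `(dst, srcs)` pair (so `a = a # b` becomes `("a", ["a", "b"])`), and assumes every variable is defined inside the block. Values live on entry would need to be treated as defined at the top. The names `liveness`, `interference` and `colour` are made up for illustration.

```python
def liveness(instrs, live_out):
    """Backward pass over one basic block: live-after set for each instruction."""
    live = set(live_out)
    live_after = [None] * len(instrs)
    for i in range(len(instrs) - 1, -1, -1):
        dst, srcs = instrs[i]
        live_after[i] = set(live)
        live.discard(dst)   # the definition kills the old value
        live.update(srcs)   # the uses make their operands live
    return live_after


def interference(instrs, live_after):
    """A defined variable interferes with everything else live after the definition."""
    graph = {}
    for (dst, srcs), live in zip(instrs, live_after):
        for v in [dst, *srcs]:
            graph.setdefault(v, set())
        for v in live:
            graph.setdefault(v, set())
            if v != dst:
                graph[dst].add(v)
                graph[v].add(dst)
    return graph


def colour(graph, k):
    """Greedy colouring with k colours; variables that get no colour are spill candidates."""
    colours, spilled = {}, []
    for v in sorted(graph, key=lambda v: -len(graph[v])):
        used = {colours[n] for n in graph[v] if n in colours}
        free = [c for c in range(k) if c not in used]
        if free:
            colours[v] = free[0]
        else:
            spilled.append(v)
    return colours, spilled
```

The greedy order (highest degree first) is only one possible heuristic; the point is that whatever ends up in `spilled` is exactly the set the rest of the question is about.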
And then the compiler throws its crayons in the air and doesn't know what to do next.
Example
public int mf(int cr, int ci) {
int i = 0;
int zr = 0;
int zi = 0;
while (i < 100 && zr*zr + zi*zi < 4) {
int t = zr * zr - zi * zi + cr;
zi = 2 * zr * zi + ci;
zr = t;
i = i + 1;
}
return i;
}
Here's the rather pretty interference graph for the function, and the CFG with liveness information. The CFG image does require some vertical scrolling, unfortunately.
- Interference graph for a function on 14 variables
- Control-flow graph for a function, with liveness information
Seven colours were used. I would like to spill one of them (or the set of variables assigned that colour). The method of choosing which isn't too important. What gets tricky is how to deal with the spilt variables.
Let's say I spill "pink", which is the set of variables t, $t4, $t7. This means that those operations referring to one of these variables will access it from its position on the stack frame, rather than through a register. This should work for this example.
But what if the program was:
...
a = a + b
...
and both a and b had to be spilled? I can't emit an instruction addl b, a with two memory addresses. I would need another spare register to temporarily hold one of the operands, and that means spilling another colour. This suggests a general method of:
- If all variables can be coloured with r colours, great!
- Otherwise, spill some colours and their associated variables.
- If an operation exists that accesses two spilled variables, spill another colour and use the spare register for temporary storage for all such operations.
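One way to picture the last step is a small emitter that already knows where each operand lives. The sketch below is only an illustration under stated assumptions: `SCRATCH` is a register assumed to be withheld from colouring (here %edi, chosen arbitrarily), stack slots are written as %ebp-relative strings, and `emit_binop` is a hypothetical helper, not part of any real assembler API.

```python
SCRATCH = "%edi"  # assumption: this register is never assigned to a variable


def in_memory(loc):
    """A location is a register if it starts with '%', otherwise a stack slot."""
    return not loc.startswith("%")


def emit_binop(op, dst, src):
    """Emit AT&T-syntax code for `dst = dst op src`.

    dst and src are location strings such as "%ecx" or "-8(%ebp)".
    """
    if in_memory(dst) and in_memory(src):
        # Two memory operands are not encodable: route the source through the scratch register.
        return [f"movl {src}, {SCRATCH}", f"{op} {SCRATCH}, {dst}"]
    # At most one memory operand: the instruction can be emitted directly.
    return [f"{op} {src}, {dst}"]
```

For example, `emit_binop("addl", "-4(%ebp)", "-8(%ebp)")` yields a `movl` into the scratch register followed by an `addl` into the stack slot. This form suits instructions like addl and subl; the two-operand imull only accepts a register destination, so a multiply whose destination is spilled would need a load, operate, store sequence instead.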
At this point I would suspect that a lot more stuff is being spilled than necessary, and wonder if there is some smarter way to spill things, such as spilling part of a variable's lifetime, rather than the whole variable itself. Are there some simple(ish) techniques that I could use here? Again, I'm not aiming particularly high -- certainly not high enough to require reading anything too deep. ;-)
Specific problems
The main specific problem is: when a variable is spilled, how does this affect the instructions generated? Do all instructions using that variable need to access it directly in memory (from its stack position)? How will this work if an operation uses two spilled variables? (The architecture does not permit instructions to access two distinct memory locations.)
Secondary problems are:
- How do I determine where to insert load/store instructions, for correctness (and less importantly, efficiency)?
- Can I spill a variable for only that part of its lifetime when it is not in immediate use, and unspill it later, so that all instructions act on unspilled registers? A variable might live in different registers at different times.
- Can I be a little more efficient with special cases? For example, %eax is used for the return value, so it would be nice if the variable to be returned happened to be allocated to that register by the time the return was encountered. Similarly, some registers are "callee-save", so if fewer variables happened to be live at the time of a function call, having them allocated to non-callee-save registers would mean I can avoid storing those registers.
- Would SSA form help much (if at all)? Being able to eliminate common subexpressions and evaluate constants might reduce(?) register pressure, but otherwise would it have any effect?
The aspects I'm not concerned about (right now) are:
- Stack allocation and optimisation: it's implemented naively already, and can be optimised using the interference graph if need be.
- Compile-time efficiency, just as long as it terminates. (NP-completeness does not imply a given algorithm should be avoided.)
Update
Sorry about the downtime -- I've been thinking about the answers given and trying to find an easy approach to take to start implementing some of the ideas. To be honest, I've been procrastinating... :-\
I found a very nice presentation (PPT, sadly):
http://www.cs.princeton.edu/courses/archive/spr05/cos320/notes/Register%20Allocation.ppt
It answers the question about how to deal with specific operation needs (like using the same register for source and destination, or needing a certain register for some operations). What I'm not sure about is whether the Liveness-Colouring-Allocation cycle terminates.
I'll try to do some actual work soon and hopefully close the question.
Comments (2)
I've used a greedy approach in a JVM allocator once, which worked pretty well. Basically start at the top of a basic block with all values stored on the stack. Then just scan the instructions forward, maintaining a list of registers which contain a value, and whether the value is dirty (needs to be written back). If an instruction uses a value which is not in a register (or not in the correct register), issue a load (or move) to put it in a free register before the instruction. If an instruction writes a value, ensure it is in a register and mark it dirty after the instruction.
If you ever need a register, spill a used register by deallocating the value from it, and writing it to the stack if it is dirty and live. At the end of the basic block, write back any dirty and live registers.
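A rough sketch of the scheme described above, in Python, might look like the following. It is my reading of the description rather than the original JVM allocator: instructions are hypothetical `(dst, srcs)` pairs, the output is a list of pseudo-instructions `("load", reg, value)`, `("store", reg, value)` and `("op", dst_reg, src_regs)`, and it assumes there are at least as many registers as distinct values in any single instruction. The eviction choice (prefer a dead value) is just one possible heuristic.

```python
def allocate_block(instrs, regs, live_out):
    out = []            # generated pseudo-instructions
    reg_of = {}         # value -> register currently holding it
    dirty = set()       # values whose register copy is newer than the stack copy
    free = list(regs)

    # Live-after sets, needed for the "dirty and live" test on eviction.
    live, live_after = set(live_out), []
    for dst, srcs in reversed(instrs):
        live_after.append(set(live))
        live.discard(dst)
        live.update(srcs)
    live_after.reverse()

    def get_reg(keep, live_now):
        """Return a free register, evicting a value not needed by this instruction if necessary."""
        if free:
            return free.pop()
        candidates = [v for v in reg_of if v not in keep]
        victim = min(candidates, key=lambda v: v in live_now)  # dead values first
        reg = reg_of.pop(victim)
        if victim in dirty and victim in live_now:
            out.append(("store", reg, victim))  # write back before reusing the register
        dirty.discard(victim)
        return reg

    for (dst, srcs), live_now in zip(instrs, live_after):
        keep = set(srcs) | {dst}
        for s in srcs:
            if s not in reg_of:
                reg = get_reg(keep, live_now)
                reg_of[s] = reg
                out.append(("load", reg, s))
        if dst not in reg_of:
            reg_of[dst] = get_reg(keep, live_now)
        out.append(("op", reg_of[dst], [reg_of[s] for s in srcs]))
        dirty.add(dst)

    # End of block: everything dirty and live goes back to the stack.
    for v, reg in reg_of.items():
        if v in dirty and v in live_out:
            out.append(("store", reg, v))
    return out
```

Because every value is on the stack at block boundaries, each block can be processed independently, which is what keeps the scheme simple.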
This scheme makes it clear exactly where all the loads/stores go, because you generate them as you go. It is easily adaptable to instructions which take a value in memory, or which can take either of two arguments in memory, but not both.
If you're OK with having all data on the stack at every basic block boundary, this scheme works pretty well. It should give results similar to linear scan within a basic block, as it basically does very similar things.
You can get arbitrarily complicated about how to decide which values to spill and which registers to allocate. Some lookahead can be useful, for example by marking each value with a specific register it needs to be in at some point in the basic block (e.g. eax for a return value, or ecx for a shift amount) and preferring that register when the value is first allocated (and avoiding that register for other allocations). But it is easy to separate out the correctness of the algorithm from the improvement heuristics.
I've used this allocator in an SSA compiler, YMMV.
First: There is no smart way to do it. The problem is NP-complete ;-)
How spilling is done:
You run your register allocation algorithm and get a list of variables you have to spill. Now you can allocate some space on the stack at the beginning of your function. Link every spilled variable to a place on the stack. If you want to be smart, coalesce the stack slots of variables with non-overlapping live ranges.
Whenever you need to spill a register, save it to memory and load it when it is needed again.
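A common way to make "save it and load it when needed" mechanical is to rewrite the code so that each instruction touching a spilled variable works on a fresh, very short-lived temporary, and then run liveness and colouring again. The sketch below assumes a hypothetical `(dst, srcs)` instruction format and emits `("load", temp, var)`, `("op", dst, srcs)` and `("store", temp, var)` tuples; the temporary naming scheme is made up.

```python
import itertools


def rewrite_spills(instrs, spilled):
    """Give every instruction its own temporaries for the spilled variables it touches."""
    fresh = itertools.count()
    out = []
    for dst, srcs in instrs:
        temps = {}
        for v in [dst, *srcs]:
            if v in spilled and v not in temps:
                temps[v] = f"{v}.{next(fresh)}"
        for s in dict.fromkeys(srcs):       # each distinct source once, in order
            if s in temps:
                out.append(("load", temps[s], s))    # stack slot -> temporary
        out.append(("op", temps.get(dst, dst), [temps.get(s, s) for s in srcs]))
        if dst in temps:
            out.append(("store", temps[dst], dst))   # temporary -> stack slot
    return out
```

The temporaries live for only one instruction, so they interfere with far fewer variables than the original did. Allocators built this way usually also refuse to spill such temporaries again, so that each round of the spill-and-recolour loop makes progress.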
How to handle eax:
Mark the register as filled, but do not store any variable in it (pre-allocation). This will make the code generator clear that register. To be smart, store the value in another register if beneficial.
Easy and correct ways to handle spilling:
Just spill everything. This assumes that every variable's live range is the whole program. This can be augmented by using stuff like LRU or usage count to choose which registers should be freed.
The next best thing to do is probably linear scan register allocation. It should be quite easy to implement even when using pre-allocation. I suggest you look into the linked paper.
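To give a feel for how little code linear scan needs, here is a minimal sketch. It assumes each variable has already been reduced to a single live interval `(start, end)` of instruction indices, that there is at least one register, and it ignores pre-allocation entirely; `linear_scan` and its return shape are my own choices for illustration.

```python
def linear_scan(intervals, regs):
    """intervals: {name: (start, end)}. Returns (assignment, spilled)."""
    free = list(regs)
    active = []                  # (end, name) pairs, kept sorted by end point
    assignment, spilled = {}, set()
    for name, (start, end) in sorted(intervals.items(), key=lambda kv: kv[1][0]):
        # Expire intervals that ended before this one starts.
        while active and active[0][0] < start:
            _, old = active.pop(0)
            free.append(assignment[old])
        if free:
            assignment[name] = free.pop()
            active.append((end, name))
            active.sort()
        else:
            last_end, last = active[-1]
            if last_end > end:
                # Spill the active interval that ends last; the new one takes its register.
                assignment[name] = assignment.pop(last)
                spilled.add(last)
                active[-1] = (end, name)
                active.sort()
            else:
                spilled.add(name)
    return assignment, spilled
```

Spilling the interval that ends furthest in the future is the heuristic from the original linear scan formulation; other choices (usage counts, loop depth) slot into the same place.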
Specific Answers
What does correctness mean for you? Even simple allocation algorithms are correct if you do not make a programming error. Proving (mathematical) correctness is a lot more difficult. A store needs to be inserted after the value is created and before its register is reused; a load needs to be inserted before the value is needed again.
Yes, if you program it that way. If your algorithm can handle a value being in different registers during its lifetime, you can use those optimizations.
It's again up to you to implement certain improvements. One possibility would be to only block eax when it's needed, not for the whole program.
Under certain conditions SSA does help. Interference graphs of SSA code are always chordal, meaning that every cycle of four or more nodes has a chord (an edge joining two nodes that are not adjacent in the cycle). This is a special case of graph colouring, in which a minimal colouring can be found in polynomial time. Converting to SSA does not necessarily mean more or less register pressure. While SSA form usually has more variables, these tend to have shorter live ranges.
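As an illustration of why chordality matters, the sketch below colours a chordal graph optimally: maximum cardinality search produces a visiting order, and greedy colouring in that order uses the minimum number of colours on a chordal graph. The graph is assumed to be a dict mapping each node to the set of its neighbours, and the function names are made up. On a non-chordal graph the code still returns a valid colouring, just not necessarily a minimal one.

```python
def mcs_order(graph):
    """Maximum cardinality search: always visit the node with the most visited neighbours."""
    weight = {v: 0 for v in graph}
    order = []
    while weight:
        v = max(weight, key=weight.get)
        order.append(v)
        del weight[v]
        for n in graph[v]:
            if n in weight:
                weight[n] += 1
    return order


def colour_chordal(graph):
    """Greedy colouring in MCS order; optimal when the graph is chordal."""
    colours = {}
    for v in mcs_order(graph):
        used = {colours[n] for n in graph[v] if n in colours}
        colours[v] = next(c for c in range(len(graph)) if c not in used)
    return colours
```

The number of colours used equals the size of the largest clique, which is the maximum number of simultaneously live variables, so whether spilling is needed can be decided before any colouring is attempted.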