Garbage collection implementation in compiled languages
When implementing precise garbage collection, there is always the issue of figuring out which words on the stack are pointers and which are other kinds of data such as integers or floating point numbers. Interpreted languages typically solve this problem by making everything a pointer; compilers for some languages such as Lisp typically solve it by using tag bits to distinguish between pointers and integers.
But what about JIT compilers for languages such as Java and C# that support full unboxed machine word integers and floating-point numbers? How do they tell which of the contents of the stack and CPU registers are pointers?
Comments (2)
The bytecode for such languages always contains full type information. It is either stored in metadata (e.g., for argument types) or implicit in the opcode (e.g., there may be different opcodes for adding integers and for adding floating-point numbers).
When optimising code, the compiler can access this information and use it to improve optimisations. It also uses the information to generate metadata for the compiled code at specific GC safe points.
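To make this concrete, here is a minimal sketch of how typed opcodes let a compiler derive a pointer map at a safe point. The mini-bytecode, the `EFFECTS` table, and `build_stack_maps` are all hypothetical names invented for illustration; they are not real JVM or CLR opcodes or APIs, and the sketch handles only straight-line code.

```python
# Hypothetical mini-bytecode: each opcode implies the types it pops and pushes.
# "ref" marks a pointer slot; "int" and "float" are unboxed machine words.
EFFECTS = {
    "iconst": ([], ["int"]),
    "fconst": ([], ["float"]),
    "aload":  ([], ["ref"]),                 # load an object reference
    "iadd":   (["int", "int"], ["int"]),
    "fadd":   (["float", "float"], ["float"]),
    "new":    ([], ["ref"]),                 # allocation site, treated as a safe point
}

SAFE_POINTS = {"new"}


def build_stack_maps(code):
    """Return {pc: tuple of bools}, True where an operand-stack slot holds a pointer."""
    stack, maps = [], {}
    for pc, op in enumerate(code):
        pops, pushes = EFFECTS[op]
        if op in SAFE_POINTS:
            # Record the stack shape as it is when the collector might run.
            maps[pc] = tuple(t == "ref" for t in stack)
        for expected in reversed(pops):
            actual = stack.pop()
            assert actual == expected, f"type error at pc {pc}"
        stack.extend(pushes)
    return maps


code = ["aload", "iconst", "iconst", "iadd", "new", "fconst"]
maps = build_stack_maps(code)
```

Only the allocation at index 4 gets a map, and that map says the bottom slot is a pointer while the slot above it is a plain integer.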
A GC safe point is a place in the code where it is safe to interrupt a thread to schedule another thread or perform garbage collection. At GC safe points we have the necessary metadata available to find out which registers contain pointers and which don't. In the HotSpot JVM, for example, a loop typically contains a read from a special location in memory. The result of that read is unused, but if the page containing that address is read-protected, a page fault occurs. The runtime can therefore interrupt a thread at its next such read simply by making that page unreadable. Once the thread is interrupted we look at the program counter and look up the metadata in, say, a hash table.
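The lookup step at the end can be sketched as follows. This is only an illustration under assumptions: the `STACK_MAPS` table, the program-counter values, and the register and slot names are made up, and a real runtime would read them from compiler-emitted tables rather than a Python dict.

```python
# Hypothetical metadata emitted by the compiler: for each safe-point program
# counter, the registers and stack slots that hold live pointers.
STACK_MAPS = {
    0x40: {"rax", "sp+8"},
    0x7C: {"rbx"},
}


def find_roots(pc, machine_state):
    """Return the pointer values the collector must treat as roots at this pc."""
    pointer_locations = STACK_MAPS.get(pc)
    if pointer_locations is None:
        raise ValueError(f"pc {pc:#x} is not a GC safe point")
    return {loc: machine_state[loc] for loc in pointer_locations}


# A thread stopped at pc 0x40: rax and sp+8 are pointers,
# while rbx and sp+0 hold plain data and are ignored.
state = {"rax": 0x1000, "rbx": 42, "sp+0": 7, "sp+8": 0x2000}
roots = find_roots(0x40, state)
```

The same register (`rbx` here) can be a pointer at one safe point and plain data at another, which is why the table is keyed by program counter.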
Other places that need to be GC safe points are allocation sites: an allocation may fail and cause GC to occur. You can reduce the number of safe points by allocating memory for multiple objects at once.
Edit: Note that using GC-safe points is only one of many options. Another option, as SK-logic mentioned, is to use separate stacks for pointers and non-pointers. It is then clear that all elements of one stack need to be traversed during GC but none of the others. You still have to be careful with pointers in registers, though. For example, whenever there is a live pointer in a register, the same pointer must also exist on the stack.
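A rough sketch of the separate-stacks idea, using a toy VM. The class `TwoStackVM` and its heap representation are assumptions made for this example, and registers are not modelled at all.

```python
class TwoStackVM:
    """Toy VM that keeps references and unboxed values on different stacks."""

    def __init__(self):
        self.pointer_stack = []  # holds only references into the heap
        self.data_stack = []     # holds only unboxed ints and floats
        self.heap = {}           # address -> list of addresses it references

    def mark(self):
        """Mark everything reachable from the pointer stack; the data stack is never scanned."""
        marked, work = set(), list(self.pointer_stack)
        while work:
            addr = work.pop()
            if addr not in marked:
                marked.add(addr)
                work.extend(self.heap[addr])
        return marked


vm = TwoStackVM()
vm.heap = {100: [200], 200: [], 300: []}
vm.pointer_stack.append(100)
vm.data_stack.append(300)  # an integer that happens to equal a heap address
live = vm.mark()
```

Because the collector never looks at `data_stack`, the integer 300 cannot be mistaken for a reference, so object 300 is correctly treated as garbage.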
A third option is to use a shadow stack that contains a linked list of pointers to stack roots living on the real stack. For details see the paper "Accurate Garbage Collection in an Uncooperative Environment" by Fergus Henderson (PDF).
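The shadow-stack idea can be sketched roughly like this. It is a simulation in Python rather than the C code generation the paper describes, and `ShadowFrame`, `ShadowStack`, and the object names are hypothetical. Each function activation registers a frame holding its pointer-valued locals, and the collector walks the linked list of frames.

```python
class ShadowFrame:
    """One activation's pointer-valued locals, linked to its caller's frame."""

    def __init__(self, num_roots, prev):
        self.roots = [None] * num_roots
        self.prev = prev


class ShadowStack:
    def __init__(self):
        self.top = None

    def enter(self, num_roots):
        self.top = ShadowFrame(num_roots, self.top)
        return self.top

    def leave(self):
        self.top = self.top.prev

    def walk_roots(self):
        """Yield every registered root, innermost frame first."""
        frame = self.top
        while frame is not None:
            for ref in frame.roots:
                if ref is not None:
                    yield ref
            frame = frame.prev


shadow = ShadowStack()


def inner():
    frame = shadow.enter(1)
    frame.roots[0] = "obj_c"
    seen = list(shadow.walk_roots())  # stands in for a collection triggered here
    shadow.leave()
    return seen


def outer():
    frame = shadow.enter(2)
    frame.roots[0] = "obj_a"
    frame.roots[1] = "obj_b"
    count = 3  # a non-pointer local is never registered
    seen = inner()
    shadow.leave()
    return seen


roots_seen = outer()
```

The collector needs no knowledge of the native stack layout, at the cost of extra bookkeeping on every function entry and exit.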
Languages like Java and C# are specified in such a way that they do not require precise collection. An implementation might use a conservative collector, which treats any bit pattern that looks like a pointer as a pointer (even though it might really be an integer or a float). For example, the Boehm collector is a conservative collector that could be used for JIT-ed languages.
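A minimal sketch of what "looks like a pointer" can mean. The heap bounds, the `allocated` set, and `conservative_roots` are assumptions for illustration; this is not the Boehm collector's API, and it only recognises words that match the start address of an allocated object.

```python
# Hypothetical heap layout: bounds of the heap and the start addresses
# of the objects currently allocated in it.
HEAP_START, HEAP_END = 0x10000, 0x20000
allocated = {0x10000, 0x10040, 0x10080}


def conservative_roots(stack_words):
    """Treat every stack word that matches an allocated object address as a root."""
    roots = set()
    for word in stack_words:
        if HEAP_START <= word < HEAP_END and word in allocated:
            roots.add(word)
    return roots


# 0x10040 is a real pointer; 0x10080 is an integer that merely
# happens to have the same bit pattern as an object address.
stack_words = [7, 0x10040, 0x7FFF0000, 0x10080]
roots = conservative_roots(stack_words)
```

The collector cannot tell the two apart, so the object at 0x10080 is kept alive. That is safe but may retain garbage, and it also means the collector cannot move objects whose only references are ambiguous.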