垃圾收集和运行时类型信息

发布于 2024-07-08 08:34:11 字数 357 浏览 5 评论 0原文

fixnum 问题让我想到了另一个问题我想了很长时间。

许多有关垃圾收集的在线材料没有说明如何实现运行时类型信息。因此，我对各种垃圾收集器了解很多，但不太了解如何实现它们。

fixnum 解决方案实际上非常好，很清楚哪个值是指针，哪个不是。还有哪些其他常用的存储类型信息的解决方案？

另外，我想知道 fixnum - 的事情。这是否意味着您在每个数组索引上都被限制为固定数？或者是否有某种解决方法可以获取完整的 64 位整数？

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

江湖彼岸 2024-07-15 08:34:12

基本上，为了实现准确的标记，您需要元数据来指示哪些单词用作指针，哪些不用作指针。

该元数据可以按引用存储，就像 emacs 所做的那样。如果对于您的语言/实现，您不太关心内存使用，您甚至可以使引用比单词大（可能是单词的两倍），以便每个引用都可以携带类型信息及其单字数据。这样，您就可以获得 32 位指针完整大小的固定数，但代价是引用全部为 64 位。

或者，元数据可以与其他类型信息一起存储。因此，例如，一个类除了通常的函数指针表之外，还可以包含数据布局的每个字一位，指示该字是否包含垃圾收集器应遵循的引用。如果您的语言具有虚拟调用，那么您必须已经有一种方法可以从对象中计算出要使用的函数地址，因此相同的机制将允许您计算出要使用的标记数据 - 通常您在以下位置添加一个额外的秘密指针每个对象的开始，指向构成其运行时类型的类。显然，对于某些动态语言，所指向的类型数据需要是写时复制的，因为它是可修改的。

堆栈可以做类似的事情 - 将准确的标记信息存储在代码本身的数据部分中，并让垃圾收集器检查存储的程序计数器和/或堆栈上的链接指针，和/或由堆栈放置在堆栈上的其他信息代码，以确定堆栈的每个位与哪个代码相关，从而确定哪些字是指针。轻量级异常机制往往会做类似的事情来存储有关代码中发生 try/catch 的位置的信息，当然调试器也需要能够解释堆栈，因此这很可能与一堆其他东西折叠在一起您已经在实现任何语言，包括具有内置垃圾收集功能的语言。

请注意，垃圾收集不一定需要准确的标记。您可以将每个单词视为一个指针，无论它是否真的是一个指针，在垃圾收集器的“所有内容的大列表”中查找它，以确定它是否可以引用尚未标记的对象，如果因此将其视为对该对象的引用。这很简单，但代价当然是它介于“相当慢”和“非常慢”之间，具体取决于 gc 用于查找的数据结构。此外，有时一个整数恰好与未引用对象的地址具有相同的值，并导致您保留一大堆应该被收集的对象。因此，这样的垃圾收集器无法为收集未引用的对象提供强有力的保证。这对于玩具实现或第一个工作版本来说可能很好，但不太可能受到用户的欢迎。

例如，混合方法可以准确标记对象，但不能准确标记堆栈中事物变得特别复杂的区域。例如，如果您编写的 JIT 可以创建代码，其中引用的对象地址仅出现在寄存器中，而不是出现在通常的堆栈槽中，那么您可能需要不准确地跟踪操作系统在其存储寄存器时所在的堆栈区域。重新安排有问题的线程来运行垃圾收集器。这可能相当繁琐，因此合理的方法（可能会导致代码变慢）是要求 JIT 始终在准确标记的堆栈上保留其正在使用的所有指针值的副本。

Basically to achieve accurate marking you need meta-data indicating which words are used as pointers and which are not.

This meta-data could be stored per reference, as emacs does. If for your language/implementation you don't care much about memory use, you could even make references bigger than words (perhaps twice as big), so that every reference can carry type information as well as its one-word data. That way you could have a fixnum the full size of a 32 bit pointer, at the cost of references all being 64 bit.

Alternatively, the meta-data could be stored along with other type information. So for example a class could contain, as well as the usual function pointer table, one bit per word of the data layout indicating whether or not the word contains a reference that should be followed by the garbage collector. If your language has virtual calls then you must already have a means of working out from an object what function addresses to use, so the same mechanism will allow you to work out what marking data to use - typically you add an extra, secret pointer at the start of every single object, pointing to the class which constitutes its runtime type. Obviously with certain dynamic languages the type data pointed to would need to be copy-on-write, since it is modifiable.

The stack can do similar - store the accurate marking information in data sections of the code itself, and have the garbage collector examine the stored program counter, and/or link pointers on the stack, and/or other information placed on the stack by the code for the purpose, to determine which code each bit of stack relates to and hence which words are pointers. Lightweight exception mechanisms tend to do a similar thing to store information about where try/catch occurs in the code, and of course debuggers need to be able to interpret the stack too, so this can quite possibly be folded in with a bunch of other stuff you'd already be doing to implement any language, including ones with built-in garbage collection.

Note that garbage collection doesn't necessarily need accurate marking. You could treat every word as a pointer, regardless of whether it really is or not, look it up in your garbage collector's "big list of everything" to decide whether it plausibly could refer to an object that has not yet been marked, and if so treat it as a reference to that object. This is simple, but the cost of course is that it's somewhere between "quite slow" and "very slow", depending on what data structures your gc uses for the lookup. Furthermore, sometimes an integer just so happens to have the same value as the address of an unreferenced object, and causes you to keep a whole bunch of objects which should have been collected. So such a garbage collector cannot offer strong guarantees about unreferenced objects ever being collected. This might be fine for a toy implementation or first working version, but is unlikely to be popular with users.

A mixed approach might, say, do accurate marking of objects, but not of regions of the stack where things get particularly hairy. For example if you write a JIT which can create code where a referenced object address appears only in registers, not in your usual stack slots, then you might need to non-accurately follow the region of the stack where the OS stored the registers when it descheduled the thread in question to run the garbage collector. Which is probably quite fiddly, so a reasonable approach (potentially resulting in slower code) would be to require the JIT to always keep a copy of all pointer values it's using on the accurately marked stack.

回复收藏 0 原文