当前位置：文江博客话题详情

为什么 OCaml 中的 int 只有 31 位？

发布于 2024-09-24 12:55:30 字数 63 浏览 2 评论 0原文

在其他地方没有见过这个“功能”。我知道第32位用于垃圾收集。但为什么只有整数是这样，而其他基本类型却不是这样呢？

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

栩栩如生 2024-10-01 12:55:30

这称为标记指针表示，是一种非常常见的优化技巧，在许多不同的解释器、虚拟机和运行时系统中使用了数十年。几乎每个 Lisp 实现都使用它们，许多 Smalltalk VM、许多 Ruby 解释器等等。

通常，在这些语言中，您总是传递指向对象的指针。对象本身由对象头组成，对象头包含对象元数据（例如对象的类型、其类，可能是访问控制限制或安全注释等），然后是实际的对象数据本身。因此，一个简单的整数将表示为一个指针加上一个由元数据和实际整数组成的对象。即使使用非常紧凑的表示形式，一个简单整数也大约有 6 个字节。

此外，您无法将这样的整数对象传递给 CPU 来执行快速整数运算。如果要添加两个整数，实际上只有两个指针，它们指向要添加的两个整数对象的对象头的开头。因此，您首先需要对第一个指针执行整数运算，将对象中的偏移量添加到存储整数数据的位置。然后你必须取消引用该地址。对第二个整数再次执行相同操作。现在您有了两个整数，您可以实际要求 CPU 将其相加。当然，您现在需要构造一个新的整数对象来保存结果。

因此，为了执行一个整数加法，您实际上需要执行三个整数加法加上两次指针取消引用加上一次对象构造。而且你占用了将近20 Byte。

然而，诀窍在于，对于像整数这样的所谓的“不可变值类型”，您通常不需要对象标头中的所有元数据：您可以保留所有这些元数据当有人想看的时候，把东西拿出来，然后简单地合成它（这是 VM 书呆子所说的“假装”）。整数总是具有类Integer，无需单独存储该信息。如果有人使用反射来确定整数的类，您只需回复 Integer 并且没有人会知道您实际上并未将该信息存储在对象标头中，而且事实上，那里甚至不是对象头（或对象）。

因此，诀窍是将对象的值存储在指向对象的指针中，从而有效地将两者合二为一。

有些 CPU 实际上在指针内有额外的空间（所谓的标记位），允许您在指针本身内存储有关指针的额外信息。额外的信息，如“这实际上不是一个指针，这是一个整数”。示例包括 Burroughs B5000、各种 Lisp 机器或 AS/400。不幸的是，目前大多数主流CPU 都不具备该功能。

然而，有一个解决办法：当地址未在字边界上对齐时，大多数当前主流 CPU 的工作速度都会显着变慢。有些甚至根本不支持未对齐访问。

这意味着在实践中，所有指针都可以被 4 整除，这意味着它们总是以两个0位结尾。这使我们能够区分真实指针（以00结尾）和实际上伪装的整数指针（以1结尾）。而且它仍然让所有以 10 结尾的指针可以自由地执行其他操作。此外，大多数现代操作系统为自己保留非常低的地址，这给了我们另一个可以乱搞的区域（以 24 0 开头并以 00 结尾的指针）代码>）。

因此，您可以将 31 位整数编码为指针，只需将其向左移动 1 位并向其添加 1 即可。您可以通过简单地适当移动它们来执行非常快的整数算术（有时甚至没有必要）。

我们如何处理其他地址空间？典型的例子包括在其他大地址空间中编码float以及一些特殊对象，例如true、false、nil、0地址附近的127个ASCII字符、一些常用的短字符串、空列表、空对象、空数组等。

例如，在 MRI、YARV 和 Rubinius Ruby 解释器中，整数按照我上面描述的方式进行编码，false 被编码为地址 0（这正是发生的也是C中false的表示），true作为地址2（恰好是C表示true 移动一位），nil 为 4。

This is called a tagged pointer representation, and is a pretty common optimization trick used in many different interpreters, VMs and runtime systems for decades. Pretty much every Lisp implementation uses them, many Smalltalk VMs, many Ruby interpreters, and so on.

Usually, in those languages, you always pass around pointers to objects. An object itself consists of an object header, which contains object metadata (like the type of an object, its class(es), maybe access control restrictions or security annotations and so on), and then the actual object data itself. So, a simple integer would be represented as a pointer plus an object consisting of metadata and the actual integer. Even with a very compact representation, that's something like 6 Byte for a simple integer.

Also, you cannot pass such an integer object to the CPU to perform fast integer arithmetic. If you want to add two integers, you really only have two pointers, which point to the beginning of the object headers of the two integer objects you want to add. So, you first need to perform integer arithmetic on the first pointer to add the offset into the object to it where the integer data is stored. Then you have to dereference that address. Do the same again with the second integer. Now you have two integers you can actually ask the CPU to add. Of course, you need to now construct a new integer object to hold the result.

So, in order to perform one integer addition, you actually need to perform three integer additions plus two pointer dererefences plus one object construction. And you take up almost 20 Byte.

However, the trick is that with so-called immutable value types like integers, you usually don't need all the metadata in the object header: you can just leave all that stuff out, and simply synthesize it (which is VM-nerd-speak for "fake it"), when anyone cares to look. An integer will always have class Integer, there's no need to separately store that information. If someone uses reflection to figure out the class of an integer, you simply reply Integer and nobody will ever know that you didn't actually store that information in the object header and that in fact, there isn't even an object header (or an object).

So, the trick is to store the value of the object within the pointer to the object, effectively collapsing the two into one.

There are CPUs which actually have additional space within a pointer (so-called tag bits) that allow you to store extra information about the pointer within the pointer itself. Extra information like "this isn't actually a pointer, this is an integer". Examples include the Burroughs B5000, the various Lisp Machines or the AS/400. Unfortunately, most of the current mainstream CPUs don't have that feature.

However, there is a way out: most current mainstream CPUs work significantly slower when addresses aren't aligned on word boundaries. Some even don't support unaligned access at all.

What this means is that in practice, all pointers will be divisible by 4, which means they will always end with two 0 bits. This allows us to distinguish between real pointers (that end in 00) and pointers which are actually integers in disguise (those that end with 1). And it still leaves us with all pointers that end in 10 free to do other stuff. Also, most modern operating systems reserve the very low addresses for themselves, which gives us another area to mess around with (pointers that start with, say, 24 0s and end with 00).

So, you can encode a 31-bit integer into a pointer, by simply shifting it 1 bit to the left and adding 1 to it. And you can perform very fast integer arithmetic with those, by simply shifting them appropriately (sometimes not even that is necessary).

What do we do with those other address spaces? Well, typical examples include encoding floats in the other large address space and a number of special objects like true, false, nil, the 127 ASCII characters, some commonly used short strings, the empty list, the empty object, the empty array and so on near the 0 address.

For example, in the MRI, YARV and Rubinius Ruby interpreters, integers are encoded the way I described above, false is encoded as address 0 (which just so happens also to be the representation of false in C), true as address 2 (which just so happens to be the C representation of true shifted by one bit) and nil as 4.

回复收藏 0 原文

小嗲 2024-10-01 12:55:30

请参阅 https://ocaml.org 的“整数、标记位、堆分配值的表示”部分/learn/tutorials/performance_and_profiling.html 以获得良好的描述。

简短的回答是，这是为了性能。将参数传递给函数时，它可以作为整数或指针传递。在机器级语言级别，无法判断寄存器是否包含整数或指针，它只是一个 32 位或 64 位值。因此，OCaml 运行时检查标记位以确定它接收到的是整数还是指针。如果设置了标记位，则该值是一个整数，并将其传递给正确的重载。否则它是一个指针并查找类型。

为什么只有整数才有这个标签？因为其他所有内容都作为指针传递。传递的要么是整数，要么是指向其他数据类型的指针。如果只有一个标记位，则只能有两种情况。

回复收藏 0 原文

乖乖兔^ω^ 2024-10-01 12:55:30

它并不完全是“用于垃圾收集”。它用于在内部区分指针和未装箱的整数。

回复收藏 0 原文

我喜欢麦丽素 2024-10-01 12:55:30

我必须添加此链接以帮助OP了解更多用于 64 位 OCaml 的 63 位浮点类型

虽然文章标题看起来是关于 float，它实际上是在谈论额外的1位

OCaml 运行时允许通过统一的多态性
类型的表示。每个 OCaml 值都表示为一个
词，这样就可以有一个单一的实现，比如说，
“事物列表”，具有访问功能（例如 List.length）和
构建（例如List.map）这些列表，无论它们是否
是整数列表、浮点数列表或整数集列表。
任何无法容纳在单词中的内容都会被分配到单词中的一个块中
堆。表示该数据的字是指向该块的指针。
由于堆仅包含字块，因此所有这些指针都是
对齐：它们的最低有效位始终未设置。
无参构造函数（如下所示：输入 Fruit = Apple | Orange |
香蕉）和整数并不能代表太多的信息，以至于它们
需要在堆中分配。他们的代表未装箱。这
数据直接位于单词内部，否则该单词将是
指针。因此，虽然列表的列表实际上是指针的列表，但
整数列表包含少一个间接寻址的整数。这
访问和构建列表的函数不会注意到，因为整数和
指针具有相同的大小。
尽管如此，垃圾收集器仍然需要
能够识别整数中的指针。一个指针指向一个
堆中的结构良好的块根据定义是活动的（因为它是
被 GC 访问）并应如此标记。一个整数可以有
任何值，如果不采取预防措施，可能会意外地看到
就像一个指针。这可能会导致死块看起来还活着，但很多
更糟糕的是，它还会导致 GC 改变它认为的位
活动块的头部，当它实际上跟随一个整数时
它看起来像一个指针并且弄乱了用户数据。
这就是为什么未装箱的整数提供 31 位（对于 32 位 OCaml）或 63 位（对于
64 位 OCaml) 给 OCaml 程序员。在表示中，后面
场景，包含整数的单词的最低有效位
总是被设置，以区别于指针。 31 位或 63 位
整数相当不寻常，所以任何使用 OCaml 的人都知道
这。 OCaml 用户通常不知道为什么没有
适用于 64 位 OCaml 的 63 位未装箱浮点类型。

I have to add this link to help the OP to understand more A 63-bit floating-point type for 64-bit OCaml

Although the title of the article seems about float, it actually talking about the extra 1 bit

The OCaml runtime allows polymorphism through the uniform
representation of types. Every OCaml value is represented as a single
word, so that it is possible to have a single implementation for, say,
“list of things”, with functions to access (e.g. List.length) and
build (e.g. List.map) these lists that work just the same whether they
are lists of ints, of floats, or of lists of sets of integers.
Anything that does not fit in in a word is allocated in a block in the
heap. The word representing this data is then a pointer to the block.
Since the heap contains only blocks of words, all these pointers are
aligned: their few least significants bits are always unset.
Argumentless constructors (like this: type fruit = Apple | Orange |
Banana) and integers do not represent so much information that they
need to be allocated in the heap. Their representation is unboxed. The
data is directly inside the word that would otherwise have been a
pointer. So while a list of lists is actually a list of pointers, a
list of ints contains the ints with one less indirection. The
functions accessing and building lists do not notice because ints and
pointers have the same size.
Still, the Garbage Collector needs to be
able to recognize pointers from integers. A pointer points to a
well-formed block in the heap that is by definition alive (since it is
being visited by the GC) and should be marked so. An integer can have
any value and could, if precautions were not taken, accidentally look
like a pointer. This could cause dead blocks to look alive, but much
worse, it would also cause the GC to change bits in what it thinks is
the header of a live block, when it is actually following an integer
that looks like a pointer and messing up user data.
This is why unboxed integers provide 31 bits (for 32-bit OCaml) or 63 bits (for
64-bit OCaml) to the OCaml programmer. In the representation, behind
the scenes, the least significant bit of a word containing an integer
is always set, to distinguish it from a pointer. 31- or 63-bit
integers are rather unusual, so anyone who uses OCaml at all knows
this. What users of OCaml do not usually know is why there isn't a
63-bit unboxed float type for 64-bit OCaml.