Why does UTF-16 only support 2^20 code points?
Well, I'm starting to study Unicode now, and I have several doubts. At the moment I'm learning what a plane is. I saw that a plane is a set of 2^16 code points, and that the UTF-16 encoding supports 17 planes, numbered from 0 to 16. My question is the following: if UTF-16 supports up to 32 bits, why in practice does it only encode up to 2^20 code points? Where does the 20 come from? I know that if a code point requires more than 2 bytes, UTF-16 uses two 16-bit units, but how does that fit into all of this? And the final question: where does this 2^20 come from, and not 2^32? Thanks :)
Comments (3)
The original form of Unicode only supported 64k code points (16 bits). The intention was to support all commonly used, modern characters, and 64k really is enough for that (yes, even including Chinese), as the original Unicode introduction notes.
But Unicode grew to encompass almost all human writing, including historic and lesser-used writing systems, and 64k characters was too small to handle that, as the Unicode 2.0 introduction acknowledges. (Unicode 14 has ~145k characters.)
In Unicode 1.x, the typical encoding was UCS-2, which is just a simple 16-bit number defining the code-point. When they decided that they were going to need more (during the Unicode 1.1 timeframe), there were only ~34k code points assigned.
Originally the thought was to create a 32-bit encoding (UCS-4) that could encode 2^31 values with one bit left over, but this would have doubled the size of the encoding, wasting a lot of space, and wouldn't have been backward compatible with UCS-2.
So they decided for Unicode 2.0 to invent a system backward-compatible with all defined UCS-2 code points, but that allowed them to scale larger. That's why they invented the surrogate pair system (which LMD's answer explains well). This created the UTF-16 encoding which completely replaces UCS-2.
The full thinking on how much space was needed for various areas is explained in the Unicode 2.0 Introduction.
The goal was to keep "common" characters in the Basic Multilingual Plane (BMP), and to place lesser-used characters into the surrogate extension area.
The surrogate system "wastes" a lot of code points that could be used for real characters. You could imagine replacing it with a more naïve system with a single "the next character is in the surrogate space" code point. But that would create ambiguity between byte sequences. You couldn't just search for 0x0041 to find the letter A. You'd have to scan backwards to make sure it wasn't a surrogate character, making certain kinds of problems much harder.
That design choice has been pretty solid. In 20 years, with steady additions of more and more obscure scripts and characters, we've used less than 15% of the available space. We definitely didn't need another 10 bits.
Thinking in terms of multiples and powers of 4 helps a lot with understanding UTF-8 and UTF-16. Have a look at how surrogate pairs encode a character U >= 0x10000 (source):

[Diagram: the surrogate-pair bit layout from RFC 2781, where W1 = 110110yyyyyyyyyy and W2 = 110111xxxxxxxxxx encode the 20-bit value U' = U - 0x10000 as yyyyyyyyyyxxxxxxxxxx]
As you can see, from the 32 bits of the 2x16 surrogate pair, 2x6 = 12 bits are used "only" to convey the information that this is indeed a surrogate pair (and not simply two characters with a value < 0x10000). This leaves you with 32 - 12 = 20 bits to store U'.
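Here is that arithmetic as a minimal Python sketch (the function names are my own, but the bit layout is the standard one from RFC 2781):

```python
def encode_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a code point in U+10000..U+10FFFF into a high/low surrogate."""
    assert 0x10000 <= cp <= 0x10FFFF
    u = cp - 0x10000             # U': a 20-bit value in 0x00000..0xFFFFF
    high = 0xD800 | (u >> 10)    # 110110 + top 10 bits of U'
    low = 0xDC00 | (u & 0x3FF)   # 110111 + bottom 10 bits of U'
    return high, low

def decode_surrogate_pair(high: int, low: int) -> int:
    """Reassemble the 20 payload bits and add back the 0x10000 offset."""
    return (((high & 0x3FF) << 10) | (low & 0x3FF)) + 0x10000

high, low = encode_surrogate_pair(0x1F600)  # U+1F600, an emoji beyond the BMP
print(hex(high), hex(low))                  # 0xd83d 0xde00
assert decode_surrogate_pair(high, low) == 0x1F600
```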
(Technically, you additionally have the values U < 0x10000, of which again some are reserved for the low and high surrogates. This means you end up slightly above 2^20 code points that can be encoded by UTF-16, but still well below 2^21: the highest code point supported by UTF-16 is U+10FFFF, not 2^20 = 0x100000.)
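And the exact count behind "slightly above 2^20", as quick arithmetic in Python (plain numbers from the standard, nothing assumed):

```python
bmp           = 1 << 16          # 65,536 code points in plane 0 (U < 0x10000)
supplementary = 1 << 20          # 2^20 code points reachable via surrogate pairs
surrogates    = 0xE000 - 0xD800  # 2,048 BMP values reserved for surrogates

total = bmp + supplementary      # 1,114,112 = 17 planes of 65,536 each
usable = total - surrogates      # 1,112,064 encodable scalar values

print(hex(total - 1))  # 0x10ffff -> the highest code point, U+10FFFF
print(total, usable)   # 1114112 1112064
```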