Why does UTF-16 only support 2^20 code points?
Well, I'm starting to study Unicode now, and I have several doubts. At the moment I'm learning what a plane is. I saw that a plane is a set of 2^16 code points, and that the UTF-16 encoding supports 17 planes, numbered from 0 to 16. My question is the following: if UTF-16 supports up to 32 bits, why in practice does it only encode up to 2^20 code points? Where does the 20 come from? I know that if a code point requires more than 2 bytes, UTF-16 uses two 16-bit units, but how does that fit into all of this? And the final question: where does this 2^20 come from, and not 2^32? Thanks :)
Comments (3)
The original form of Unicode only supported 64k code points (16 bits). The intention was to support all commonly used, modern characters, and 64k really is enough for that (yes, even including Chinese), as the original Unicode introduction notes.
But Unicode grew to encompass almost all human writing, including historic and lesser-used writing systems, and 64k characters was too small to handle that, as the Unicode 2.0 introduction acknowledges. (Unicode 14 has ~145k characters.)
In Unicode 1.x, the typical encoding was UCS-2, which is just a simple 16-bit number defining the code-point. When they decided that they were going to need more (during the Unicode 1.1 timeframe), there were only ~34k code points assigned.
Originally the thought was to create a 32-bit encoding (UCS-4) that could encode 2^31 values with one bit left over, but this would have doubled the size of the encoding, wasting a lot of space, and wouldn't have been backward compatible with UCS-2.
So they decided for Unicode 2.0 to invent a system backward-compatible with all defined UCS-2 code points, but that allowed them to scale larger. That's why they invented the surrogate pair system (which LMD's answer explains well). This created the UTF-16 encoding which completely replaces UCS-2.
The full thinking on how much space was needed for various areas is explained in the Unicode 2.0 Introduction.
The goal was to keep "common" characters in the Basic Multilingual Plane (BMP), and to place lesser-used characters into the surrogate extension area.
The surrogate system "wastes" a lot of code points that could be used for real characters. You could imagine replacing it with a more naïve system with a single "the next character is in the surrogate space" code point. But that would create ambiguity between byte sequences. You couldn't just search for 0x0041 to find the letter A. You'd have to scan backwards to make sure it wasn't a surrogate character, making certain kinds of problems much harder.
That design choice has been pretty solid. In 20 years, with steady additions of more and more obscure scripts and characters, we've used less than 15% of the available space. We definitely didn't need another 10 bits.
Thinking in terms of multiples and powers of 4 helps a lot with understanding UTF-8 and UTF-16. Have a look at how surrogate pairs encode a character U >= 0x10000 (source):

[Diagram: the surrogate-pair bit layout from RFC 2781, where W1 = 110110yyyyyyyyyy and W2 = 110111xxxxxxxxxx encode the 20-bit value U' = U - 0x10000 as yyyyyyyyyyxxxxxxxxxx]
As you can see, from the 32 bits of the 2x16 surrogate pair, 2x6 = 12 bits are used "only" to convey the information that this is indeed a surrogate pair (and not simply two characters with a value < 0x10000). This leaves you with 32 - 12 = 20 bits to store U'.
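Here is that arithmetic as a minimal Python sketch (the function names are my own, but the bit layout is the standard one from RFC 2781):

```python
def encode_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a code point in U+10000..U+10FFFF into a high/low surrogate."""
    assert 0x10000 <= cp <= 0x10FFFF
    u = cp - 0x10000             # U': a 20-bit value in 0x00000..0xFFFFF
    high = 0xD800 | (u >> 10)    # 110110 + top 10 bits of U'
    low = 0xDC00 | (u & 0x3FF)   # 110111 + bottom 10 bits of U'
    return high, low

def decode_surrogate_pair(high: int, low: int) -> int:
    """Reassemble the 20 payload bits and add back the 0x10000 offset."""
    return (((high & 0x3FF) << 10) | (low & 0x3FF)) + 0x10000

high, low = encode_surrogate_pair(0x1F600)  # U+1F600, an emoji beyond the BMP
print(hex(high), hex(low))                  # 0xd83d 0xde00
assert decode_surrogate_pair(high, low) == 0x1F600
```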
(Technically, you additionally have the values U < 0x10000, of which again some are reserved for the low and high surrogates. This means you end up slightly above 2^20 code points that can be encoded by UTF-16, but still well below 2^21: the highest code point supported by UTF-16 is U+10FFFF, not 2^20 = 0x100000.)
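And the exact count behind "slightly above 2^20", as quick arithmetic in Python (plain numbers from the standard, nothing assumed):

```python
bmp           = 1 << 16          # 65,536 code points in plane 0 (U < 0x10000)
supplementary = 1 << 20          # 2^20 code points reachable via surrogate pairs
surrogates    = 0xE000 - 0xD800  # 2,048 BMP values reserved for surrogates

total = bmp + supplementary      # 1,114,112 = 17 planes of 65,536 each
usable = total - surrogates      # 1,112,064 encodable scalar values

print(hex(total - 1))  # 0x10ffff -> the highest code point, U+10FFFF
print(total, usable)   # 1114112 1112064
```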