Total number of characters in UTF-16

Posted 2024-10-16 23:18:38 · 46 words · 2 views · 0 comments

Can you calculate, using permutations/combinations, that the UTF-16 encoding represents 1,112,064 numbers?

Comments (5)

夜光 2024-10-23 23:18:38

Section 3.9 of the Unicode Standard says:

Each encoding form maps the Unicode code points U+0000..U+D7FF and U+E000..U+10FFFF to unique code unit sequences.

Hence the number of code points ('characters') that can be represented by UTF-16 is

0xD7FF + 1 + (0x10FFFF - 0xE000) + 1 = 1 112 064
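The arithmetic above can be checked with a short Python sketch. The constant names are chosen here for illustration and are not taken from the standard.

```python
# Count the code points in U+0000..U+D7FF and U+E000..U+10FFFF.
# Constant names are illustrative, not from the Unicode Standard.
LAST_BEFORE_SURROGATES = 0xD7FF
FIRST_AFTER_SURROGATES = 0xE000
LAST_CODE_POINT = 0x10FFFF

# Both ranges are inclusive, hence the "+ 1" in each count.
low_range = LAST_BEFORE_SURROGATES + 1                     # U+0000..U+D7FF
high_range = LAST_CODE_POINT - FIRST_AFTER_SURROGATES + 1  # U+E000..U+10FFFF
total = low_range + high_range

print(low_range, high_range, total)  # 55296 1056768 1112064
```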

Unicode code points range from U+0000 to U+10FFFF, so a code point needs at most 21 bits and is often stored in a 32-bit unit (as in UTF-32). UTF-8 and UTF-16 instead use 8-bit and 16-bit code units so that the most common characters stay compact, which imposes specific limits on the number of characters they can legally represent. Code points that do not fit in a single code unit are encoded as multi-unit sequences; UTF-16 does this with specially reserved surrogate code points.

Also, being able to represent a given code point in a given encoding doesn't mean it is a valid character: it has to be assigned and described by the Unicode Standard first. Therefore no mathematical formula yields the number of characters that are actually assigned, and the figure 1 112 064 doesn't necessarily mean there are over a million valid characters.
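The gap between "representable" and "assigned" can be sketched in Python. This is only an approximation under two assumptions: general category `Cn` is treated as "unassigned", and the result depends on the Unicode version bundled with the Python interpreter, so no fixed figure is given for the assigned count. The helper `is_scalar_value` is a name made up for this example.

```python
import unicodedata

def is_scalar_value(cp):
    """True for any code point outside the surrogate range (hypothetical helper)."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)

representable = 0
assigned = 0
for cp in range(0x110000):
    if not is_scalar_value(cp):
        continue
    representable += 1
    # Assumption: category "Cn" (unassigned) marks code points with no
    # character allocated in this interpreter's Unicode database.
    if unicodedata.category(chr(cp)) != "Cn":
        assigned += 1

print("representable:", representable)
print("assigned (version dependent):", assigned)
```

The first count is fixed by the encoding; the second grows with each release of the standard.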

For a detailed discussion see section 3.9 of the Unicode Standard.

东京女 2024-10-23 23:18:38

No. The number of characters represented by UTF-16 is only knowable by specification, not by mathematics. UTF-16 is a specific set of encoding rules laid out by people, not an intrinsic property of some formula.

虚拟世界 2024-10-23 23:18:38

There are three kinds of UTF-16 code units:

  • High surrogates (U+D800 to U+DBFF). There are 1024 of these.
  • Low surrogates (U+DC00 to U+DFFF). There are 1024 of these.
  • Directly representable characters in the BMP. There are 65536-2*1024=63488 of these.

There are 1024×1024 = 1,048,576 code points that can be represented through surrogate pairs (the "supplementary characters" U+10000 to U+10FFFF). Add the 63,488 directly representable characters in the BMP and you get 1,112,064.
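The counting argument above can be written out as a small Python sketch. The variable names are illustrative only.

```python
# Surrogate ranges as half-open Python ranges (names are illustrative).
HIGH_SURROGATES = range(0xD800, 0xDC00)  # U+D800..U+DBFF
LOW_SURROGATES = range(0xDC00, 0xE000)   # U+DC00..U+DFFF

# BMP code units that stand for a character on their own.
bmp_direct = 0x10000 - len(HIGH_SURROGATES) - len(LOW_SURROGATES)

# Every (high, low) combination names exactly one supplementary code point.
supplementary = len(HIGH_SURROGATES) * len(LOW_SURROGATES)

capacity = bmp_direct + supplementary
print(bmp_direct, supplementary, capacity)  # 63488 1048576 1112064
```

The product 1024 × 1024 also matches the size of the range U+10000..U+10FFFF, which is why the pairs cover the supplementary planes exactly.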

伪心 2024-10-23 23:18:38

See the answer here https://stackoverflow.com/questions/280182/

It is almost as good as a specification; in fact, it combines several specifications. I'll quote:

UTF-16 is a variable-length code; its characters consume either 2 or 4 bytes. 2-byte values in the range 0xD800-0xDFFF are reserved for constructing 4-byte characters, and all 4-byte characters consist of two bytes in the range 0xD800-0xDBFF followed by 2 bytes in the range 0xDC00-0xDFFF. For this reason, Unicode does not assign any characters in the range U+D800-U+DFFF.

Capacity of UTF-16: 1,112,064
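As a rough illustration of the 2-byte versus 4-byte split described in the quote, here is a minimal Python sketch that turns a code point into UTF-16 code units. The function name `to_utf16_units` is invented for this example; the result is compared with Python's built-in `utf-16-be` codec.

```python
import struct

def to_utf16_units(cp):
    """Return the UTF-16 code units for one scalar value (illustrative helper)."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x10000:
        return [cp]                      # one 16-bit unit (2 bytes)
    offset = cp - 0x10000                # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return [high, low]                   # two 16-bit units (4 bytes)

def codec_units(cp):
    """Same result via Python's built-in codec, for cross-checking."""
    data = chr(cp).encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))

for cp in (0x41, 0xFFFD, 0x10000, 0x1F600, 0x10FFFF):
    units = to_utf16_units(cp)
    assert units == codec_units(cp)
    print("U+%04X ->" % cp, [hex(u) for u in units])
```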

生死何惧 2024-10-23 23:18:38

You can represent 1112064 scalar values in UTF-16 because there are 1112064 scalar values as defined by definition D76 in section 3.9 of the Standard, and because the UTF-16 encoding form (like all Unicode encoding forms) is a unique representation of all Unicode scalar values, as defined in definition D79.

D76 – Unicode scalar value: Any Unicode code point except high-surrogate and low-surrogate code points.

  • As a result of this definition, the set of Unicode scalar values consists of the ranges 0 to D7FF and E000 to 10FFFF, inclusive.

D79 – A Unicode encoding form assigns each Unicode scalar value to a unique code unit sequence.

Of course, these numbers aren't completely arbitrary: they follow from the surrogate pair encoding mechanism laid out in definition D91. Given the bit distribution in Table 3-5, there is simply no way to encode a scalar value higher than 10FFFF.
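The upper bound can be seen by decoding the extreme surrogate pairs. The following Python sketch assumes the usual pair arithmetic (10 bits from each surrogate, plus an offset of 0x10000); `from_surrogate_pair` is a name made up for this example.

```python
def from_surrogate_pair(high, low):
    """Combine a high and a low surrogate into one code point (illustrative helper)."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("high surrogate out of range")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("low surrogate out of range")
    # 10 bits from each surrogate give a 20-bit offset above U+10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

smallest = from_surrogate_pair(0xD800, 0xDC00)
largest = from_surrogate_pair(0xDBFF, 0xDFFF)
print(hex(smallest), hex(largest))  # 0x10000 0x10ffff
```

With every payload bit of both surrogates set, the result is 10FFFF, so no pair can reach a higher value.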
