Total number of UTF-16 characters
Can you calculate, through permutations/combinations, that the UTF-16 encoding represents 1,112,064 numbers?
Section 3.9 of the Unicode Standard says:
Hence the number of code points ("characters") that can be represented by UTF-16 is
Unicode code points are often stored in 32-bit values, although the code space itself only runs from U+0000 to U+10FFFF. Specific encodings, however, use smaller code units so that the most common characters stay compact, and that imposes limits on the number of characters they can legally represent. To describe code points that do not fit in a single 8-bit (UTF-8) or 16-bit (UTF-16) code unit, longer sequences of code units are needed; for UTF-16, special surrogate code points are defined for this purpose.
Also, being able to represent a given code point in a given encoding doesn't mean it is valid; it has to be allocated and described by the Unicode Standard first. Therefore there's no mathematical formula that yields the number of characters that can be represented, and the number 1,112,064 doesn't necessarily mean there are 1.1 million valid characters.
For a detailed discussion, see section 3.9 of the Unicode Standard.
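To illustrate the gap between "representable" and "allocated", here is a rough sketch in Python. It assumes that a General_Category other than `Cn` is an acceptable proxy for "allocated" (this also counts private-use and control code points), and the result depends on the Unicode version bundled with your Python, so no fixed output is shown. The names `SCALAR_VALUES` and `count_assigned` are made up for this example.

```python
import unicodedata

# All code points U+0000..U+10FFFF minus the 2,048 surrogates.
SCALAR_VALUES = 0x110000 - 0x800

def count_assigned():
    """Count scalar values whose General_Category is not Cn (unassigned)."""
    count = 0
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not scalar values
        if unicodedata.category(chr(cp)) != "Cn":
            count += 1
    return count

assigned = count_assigned()
print("representable scalar values:", SCALAR_VALUES)
print("not Cn in Unicode", unicodedata.unidata_version, ":", assigned)
```

The first number is fixed by the encoding rules; the second grows with each release of the standard.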
No. The number of characters represented by UTF-16 is only knowable by specification, not by mathematics. UTF-16 is a specific set of encoding rules laid out by people, not an intrinsic property of some formula.
There are three kinds of UTF-16 code units: high surrogates (0xD800 to 0xDBFF, 1,024 of them), low surrogates (0xDC00 to 0xDFFF, 1,024 of them), and all the rest (63,488 of them), each of which represents a BMP character on its own.
There are 1024×1024 = 1,048,576 characters that can be represented through surrogate pairs (the "supplementary characters" U+10000 to U+10FFFF). Add the 63,488 representable characters in the BMP and you get 1,112,064.
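The arithmetic above can be checked with a few lines of Python. This is only a sketch of the counting argument; the function and constant names are made up for the example.

```python
# Surrogate ranges among the 65,536 possible 16-bit code unit values.
HIGH_SURROGATES = range(0xD800, 0xDC00)  # lead unit of a surrogate pair
LOW_SURROGATES = range(0xDC00, 0xE000)   # trail unit of a surrogate pair

def count_utf16_scalar_values():
    """Return (BMP count, supplementary count, total) for UTF-16."""
    total_units = 0x10000
    high = len(HIGH_SURROGATES)          # 1,024
    low = len(LOW_SURROGATES)            # 1,024
    bmp = total_units - high - low       # code units that stand alone
    supplementary = high * low           # every high/low combination
    return bmp, supplementary, bmp + supplementary

bmp, supplementary, total = count_utf16_scalar_values()
print(bmp, supplementary, total)  # 63488 1048576 1112064
```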
See the answer here: https://stackoverflow.com/questions/280182/
It is almost as good as a specification; well, it combines several specifications. I'll quote:
You can represent 1,112,064 scalar values in UTF-16 because there are 1,112,064 scalar values as defined by definition D76 in section 3.9 of the Standard, and because the UTF-16 encoding form (like all Unicode encoding forms) is a unique representation of all Unicode scalar values, as defined in definition D79.
Of course, these numbers aren't completely arbitrary: they follow from the surrogate pair encoding mechanism laid out in definition D91. Given the bit distribution in Table 3-5, there is just no way to encode a scalar value higher than U+10FFFF.
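To see why U+10FFFF is the ceiling, here is a minimal Python sketch of the surrogate pair arithmetic. It is written from the usual description of UTF-16, not copied from the Standard, and `encode_utf16` and `decode_pair` are names invented for this example.

```python
def encode_utf16(scalar):
    """Return the list of UTF-16 code units for one Unicode scalar value."""
    if not (0 <= scalar <= 0x10FFFF) or 0xD800 <= scalar <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if scalar < 0x10000:
        return [scalar]                    # BMP: one code unit
    offset = scalar - 0x10000              # fits in 20 bits
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return [high, low]

def decode_pair(high, low):
    """Combine a high and a low surrogate into a scalar value."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# The largest possible pair decodes to the largest scalar value.
print(hex(decode_pair(0xDBFF, 0xDFFF)))         # 0x10ffff
print([hex(u) for u in encode_utf16(0x1F600)])  # ['0xd83d', '0xde00']
```

Each surrogate carries 10 payload bits, so a pair covers 2^20 values starting at 0x10000, and nothing above 0x10FFFF has an encoding.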