Why is "U" used to designate a Unicode code point?
Why do Unicode code points appear as U+<codepoint>? For example, U+2202 represents the character ∂. Why not U- (dash or hyphen character) or anything else?
Comments (4)
The characters “U+” are an ASCIIfied version of the MULTISET UNION “⊎” U+228E character (the U-like union symbol with a plus sign inside it), which was meant to symbolize Unicode as the union of character sets. See Kenneth Whistler’s explanation in the Unicode mailing list.
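As a small aside, the connection between that code point and the character it names is easy to check; the sketch below (plain Python 3, illustrative only) writes the MULTISET UNION character via its own escape and then labels it in the "U+" style.

    # Illustrative sketch (Python 3): the MULTISET UNION character that "U+"
    # is said to echo, written via its own code point.
    multiset_union = "\u228E"              # U+228E MULTISET UNION
    print(multiset_union)                  # ⊎
    print(f"U+{ord(multiset_union):04X}")  # U+228E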
The Unicode Standard needs some notation for talking about code points and character names. It adopted the convention of "U+" followed by four or more hexadecimal digits at least as far back as The Unicode Standard, version 2.0.0, published in 1996 (source: archived PDF copy on the Unicode Consortium web site).
The "U+" notation is useful. It gives a way of marking hexadecimal digits as being Unicode code points, instead of octets, or unrestricted 16-bit quantities, or characters in other encodings. It works well in running text. The "U" suggests "Unicode".
My personal recollection from early-1990s software industry discussions about Unicode is that a convention of "U+" followed by four hexadecimal digits was common during the Unicode 1.0 and Unicode 2.0 era. At the time, Unicode was seen as a 16-bit system. With the advent of Unicode 3.0 and the encoding of characters at code points of U+010000 and above, the convention of "U-" followed by six hexadecimal digits came into use, specifically to highlight the extra two digits in the number. (Or maybe it was the other way around, a shift from "U-" to "U+".) In my experience, the "U+" convention is now much more common than the "U-" convention, and few people use the difference between "U+" and "U-" to indicate the number of digits.
I wasn't able to find documentation of the shift from "U+" to "U-", though. Archived mailing list messages from the 1990s should have evidence of it, but I can't conveniently point to any. The Unicode Standard 2.0 declared, "Unicode character codes have a uniform width of 16 bits." (p. 2-3). It laid down its convention that "an individual Unicode value is expressed as U+nnnn, where nnnn is a four digit number in hexadecimal notation" (p. 1-5). Surrogate values were allocated, but no character codes were defined above U+FFFF, and there was no mention of UTF-16 or UTF-32. It used "U+" with four digits. The Unicode Standard 3.0.0, published in 2000, defined UTF-16 (pp. 46-47) and discussed code points of U+010000 and above. It used "U+" with four digits in some places, and with six digits in other places. The firmest trace I found was in The Unicode Standard, version 6.0.0, where a table of BNF syntax notation defines the symbols U+HHHH and U-HHHHHHHH (p. 559).

The "U+" notation is not the only convention for representing Unicode code points or code units. For instance, the Python language defines the following string literals:
- u'xyz' to indicate a Unicode string, a sequence of Unicode characters
- '\uxxxx' to indicate a string with a Unicode character denoted by four hex digits
- '\Uxxxxxxxx' to indicate a string with a Unicode character denoted by eight hex digits
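For what it's worth, here is a minimal Python 3 sketch of those literal forms (the specific characters chosen are just examples):

    # Minimal sketch (Python 3) of the literal forms listed above.
    s1 = u'xyz'          # the u prefix marks a Unicode string (redundant in Python 3)
    s2 = '\u2202'        # four hex digits: U+2202, PARTIAL DIFFERENTIAL
    s3 = '\U0001F600'    # eight hex digits, needed for code points above U+FFFF
    print(s2, hex(ord(s2)))  # ∂ 0x2202
    print(s3, hex(ord(s3)))  # 😀 0x1f600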
It depends on what version of the Unicode standard you are talking about. From Wikipedia:
It is just a convention to show that the value is Unicode. A bit like '0x' or 'h' for hex values (0xB9 or B9h). Why 0xB9 and not 0hB9 (or &hB9 or $B9)? Just because that's how the coin flipped :-)
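For comparison, a tiny Python sketch showing that these are all just spellings of the same number (the "U+" label at the end is only there to tie it back to the question):

    # Minimal sketch: one value, several prefix conventions.
    value = 0xB9             # Python (and C) spell hexadecimal with the "0x" prefix
    assert value == 185      # the same number in decimal
    print(f"U+{value:04X}")  # U+00B9, the Unicode-style label for the same value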