Why is "U" used to designate a Unicode code point?
Why do Unicode code points appear as U+<codepoint>? For example, U+2202 represents the character ∂. Why not U- (dash or hyphen character) or anything else?
Comments (4)
The characters “U+” are an ASCIIfied version of the MULTISET UNION “⊎” U+228E character (the U-like union symbol with a plus sign inside it), which was meant to symbolize Unicode as the union of character sets. See Kenneth Whistler’s explanation in the Unicode mailing list.
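As a small aside, the connection between that code point and the character it names is easy to check; the sketch below (plain Python 3, illustrative only) writes the MULTISET UNION character via its own escape and then labels it in the "U+" style.

    # Illustrative sketch (Python 3): the MULTISET UNION character that "U+"
    # is said to echo, written via its own code point.
    multiset_union = "\u228E"              # U+228E MULTISET UNION
    print(multiset_union)                  # ⊎
    print(f"U+{ord(multiset_union):04X}")  # U+228E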
The Unicode Standard needs some notation for talking about code points and character names. It adopted the convention of "U+" followed by four or more hexadecimal digits at least as far back as The Unicode Standard, version 2.0.0, published in 1996 (source: archived PDF copy on the Unicode Consortium web site).
The "U+" notation is useful. It gives a way of marking hexadecimal digits as being Unicode code points, instead of octets, or unrestricted 16-bit quantities, or characters in other encodings. It works well in running text. The "U" suggests "Unicode".
My personal recollection from early-1990s software industry discussions about Unicode is that a convention of "U+" followed by four hexadecimal digits was common during the Unicode 1.0 and Unicode 2.0 era. At the time, Unicode was seen as a 16-bit system. With the advent of Unicode 3.0 and the encoding of characters at code points of U+010000 and above, the convention of "U-" followed by six hexadecimal digits came into use, specifically to highlight the extra two digits in the number. (Or maybe it was the other way around, a shift from "U-" to "U+".) In my experience, the "U+" convention is now much more common than the "U-" convention, and few people use the difference between "U+" and "U-" to indicate the number of digits.
I wasn't able to find documentation of the shift from "U+" to "U-", though. Archived mailing list messages from the 1990s should have evidence of it, but I can't conveniently point to any. The Unicode Standard 2.0 declared, "Unicode character codes have a uniform width of 16 bits." (p. 2-3). It laid down its convention that "an individual Unicode value is expressed as U+nnnn, where nnnn is a four digit number in hexadecimal notation" (p. 1-5). Surrogate values were allocated, but no character codes were defined above U+FFFF, and there was no mention of UTF-16 or UTF-32. It used "U+" with four digits. The Unicode Standard 3.0.0, published in 2000, defined UTF-16 (pp. 46-47) and discussed code points of U+010000 and above. It used "U+" with four digits in some places, and with six digits in other places. The firmest trace I found was in The Unicode Standard, version 6.0.0, where a table of BNF syntax notation defines the symbols U+HHHH and U-HHHHHHHH (p. 559).

The "U+" notation is not the only convention for representing Unicode code points or code units. For instance, the Python language defines the following string literals:
- u'xyz' to indicate a Unicode string, a sequence of Unicode characters
- '\uxxxx' to indicate a string with a Unicode character denoted by four hex digits
- '\Uxxxxxxxx' to indicate a string with a Unicode character denoted by eight hex digits
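For what it's worth, here is a minimal Python 3 sketch of those literal forms (the specific characters chosen are just examples):

    # Minimal sketch (Python 3) of the literal forms listed above.
    s1 = u'xyz'          # the u prefix marks a Unicode string (redundant in Python 3)
    s2 = '\u2202'        # four hex digits: U+2202, PARTIAL DIFFERENTIAL
    s3 = '\U0001F600'    # eight hex digits, needed for code points above U+FFFF
    print(s2, hex(ord(s2)))  # ∂ 0x2202
    print(s3, hex(ord(s3)))  # 😀 0x1f600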
It depends on what version of the Unicode standard you are talking about. From Wikipedia:
It is just a convention to show that the value is Unicode. A bit like '0x' or 'h' for hex values (0xB9 or B9h). Why 0xB9 and not 0hB9 (or &hB9 or $B9)? Just because that's how the coin flipped :-)
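For comparison, a tiny Python sketch showing that these are all just spellings of the same number (the "U+" label at the end is only there to tie it back to the question):

    # Minimal sketch: one value, several prefix conventions.
    value = 0xB9             # Python (and C) spell hexadecimal with the "0x" prefix
    assert value == 185      # the same number in decimal
    print(f"U+{value:04X}")  # U+00B9, the Unicode-style label for the same value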