Why is "U" used to designate a Unicode code point?

Posted 2024-08-02 05:58:18

Why do Unicode code points appear as U+<codepoint>?

For example, U+2202 represents the character "∂".

Why not U- (dash or hyphen character) or anything else?

Comments (4)

只怪假的太真实 2024-08-09 05:58:18

The characters “U+” are an ASCIIfied version of the MULTISET UNION “⊎” U+228E character (the U-like union symbol with a plus sign inside it), which was meant to symbolize Unicode as the union of character sets. See Kenneth Whistler’s explanation in the Unicode mailing list.

说谎友 2024-08-09 05:58:18

The Unicode Standard needs some notation for talking about code points and character names. It adopted the convention of "U+" followed by four or more hexadecimal digits at least as far back as The Unicode Standard, version 2.0.0, published in 1996 (source: archived PDF copy on Unicode Consortium web site).

The "U+" notation is useful. It gives a way of marking hexadecimal digits as being Unicode code points, instead of octets, or unrestricted 16-bit quantities, or characters in other encodings. It works well in running text. The "U" suggests "Unicode".

My personal recollection from early-1990s software industry discussions about Unicode is that a convention of "U+" followed by four hexadecimal digits was common during the Unicode 1.0 and Unicode 2.0 era. At the time, Unicode was seen as a 16-bit system. With the advent of Unicode 3.0 and the encoding of characters at code points of U+010000 and above, the convention of "U-" followed by six hexadecimal digits came into use, specifically to highlight the extra two digits in the number. (Or maybe it was the other way around, a shift from "U-" to "U+".) In my experience, the "U+" convention is now much more common than the "U-" convention, and few people use the difference between "U+" and "U-" to indicate the number of digits.

I wasn't able to find documentation of the shift from "U+" to "U-", though. Archived mailing list messages from the 1990s should have evidence of it, but I can't conveniently point to any. The Unicode Standard 2.0 declared, "Unicode character codes have a uniform width of 16 bits." (p. 2-3). It laid down its convention that "an individual Unicode value is expressed as U+nnnn, where nnnn is a four digit number in hexadecimal notation" (p. 1-5). Surrogate values were allocated, but no character codes were defined above U+FFFF, and there was no mention of UTF-16 or UTF-32. It used "U+" with four digits. The Unicode Standard 3.0.0, published in 2000, defined UTF-16 (p. 46-47) and discussed code points of U+010000 and above. It used "U+" with four digits in some places, and with six digits in other places. The firmest trace I found was in The Unicode Standard, version 6.0.0, where a table of BNF syntax notation defines symbols U+HHHH and U-HHHHHHHH (p. 559).
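
To make the two notations in that BNF table concrete, here is a small Python sketch (my own illustration, not text from the standard). It prints a supplementary-plane code point in both forms, plus the UTF-16 code units that made the extra digits matter:

    # Sketch: "U+" with four or more hex digits vs. "U-" with a fixed eight digits,
    # shown for the first code point beyond U+FFFF, plus its UTF-16 code units.
    cp = 0x10000

    print(f"U+{cp:04X}")  # U+10000
    print(f"U-{cp:08X}")  # U-00010000

    # In UTF-16 this single code point becomes a surrogate pair of two code units:
    code_units = chr(cp).encode("utf-16-be")
    print(" ".join(f"{b:02X}" for b in code_units))  # D8 00 DC 00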

The "U+" notation is not the only convention for representing Unicode code points or code units. For instance, the Python language defines the following string literals:

  • u'xyz' to indicate a Unicode string, a sequence of Unicode characters
  • '\uxxxx' to indicate a string with a Unicode character denoted by four hex digits
  • '\Uxxxxxxxx' to indicate a string with a Unicode character denoted by eight hex digits
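
All three forms can spell the same character, U+2202; a minimal Python 3 sketch (my addition; in Python 3 the u prefix is accepted but redundant):

    # Sketch: three literal spellings of the same character, U+2202 (∂).
    a = u'∂'          # u-prefixed Unicode string literal
    b = '\u2202'      # \u escape: exactly four hex digits
    c = '\U00002202'  # \U escape: exactly eight hex digits

    assert a == b == c
    print(f"U+{ord(a):04X}")  # U+2202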

爱要勇敢去追 2024-08-09 05:58:18

It depends on what version of the Unicode standard you are talking about. From Wikipedia:

Older versions of the standard used similar notations, but with slightly different rules. For example, Unicode 3.0 used "U-" followed by eight digits, and allowed "U+" to be used only with exactly four digits to indicate a code unit, not a code point.

々眼睛长脚气 2024-08-09 05:58:18

It is just a convention to show that the value is Unicode. A bit like '0x' or 'h' for hex values (0xB9 or B9h). Why 0xB9 and not 0hB9 (or &hB9 or $B9)? Just because that's how the coin flipped :-)
