Unicode characters with asymmetric upper/lower case. Why?

Posted on 2024-12-06 04:08:31

Why do the following three characters not have symmetric toLower, toUpper results?

/**
  * Written in the Scala programming language, typed into the Scala REPL.
  * Results commented accordingly.
  */
/* Unicode Character 'LATIN CAPITAL LETTER SHARP S' (U+1E9E) */
'\u1e9e'.toHexString == "1e9e" // true
'\u1e9e'.toLower.toHexString == "df" // "df" == "df"
'\u1e9e'.toHexString == '\u1e9e'.toLower.toUpper.toHexString // "1e9e" != "df"
/* Unicode Character 'KELVIN SIGN' (U+212A) */
'\u212a'.toHexString == "212a" // "212a" == "212a"
'\u212a'.toLower.toHexString == "6b" // "6b" == "6b"
'\u212a'.toHexString == '\u212a'.toLower.toUpper.toHexString // "212a" != "4b"
/* Unicode Character 'LATIN CAPITAL LETTER I WITH DOT ABOVE' (U+0130) */
'\u0130'.toHexString == "130" // "130" == "130"
'\u0130'.toLower.toHexString == "69" // "69" == "69"
'\u0130'.toHexString == '\u0130'.toLower.toUpper.toHexString // "130" != "49"



﹏半生如梦愿梦如真 2024-12-13 04:08:31

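For the first one, there is this explanation:

In German, the sharp S ("ß", U+00DF) is a lowercase letter whose uppercase form is the two letters "SS".

In other words, U+1E9E lower-cases to U+00DF, but the upper-case of U+00DF is not U+1E9E.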
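The sharp S case can be checked directly on the JVM. The following is a minimal sketch, not a definitive reference: the object name `SharpS` and its field names are made up for illustration. It contrasts the Char-level mapping, which has to return exactly one Char, with the String-level mapping, which is allowed to expand to "SS".

```scala
import java.util.Locale

// Hypothetical helper object, for illustration only.
object SharpS {
  val capitalSharpS: Char = '\u1e9e' // LATIN CAPITAL LETTER SHARP S
  val smallSharpS: Char   = '\u00df' // LATIN SMALL LETTER SHARP S

  // Char-level lower-casing: U+1E9E maps to U+00DF.
  val lowered: Char = capitalSharpS.toLower

  // Char-level upper-casing must return a single Char, so U+00DF stays U+00DF.
  val charUpper: Char = smallSharpS.toUpper

  // String-level upper-casing may change the length, so U+00DF becomes "SS".
  val stringUpper: String = smallSharpS.toString.toUpperCase(Locale.ROOT)

  def main(args: Array[String]): Unit = {
    println(f"lowered     = U+${lowered.toInt}%04X")
    println(f"charUpper   = U+${charUpper.toInt}%04X")
    println(s"stringUpper = $stringUpper")
  }
}
```

Neither path leads back to U+1E9E, which is why the round trip in the question fails.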

For the second one, U+212A (KELVIN SIGN) lower-cases to U+006B (LATIN SMALL LETTER K). The upper-case of U+006B is U+004B (LATIN CAPITAL LETTER K). That seems to make sense to me.

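For the third case, U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE) is a Turkish/Azeri character that lower-cases to U+0069 (LATIN SMALL LETTER I). I would imagine that in a Turkish/Azeri locale you would get the proper upper-case version of U+0069, but that is not necessarily universal.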
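The locale guess above can be tried with the locale-aware `String` methods. This is a small sketch under the assumption that the JVM's built-in Turkish casing rules apply; the object name `DottedI` and its fields are hypothetical. `Char.toUpper` takes no locale, so it cannot make this distinction.

```scala
import java.util.Locale

// Hypothetical helper object, for illustration only.
object DottedI {
  val turkish: Locale = Locale.forLanguageTag("tr")

  // Locale-neutral casing: "i" upper-cases to plain "I" (U+0049).
  val upperRoot: String = "i".toUpperCase(Locale.ROOT)

  // Turkish casing: "i" upper-cases to U+0130, the dotted capital I.
  val upperTurkish: String = "i".toUpperCase(turkish)

  // Turkish casing: U+0130 lower-cases back to "i", so the round trip is symmetric.
  val lowerTurkish: String = "\u0130".toLowerCase(turkish)

  def main(args: Array[String]): Unit = {
    println(f"ROOT:    U+${upperRoot.head.toInt}%04X")
    println(f"Turkish: U+${upperTurkish.head.toInt}%04X")
  }
}
```

So the asymmetry in the question comes from using a locale-independent, single-Char mapping for a letter whose casing is locale-dependent.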

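Characters do not necessarily have symmetric upper- and lower-case transformations.

Edit: In response to PhiLho's comment below, the Unicode 6.0 spec has this to say about U+212A (KELVIN SIGN):

Three letterlike symbols have been given canonical equivalence to regular letters: U+2126 OHM SIGN, U+212A KELVIN SIGN, and U+212B ANGSTROM SIGN. In all three instances, the regular letter should be used. If text is normalized according to Unicode Standard Annex #15, "Unicode Normalization Forms," these three characters will be replaced by their regular equivalents.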

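In other words, you shouldn't really be using U+212A; you should use U+004B (LATIN CAPITAL LETTER K) instead, and if you normalize your Unicode text, U+212A should be replaced with U+004B.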
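As a rough illustration of that normalization step, `java.text.Normalizer` from the Java standard library can be applied to the three letterlike symbols. The object name `LetterlikeNormalization` and the helper `nfc` are made up for this sketch.

```scala
import java.text.Normalizer

// Hypothetical helper object, for illustration only.
object LetterlikeNormalization {
  // Normalize a string to Unicode Normalization Form C.
  def nfc(s: String): String = Normalizer.normalize(s, Normalizer.Form.NFC)

  val kelvin: String   = nfc("\u212a") // KELVIN SIGN   -> U+004B LATIN CAPITAL LETTER K
  val ohm: String      = nfc("\u2126") // OHM SIGN      -> U+03A9 GREEK CAPITAL LETTER OMEGA
  val angstrom: String = nfc("\u212b") // ANGSTROM SIGN -> U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE

  def main(args: Array[String]): Unit =
    Seq(kelvin, ohm, angstrom).foreach(s => println(f"U+${s.head.toInt}%04X"))
}
```

Once the text is normalized, the Kelvin sign is already a plain K, so K and k round-trip through toLower and toUpper as expected.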