Unicode characters with asymmetric upper/lower case. Why?
Why do the following three characters have asymmetric toLower and toUpper results?
/**
* Written in the Scala programming language, typed into the Scala REPL.
* Results commented accordingly.
*/
/* Unicode Character 'LATIN CAPITAL LETTER SHARP S' (U+1E9E) */
'\u1e9e'.toHexString == "1e9e" // "1e9e" == "1e9e"
'\u1e9e'.toLower.toHexString == "df" // "df" == "df"
'\u1e9e'.toHexString == '\u1e9e'.toLower.toUpper.toHexString // "1e9e" != "df"
/* Unicode Character 'KELVIN SIGN' (U+212A) */
'\u212a'.toHexString == "212a" // "212a" == "212a"
'\u212a'.toLower.toHexString == "6b" // "6b" == "6b"
'\u212a'.toHexString == '\u212a'.toLower.toUpper.toHexString // "212a" != "4b"
/* Unicode Character 'LATIN CAPITAL LETTER I WITH DOT ABOVE' (U+0130) */
'\u0130'.toHexString == "130" // "130" == "130"
'\u0130'.toLower.toHexString == "69" // "69" == "69"
'\u0130'.toHexString == '\u0130'.toLower.toUpper.toHexString // "130" != "49"
Comments (2)
For the first one, there is this explanation:
In other words, U+1E9E lower-cases to U+00DF, but the upper-case of U+00DF is not U+1E9E.
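The sharp s asymmetry can be reproduced by comparing the per-character and per-string conversions. This is a minimal sketch, assuming a JVM-backed Scala where Char.toLower and Char.toUpper follow java.lang.Character; the object name SharpSDemo is only illustrative.

```scala
import java.util.Locale

object SharpSDemo {
  val capitalSharpS: Char = '\u1e9e' // LATIN CAPITAL LETTER SHARP S
  val smallSharpS: Char   = '\u00df' // LATIN SMALL LETTER SHARP S

  // One-to-one mapping: the capital form lower-cases to U+00DF.
  val lowered: Char = capitalSharpS.toLower

  // A single Char cannot hold "SS", so the Char conversion leaves U+00DF unchanged.
  val upperedChar: Char = smallSharpS.toUpper

  // The String conversion applies the full one-to-many mapping and yields "SS".
  val upperedString: String = smallSharpS.toString.toUpperCase(Locale.ROOT)

  def main(args: Array[String]): Unit = {
    println(lowered.toInt.toHexString)     // df
    println(upperedChar.toInt.toHexString) // df
    println(upperedString)                 // SS
  }
}
```

Neither conversion ever produces U+1E9E, which is why the round trip in the question cannot come back to the starting character.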
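For the second one, U+212A (KELVIN SIGN) lower-cases to U+006B (LATIN SMALL LETTER K). The upper-case of U+006B is U+004B (LATIN CAPITAL LETTER K). This one seems to make sense to me.
For the third case, U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE) is a Turkish/Azerbaijani character that lower-cases to U+0069 (LATIN SMALL LETTER I). I would imagine that in a Turkish/Azerbaijani locale you would get U+0130 back when upper-casing U+0069, but that mapping is not universal: elsewhere U+0069 upper-cases to U+0049.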
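The locale dependence guessed at above can be checked with the locale-aware String methods. A small sketch, assuming the JVM's standard Turkish casing rules; the object name DottedIDemo and the choice of the "tr" language tag are illustrative.

```scala
import java.util.Locale

object DottedIDemo {
  // Locale.forLanguageTag avoids the Locale constructors, which are deprecated on newer JDKs.
  val turkish: Locale = Locale.forLanguageTag("tr")

  // In a Turkish locale, small i upper-cases to U+0130, so the round trip is symmetric.
  val upperTurkish: String = "i".toUpperCase(turkish)
  val roundTrip: String    = upperTurkish.toLowerCase(turkish)

  // In the locale-neutral root locale, small i upper-cases to plain U+0049.
  val upperRoot: String = "i".toUpperCase(Locale.ROOT)

  def main(args: Array[String]): Unit = {
    println(upperTurkish.head.toInt.toHexString) // 130
    println(roundTrip)                           // i
    println(upperRoot)                           // I
  }
}
```

The Char methods used in the question take no locale argument, so they can only apply the default mapping.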
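Characters need not necessarily have symmetric upper- and lower-case transformations.
Edit: To respond to PhiLho's comment below, the Unicode 6.0 spec has this to say about U+212A (KELVIN SIGN):
In other words, you shouldn't really be using U+212A; use U+004B (LATIN CAPITAL LETTER K) instead. If you normalize your Unicode text, U+212A is replaced with U+004B.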
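The normalization step can be sketched with java.text.Normalizer from the JDK. This assumes NFC is the form in use (U+212A has a canonical mapping to U+004B, so the other standard forms behave the same way for this character); the object name KelvinDemo is illustrative.

```scala
import java.text.Normalizer
import java.util.Locale

object KelvinDemo {
  val kelvinSign: String = "\u212a" // KELVIN SIGN

  // Normalization replaces the Kelvin sign with LATIN CAPITAL LETTER K (U+004B).
  val normalized: String = Normalizer.normalize(kelvinSign, Normalizer.Form.NFC)

  // After normalization the lower/upper round trip returns to the starting character.
  val roundTrip: String =
    normalized.toLowerCase(Locale.ROOT).toUpperCase(Locale.ROOT)

  def main(args: Array[String]): Unit = {
    println(normalized.head.toInt.toHexString) // 4b
    println(roundTrip == normalized)           // true
  }
}
```

Normalizing input before any case conversion removes the Kelvin asymmetry entirely, since only U+004B and U+006B remain in play.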
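May I refer to another post about Unicode and upper and lower case?
It is a common mistake to think that the signs of a language must be available in both upper and lower case!
Unicode-correct title case in Java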