Why does the Java ecosystem use different character encodings throughout the software stack?
For instance, class files use a CESU-8 variant (usually called Modified UTF-8, or MUTF-8), but internally Java first used UCS-2 and now uses UTF-16. The specification for valid Java source files says that a minimal conforming Java compiler only has to accept ASCII characters.
What's the reason for these choices? Wouldn't it make more sense to use the same encoding throughout the Java ecosystem?
Comments (3)
ASCII for source files is because at the time it wasn't considered reasonable to expect people to have text editors with full Unicode support. Things have improved since, but they still aren't perfect. The whole \uXXXX thing in Java is essentially Java's equivalent to C's trigraphs. (When C was created, some keyboards didn't have curly braces, so you had to use trigraphs!)
At the time Java was created, the class file format used UTF-8 and the runtime used UCS-2. Unicode had fewer than 64k code points, so 16 bits was enough. Later, when additional "planes" were added to Unicode, UCS-2 was replaced with the (pretty much) compatible UTF-16, and UTF-8 was replaced with CESU-8 (hence "Compatibility Encoding Scheme..."; strictly, Java uses its own variant, Modified UTF-8, which also encodes U+0000 as the two bytes C0 80).
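To make the trigraph comparison concrete, here is a minimal sketch of Unicode escapes in an all-ASCII source file. The class and method names are made up for illustration. The point it tries to show is that the escape is translated before the compiler tokenizes the source, so it works in identifiers as well as in string literals.

```java
public class UnicodeEscapeDemo {
    // The e-acute is written as an escape, so this file stays pure ASCII.
    static String escaped() {
        return "caf\u00e9";
    }

    // The same string built from the numeric code point, for comparison.
    static String fromCodePoint() {
        return "caf" + new String(Character.toChars(0xE9));
    }

    // Escapes are translated before tokenizing: the escape below is the
    // letter 'a', so this declares a local variable named a.
    static int identifierEscape() {
        int \u0061 = 42;
        return a;
    }

    public static void main(String[] args) {
        System.out.println(escaped().equals(fromCodePoint())); // true
        System.out.println(escaped().length());                // 4
        System.out.println(identifierEscape());                // 42
    }
}
```

Because the translation happens so early, an escape that is not followed by four hex digits is a compile error even inside a comment, which is one reason the mechanism feels as awkward as trigraphs.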
In the class file format they wanted to use UTF-8 to save space. The design of the class file format (including the JVM instruction set) was heavily geared towards compactness.
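One way to see the class file string encoding from ordinary code is DataOutputStream.writeUTF, which is documented to write modified UTF-8 (after a two-byte length prefix). The sketch below assumes that this matches what class file constants use; the helper names modifiedUtf8 and hex are invented for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ModifiedUtf8Demo {
    // Encodes s with writeUTF and strips the 2-byte length prefix.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Formats bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String nul = String.valueOf((char) 0);
        String grin = new String(Character.toChars(0x1F600)); // outside the BMP

        // NUL becomes two bytes, so encoded strings never contain a zero byte.
        System.out.println(hex(modifiedUtf8(nul)));  // C0 80
        // Each surrogate is encoded separately as three bytes (CESU-8 style).
        System.out.println(hex(modifiedUtf8(grin))); // ED A0 BD ED B8 80
        // Standard UTF-8 encodes the same code point in four bytes.
        System.out.println(hex(grin.getBytes(StandardCharsets.UTF_8))); // F0 9F 98 80
    }
}
```

For plain ASCII text, which covers most identifiers and descriptors, the output is one byte per character, which fits the compactness argument above.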
In the runtime they wanted to use UCS-2 because it was felt that saving space was less important than being able to avoid the need to deal with variable-width characters. Unfortunately, this kind of backfired now that it's UTF-16, because a code point can now take multiple "chars", and worse, the "char" datatype is now sort of misnamed (it no longer corresponds to a character, in general, but instead corresponds to a UTF-16 code unit).
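The mismatch between "char" and "character" is easy to observe. A small sketch, with illustrative class and method names, comparing the number of UTF-16 code units with the number of code points:

```java
public class CodeUnitDemo {
    // U+1F600 lies outside the Basic Multilingual Plane.
    static final String GRIN = new String(Character.toChars(0x1F600));

    // length() counts UTF-16 code units, i.e. Java chars.
    static int codeUnits(String s) {
        return s.length();
    }

    // codePointCount counts Unicode code points.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String s = "a" + GRIN;
        System.out.println(codeUnits(s));  // 3
        System.out.println(codePoints(s)); // 2
        // charAt(1) returns only half of the second character.
        System.out.println(Character.isHighSurrogate(s.charAt(1))); // true
    }
}
```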
MUTF-8 for efficiency, UCS2 for hysterical raisins. :)
In 1993, UCS2 was Unicode; everyone thought 65536 Characters Ought To Be Enough For Everyone.
Later on, when it became clear that indeed, there are an awful lot of languages in the world, it was too late, not to mention a terrible idea, to redefine 'char' to be 32 bits, so instead a mostly-backward-compatible choice was made.
In a way that's closely analogous to the relationship between ASCII and UTF-8, Java strings that don't stray outside the historical UCS2 boundaries are bit-identical to their UTF16 representation. It's only when you colour outside those lines that you have to start worrying about surrogates, etc.
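As a rough illustration of that boundary, the sketch below (helper names are hypothetical) checks whether a string stays inside the old UCS-2 range and shows that such strings take exactly two bytes per char in UTF-16, while anything beyond it needs a surrogate pair.

```java
import java.nio.charset.StandardCharsets;

public class SurrogateDemo {
    // True if no char in s is a surrogate, i.e. every char is a whole
    // character and the string would also have been valid UCS-2.
    static boolean fitsInUcs2(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isSurrogate(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    // Size of the big-endian UTF-16 encoding, without a byte order mark.
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String bmp = "h\u00e9llo";
        String grin = new String(Character.toChars(0x1F600));

        System.out.println(fitsInUcs2(bmp));   // true
        System.out.println(utf16Bytes(bmp));   // 10, two bytes per char
        System.out.println(fitsInUcs2(grin));  // false
        System.out.println(utf16Bytes(grin));  // 4, one surrogate pair
    }
}
```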
It seems to be a common software development problem. Early code is one standard, usually the simplest to implement at the time it was created, then later versions add in support for newer/better/less common/more complex standards.
A minimal compiler probably only has to take ASCII because that's what many common editors use. These editors may not be ideal for working with Java and are nowhere near a full IDE, but they are often adequate to tweak one source file.
Java seems to have attempted to set the bar higher and handle Unicode, but they also left that ASCII 'bailout' option in place. I'm sure there are notes from some committee meeting that explain why.