Character size in Java vs. C
Why do characters in Java need twice as much space to store as characters in C?
In Java characters are 16-bit, and in C they are 8-bit.
A more general question is why is this so?
To find out why, you need to look at the history and come to conclusions/opinions on the subject.
When C was developed in the USA, ASCII was pretty standard there and you only really needed 7 bits, but with 8 you could handle some non-ASCII characters as well. That seemed like more than enough. Many text-based protocols like SMTP (email), XML and FIX still only use ASCII characters. Email and XML encode non-ASCII characters. Binary files, sockets and streams are still natively 8-bit bytes.
BTW: C can support wider characters, but that is not plain char.
When Java was developed, 16 bits seemed like enough to support most languages. Since then Unicode has been extended to characters above 65535, and Java has had to add support for code points, which are UTF-16 encoded and can be one or two 16-bit chars.
So making a byte a byte and char an unsigned 16-bit value made sense at the time.
BTW: If your JVM supports -XX:+UseCompressedStrings, it can use bytes instead of chars for Strings which only use 8-bit characters.
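As a rough sketch of what "one or two 16-bit chars" per code point means in practice (the class name CodePointDemo and the example character are just for illustration, on any standard JDK):

    // A Java char is a 16-bit UTF-16 code unit; code points above U+FFFF need two of them.
    public class CodePointDemo {                     // hypothetical demo class
        public static void main(String[] args) {
            String ascii = "A";                      // U+0041 fits in one char
            String clef  = "\uD834\uDD1E";           // U+1D11E (musical G clef) needs a surrogate pair

            System.out.println(Character.SIZE);      // 16 bits per char
            System.out.println(ascii.length());      // 1 char
            System.out.println(clef.length());       // 2 chars...
            System.out.println(clef.codePointCount(0, clef.length()));  // ...but 1 code point
            System.out.println(Character.charCount(0x1D11E));           // 2
            System.out.println(Character.isSurrogatePair(clef.charAt(0), clef.charAt(1))); // true
        }
    }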
Because Java uses Unicode, while C generally uses ASCII by default.
There are various flavours of Unicode encoding, but Java uses UTF-16, which uses either one or two 16-bit code units per character. ASCII always uses one byte per character.
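A minimal sketch of that size difference, assuming a standard JDK (the class name EncodingSizes is made up for the example):

    import java.nio.charset.StandardCharsets;

    // Bytes per character: ASCII uses one byte, UTF-16 uses one or two 16-bit code units.
    public class EncodingSizes {                     // hypothetical demo class
        public static void main(String[] args) {
            String s = "hello";                      // pure ASCII text

            System.out.println(s.getBytes(StandardCharsets.US_ASCII).length); // 5 bytes, 1 per character
            System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length); // 10 bytes, 2 per character
            System.out.println(s.length());          // 5 - String.length() counts 16-bit code units
        }
    }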
Java is a modern language that came up around the early Unicode era (the beginning of the '90s), so it supports Unicode by default as a first-class citizen, like most newer contemporary languages (Python, Visual Basic, JavaScript...), OSes (Windows, Symbian, BREW...) and frameworks/interfaces/specifications (Qt, NTFS, Joliet...). By the time those were designed, Unicode was a fixed 16-bit charset encoded in UCS-2, so it made sense for them to use 16-bit Unicode characters.
In contrast, C is an "ancient" language that was invented decades before Java, when Unicode was far from a thing. That was the age of 7-bit ASCII and 8-bit EBCDIC, so C uses an 8-bit char†, as that's enough for a char variable to contain all basic characters. When the Unicode era arrived, to refrain from breaking old code they introduced a different character type to C90, which is wchar_t. Again, this was the '90s, when Unicode began its life. In any case, char must continue to have the old size, because you still need to access individual bytes even if you use wider characters (Java, Python, VB... all have a separate byte type for this purpose).
Of course, later the Unicode Consortium quickly realized that 16 bits are not enough and had to fix it somehow. They widened the code-point range by changing UCS-2 to UTF-16, to avoid breaking old code that uses wide chars, and made Unicode a 21-bit charset (actually up to U+10FFFF instead of U+1FFFFF because of UTF-16). Unfortunately it was too late, and the old implementations that use 16-bit chars got stuck.
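A small sketch of that surrogate mechanism in Java terms (the printed values are standard Unicode; the class name is illustrative):

    // Shows how a code point above U+FFFF is represented as two 16-bit code units.
    public class SurrogateDemo {                     // hypothetical demo class
        public static void main(String[] args) {
            // Unicode now tops out at U+10FFFF rather than U+FFFF as in the UCS-2 days
            System.out.printf("max code point: U+%X%n", Character.MAX_CODE_POINT); // U+10FFFF

            int cp = 0x1F600;                        // grinning-face emoji
            char[] units = Character.toChars(cp);    // split into a high/low surrogate pair
            System.out.println(units.length);        // 2
            System.out.printf("high: U+%04X, low: U+%04X%n", (int) units[0], (int) units[1]);
            // prints high: U+D83D, low: U+DE00
        }
    }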
Later we saw the advent of UTF-8, which proved to be far superior to UTF-16 because it is independent of endianness, generally takes up less space, and, most importantly, requires no changes to the standard C string functions. Most user functions that receive a char* will continue to work without special Unicode support.
Unix systems are lucky because they migrated to Unicode later, when UTF-8 had already been introduced, and therefore continue to use 8-bit char. OTOH all modern Win32 APIs work on 16-bit wchar_t by default, because Windows was also an early adopter of Unicode. As a result, the .NET framework and C# go the same way, with char as a 16-bit type.
Talking about wchar_t, it was so unportable that both the C and C++ standards needed to introduce the new character types char16_t and char32_t in their 2011 revisions.
That said, most implementations are working on improving the wide-string situation. Java experimented with compressed strings in Java 6 and introduced compact strings in Java 9. Python moved to a more flexible internal representation, compared to the wchar_t* used before Python 3.3. Firefox and Chrome have separate internal 8-bit char representations for simple strings. There are also discussions on that for the .NET framework, and C# 11 supports UTF-8 string literals. More recently, Windows has been gradually introducing UTF-8 support for the old ANSI APIs.
† Strictly speaking, char in C is only required to have at least 8 bits. See "What platforms have something other than 8-bit char?"
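To make the "takes up less space" point from the previous answer concrete, here is a rough Java comparison (the example strings and the class name are just illustrative):

    import java.nio.charset.StandardCharsets;

    // Encoded sizes: UTF-8 keeps ASCII at one byte per character, UTF-16 needs at least two.
    public class Utf8VsUtf16 {                       // hypothetical demo class
        public static void main(String[] args) {
            String ascii = "GET /index.html HTTP/1.1";   // typical protocol text, all ASCII
            String mixed = "na\u00EFve caf\u00E9";        // "naïve café", with Latin-1 accents

            System.out.println(ascii.getBytes(StandardCharsets.UTF_8).length);    // 24
            System.out.println(ascii.getBytes(StandardCharsets.UTF_16BE).length); // 48

            // For non-ASCII text the gap narrows, but UTF-8 is still endianness-free
            System.out.println(mixed.getBytes(StandardCharsets.UTF_8).length);    // 12 (accents take 2 bytes each)
            System.out.println(mixed.getBytes(StandardCharsets.UTF_16BE).length); // 20
        }
    }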
A Java char is a UTF-16 code unit (a Unicode code point takes one or two of them), while C uses ASCII encoding in most cases.