Character size in Java vs. C
Why do characters in Java need twice as much space to store as characters in C?
In Java characters are 16-bit, and in C they are 8-bit.
A more general question is why is this so?
To find out why, you need to look at the history and come to conclusions/opinions on the subject.
When C was developed in the USA, ASCII was pretty standard there and you only really needed 7 bits, but with 8 you could handle some non-ASCII characters as well. That seemed like more than enough. Many text-based protocols like SMTP (email), XML and FIX still only use ASCII characters. Email and XML encode non-ASCII characters. Binary files, sockets and streams are still natively 8-bit bytes.
BTW: C can support wider characters, but that is not plain char.
When Java was developed, 16 bits seemed like enough to support most languages. Since then Unicode has been extended to characters above 65535, and Java has had to add support for code points, which are UTF-16 encoded and can be one or two 16-bit chars.
So making a byte a byte and char an unsigned 16-bit value made sense at the time.
BTW: If your JVM supports -XX:+UseCompressedStrings, it can use bytes instead of chars for Strings which only use 8-bit characters.
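As a rough sketch of what "one or two 16-bit chars" per code point means in practice (the class name CodePointDemo and the example character are just for illustration, on any standard JDK):

    // A Java char is a 16-bit UTF-16 code unit; code points above U+FFFF need two of them.
    public class CodePointDemo {                     // hypothetical demo class
        public static void main(String[] args) {
            String ascii = "A";                      // U+0041 fits in one char
            String clef  = "\uD834\uDD1E";           // U+1D11E (musical G clef) needs a surrogate pair

            System.out.println(Character.SIZE);      // 16 bits per char
            System.out.println(ascii.length());      // 1 char
            System.out.println(clef.length());       // 2 chars...
            System.out.println(clef.codePointCount(0, clef.length()));  // ...but 1 code point
            System.out.println(Character.charCount(0x1D11E));           // 2
            System.out.println(Character.isSurrogatePair(clef.charAt(0), clef.charAt(1))); // true
        }
    }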
Because Java uses Unicode, while C generally uses ASCII by default.
There are various flavours of Unicode encoding, but Java uses UTF-16, which uses either one or two 16-bit code units per character. ASCII always uses one byte per character.
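A minimal sketch of that size difference, assuming a standard JDK (the class name EncodingSizes is made up for the example):

    import java.nio.charset.StandardCharsets;

    // Bytes per character: ASCII uses one byte, UTF-16 uses one or two 16-bit code units.
    public class EncodingSizes {                     // hypothetical demo class
        public static void main(String[] args) {
            String s = "hello";                      // pure ASCII text

            System.out.println(s.getBytes(StandardCharsets.US_ASCII).length); // 5 bytes, 1 per character
            System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length); // 10 bytes, 2 per character
            System.out.println(s.length());          // 5 - String.length() counts 16-bit code units
        }
    }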
Java is a modern language that came up around the early Unicode era (the beginning of the '90s), so it supports Unicode by default as a first-class citizen, like most newer contemporary languages (Python, Visual Basic, JavaScript...), OSes (Windows, Symbian, BREW...) and frameworks/interfaces/specifications (Qt, NTFS, Joliet...). By the time those were designed, Unicode was a fixed 16-bit charset encoded in UCS-2, so it made sense for them to use 16-bit Unicode characters.
In contrast, C is an "ancient" language that was invented decades before Java, when Unicode was far from a thing. That was the age of 7-bit ASCII and 8-bit EBCDIC, so C uses an 8-bit char†, as that's enough for a char variable to contain all basic characters. When the Unicode era arrived, to refrain from breaking old code they introduced a different character type to C90, which is wchar_t. Again, this was the '90s, when Unicode began its life. In any case, char must continue to have the old size, because you still need to access individual bytes even if you use wider characters (Java, Python, VB... all have a separate byte type for this purpose).
Of course, later the Unicode Consortium quickly realized that 16 bits are not enough and had to fix it somehow. They widened the code-point range by changing UCS-2 to UTF-16, to avoid breaking old code that uses wide chars, and made Unicode a 21-bit charset (actually up to U+10FFFF instead of U+1FFFFF because of UTF-16). Unfortunately it was too late, and the old implementations that use 16-bit chars got stuck.
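A small sketch of that surrogate mechanism in Java terms (the printed values are standard Unicode; the class name is illustrative):

    // Shows how a code point above U+FFFF is represented as two 16-bit code units.
    public class SurrogateDemo {                     // hypothetical demo class
        public static void main(String[] args) {
            // Unicode now tops out at U+10FFFF rather than U+FFFF as in the UCS-2 days
            System.out.printf("max code point: U+%X%n", Character.MAX_CODE_POINT); // U+10FFFF

            int cp = 0x1F600;                        // grinning-face emoji
            char[] units = Character.toChars(cp);    // split into a high/low surrogate pair
            System.out.println(units.length);        // 2
            System.out.printf("high: U+%04X, low: U+%04X%n", (int) units[0], (int) units[1]);
            // prints high: U+D83D, low: U+DE00
        }
    }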
Later we saw the advent of UTF-8, which proved to be far superior to UTF-16 because it is independent of endianness, generally takes up less space, and, most importantly, requires no changes to the standard C string functions. Most user functions that receive a char* will continue to work without special Unicode support.
Unix systems are lucky because they migrated to Unicode later, when UTF-8 had already been introduced, and therefore continue to use 8-bit char. OTOH all modern Win32 APIs work on 16-bit wchar_t by default, because Windows was also an early adopter of Unicode. As a result, the .NET framework and C# go the same way, with char as a 16-bit type.
Talking about wchar_t, it was so unportable that both the C and C++ standards needed to introduce the new character types char16_t and char32_t in their 2011 revisions.
That said, most implementations are working on improving the wide-string situation. Java experimented with compressed strings in Java 6 and introduced compact strings in Java 9. Python moved to a more flexible internal representation, compared to the wchar_t* used before Python 3.3. Firefox and Chrome have separate internal 8-bit char representations for simple strings. There are also discussions on that for the .NET framework, and C# 11 supports UTF-8 string literals. More recently, Windows has been gradually introducing UTF-8 support for the old ANSI APIs.
† Strictly speaking, char in C is only required to have at least 8 bits. See "What platforms have something other than 8-bit char?"
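To make the "takes up less space" point from the previous answer concrete, here is a rough Java comparison (the example strings and the class name are just illustrative):

    import java.nio.charset.StandardCharsets;

    // Encoded sizes: UTF-8 keeps ASCII at one byte per character, UTF-16 needs at least two.
    public class Utf8VsUtf16 {                       // hypothetical demo class
        public static void main(String[] args) {
            String ascii = "GET /index.html HTTP/1.1";   // typical protocol text, all ASCII
            String mixed = "na\u00EFve caf\u00E9";        // "naïve café", with Latin-1 accents

            System.out.println(ascii.getBytes(StandardCharsets.UTF_8).length);    // 24
            System.out.println(ascii.getBytes(StandardCharsets.UTF_16BE).length); // 48

            // For non-ASCII text the gap narrows, but UTF-8 is still endianness-free
            System.out.println(mixed.getBytes(StandardCharsets.UTF_8).length);    // 12 (accents take 2 bytes each)
            System.out.println(mixed.getBytes(StandardCharsets.UTF_16BE).length); // 20
        }
    }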
A Java char is a UTF-16 code unit (a Unicode code point takes one or two of them), while C uses ASCII encoding in most cases.