Why is wchar_t not widely used in code on Linux/related platforms?
This intrigues me, so I'm going to ask - for what reason is wchar_t not used so widely on Linux/Linux-like systems as it is on Windows? Specifically, the Windows API uses wchar_t internally, whereas I believe Linux does not, and this is reflected in a number of open source packages using char types.
My understanding is that, given a character c which requires multiple bytes to represent it, in a char[] form c is split over several parts of a char*, whereas it forms a single unit in a wchar_t[]. Is it not easier, then, to use wchar_t always? Have I missed a technical reason that negates this difference? Or is it just an adoption problem?
Comments (4)
wchar_t
is a wide character with platform-defined width, which doesn't really help much. UTF-8 characters span 1-4 bytes per character. UCS-2, which spans exactly 2 bytes per character, is now obsolete and can't represent the full Unicode character set.
Linux applications that support Unicode tend to do so properly, above the byte-wise storage layer. Windows applications tend to make this silly assumption that only two bytes will do.
wchar_t's Wikipedia article briefly touches on this.
The first people to use UTF-8 on a Unix-based platform explained:
The italicized part is less relevant to Windows systems, which have a preference towards monolithic applications (Microsoft Office), non-diverse machines (everything's an x86 and thus little-endian), and a single OS vendor.
And the Unix philosophy of having small, single-purpose programs means fewer of them need to do serious character manipulation.
UTF-32, with code points directly accessible, is indeed more convenient if you need character properties like categories and case mappings.
But widechars are awkward to use on Linux for the same reason that UTF-8 is awkward to use on Windows. GNU libc has no _wfopen or _wstat function.
UTF-8, being compatible with ASCII, makes it possible to ignore Unicode somewhat.
Often, programs don't care (and in fact, don't need to care) about what the input is, as long as there is not a \0 that could terminate strings. See:
The only times when I found I needed Unicode support is when I had to have a multibyte character as a single unit (wchar_t); e.g. when having to count the number of characters in a string, rather than bytes. iconv from utf-8 to wchar_t will quickly do that. For bigger issues like zero-width spaces and combining diacritics, something more heavy like icu is needed—but how often do you do that anyway?
wchar_t
is not the same size on all platforms. On Windows it is a UTF-16 code unit that uses two bytes. On other platforms it typically uses 4 bytes (for UCS-4/UTF-32). It is therefore unlikely that these platforms would standardize on using wchar_t, since it would waste a lot of space.