Why isn't wchar_t widely used in code for Linux and related platforms?

Posted 2024-10-10 07:53:42

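This intrigues me, so I'm going to ask: why is wchar_t not used as widely on Linux and Linux-like systems as it is on Windows? Specifically, the Windows API uses wchar_t internally, whereas I believe Linux does not, and this is reflected in the many open source packages that use char types.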

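My understanding is that, given a character c that needs multiple bytes to represent it, c is split across several elements in a char[], whereas it forms a single unit in a wchar_t[]. Is it not easier, then, to always use wchar_t? Have I missed a technical reason that negates this difference? Or is it just an adoption problem?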

苏佲洛 2024-10-17 07:53:42

wchar_t is a wide character type with platform-defined width, which doesn't really help much.

UTF-8 encodes each character in 1 to 4 bytes. UCS-2, which uses exactly 2 bytes per character, is now obsolete and cannot represent the full Unicode character set.

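Linux applications that support Unicode tend to do so properly, above the byte-wise storage layer. Windows applications tend to make the silly assumption that two bytes are always enough.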

The Wikipedia article on wchar_t briefly touches on this.

顾忌 2024-10-17 07:53:42

The first people to use UTF-8 on a Unix-based platform explained:

"The Unicode Standard [then at version 1.1] defines an adequate character set but an unreasonable representation [UCS-2]. It states that all characters are 16 bits wide [no longer true] and are communicated and stored in 16-bit units. It also reserves a pair of characters (hexadecimal FFFE and FEFF) to detect byte order in transmitted text, requiring state in the byte stream. (The Unicode Consortium was thinking of files, not pipes.) To adopt this encoding, we would have had to convert all text going into and out of Plan 9 between ASCII and Unicode, which cannot be done. Within a single program, in command of all its input and output, it is possible to define characters as 16-bit quantities; in the context of a networked system with hundreds of applications on diverse machines by different manufacturers [italics mine], it is impossible."

The italicized part (diverse machines by different manufacturers) is less relevant to Windows systems, which lean towards monolithic applications (Microsoft Office), non-diverse machines (everything is an x86 and thus little-endian), and a single OS vendor.

And the Unix philosophy of having small, single-purpose programs means fewer of them need to do serious character manipulation.

"The source for our tools and applications had already been converted to work with Latin-1, so it was ‘8-bit safe’, but the conversion to the Unicode Standard and UTF[-8] is more involved. Some programs needed no change at all: cat, for instance, interprets its argument strings, delivered in UTF[-8], as file names that it passes uninterpreted to the open system call, and then just copies bytes from its input to its output; it never makes decisions based on the values of the bytes... Most programs, however, needed modest change. ...Few tools actually need to operate on runes [Unicode code points] internally; more typically they need only to look for the final slash in a file name and similar trivial tasks. Of the 170 C source programs... only 23 now contain the word Rune. The programs that do store runes internally are mostly those whose raison d’être is character manipulation: sam (the text editor), sed, sort, tr, troff, 8½ (the window system and terminal emulator), and so on. To decide whether to compute using runes or UTF-encoded byte strings requires balancing the cost of converting the data when read and written against the cost of converting relevant text on demand. For programs such as editors that run a long time with a relatively constant dataset, runes are the better choice..."

UTF-32, with code points directly accessible, is indeed more convenient if you need character properties like categories and case mappings.

But widechars are awkward to use on Linux for the same reason that UTF-8 is awkward to use on Windows. GNU libc has no _wfopen or _wstat function.

瀟灑尐姊 2024-10-17 07:53:42

UTF-8, being compatible with ASCII, makes it possible to ignore Unicode somewhat.

Often, programs don't care (and in fact, don't need to care) about what the input is, as long as there is not a \0 that could terminate strings. See:

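#include <stdio.h>
int main(void) {
    char buf[256]; /* any size works; fgets never writes past it */
    printf("Your favorite pizza topping is which?\n");
    if (fgets(buf, sizeof buf, stdin)) /* e.g. Jalapeños, read as plain bytes */
        printf("%s it shall be.\n", buf);
    return 0;
}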

The only time I found I needed Unicode support was when I had to treat a multibyte character as a single unit (wchar_t), e.g. when having to count the number of characters in a string rather than bytes. iconv from UTF-8 to wchar_t will quickly do that. For bigger issues like zero-width spaces and combining diacritics, something heavier like ICU is needed, but how often do you do that anyway?
