How does Windows wchar_t handle Unicode characters outside the Basic Multilingual Plane?
I've looked at a number of other posts here and elsewhere (see below), but I still don't have a clear answer to this question: how does Windows wchar_t handle Unicode characters outside the Basic Multilingual Plane?
That is:
- Many programmers seem to feel that UTF-16 is harmful because it is a variable-length encoding.
- wchar_t is 16 bits wide on Windows, but 32 bits wide on Unix/macOS.
- The Windows APIs use wide characters, not Unicode.
So what does Windows do when you want to code something like the Han character 𠂊 (U+2008A)?
Comments (2)
The implementation of wchar_t under the Windows stdlib is UTF-16-oblivious: it knows only about 16-bit code units. So you can put a UTF-16 surrogate sequence in a string, and you can choose to treat that as a single character using higher-level processing. The string implementation won't do anything to help you, nor to hinder you; it will let you include any sequence of code units in your string, even ones that would be invalid when interpreted as UTF-16.
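To make the idea of a surrogate sequence concrete, here is a minimal sketch of how a code point outside the BMP splits into two 16-bit code units. It uses char16_t rather than wchar_t so that it behaves the same on every platform; Utf16Units and encode_utf16 are hypothetical names made up for illustration, not part of Windows or the standard library.

```cpp
// Hypothetical helper type (not a Windows or standard-library API):
// holds the one or two UTF-16 code units that encode a code point.
struct Utf16Units {
    char16_t units[2];
    int count;  // 1 for BMP code points, 2 for supplementary ones
};

// Encode a Unicode code point as UTF-16.
// Assumes cp <= 0x10FFFF and that cp is not itself in U+D800..U+DFFF.
constexpr Utf16Units encode_utf16(char32_t cp) {
    if (cp < 0x10000) {
        // BMP: the code point is its own single code unit.
        return {{static_cast<char16_t>(cp), 0}, 1};
    }
    // Supplementary planes: split the 20-bit offset into two 10-bit halves.
    const char32_t v = cp - 0x10000;
    return {{static_cast<char16_t>(0xD800 + (v >> 10)),      // high surrogate
             static_cast<char16_t>(0xDC00 + (v & 0x3FF))},   // low surrogate
            2};
}
```

For the question's character, encode_utf16(0x2008A) yields the pair 0xD840 0xDC8A, which a Windows wchar_t string stores as two separate elements.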
Many of the higher-level features of Windows do support characters made out of UTF-16 surrogates, which is why you can give a file a name containing a non-BMP character and see it both render correctly and edit correctly (taking a single keypress, not two, to move past the character) in programs like Explorer that support complex text layout (typically using Windows's Uniscribe library).
But there are still places where you can see the UTF-16-obliviousness shining through, such as the fact that you can create two files in the same folder whose names differ only in the case of a non-BMP letter, where case-insensitivity would otherwise disallow it, or the fact that you can create [U+DC01][U+D801].txt programmatically.
This is how pedants can have a nice long and basically meaningless argument about whether Windows “supports” UTF-16 strings or only UCS-2.
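Because the string type itself never checks, any validation has to be done by the caller. The following is a rough sketch of such a check, written against char16_t so it is portable. The function names and the two sample arrays are assumptions introduced here for illustration; this is not how Windows itself validates names.

```cpp
#include <cstddef>

constexpr bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
constexpr bool is_low_surrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Returns true if the n code units starting at s are well-formed UTF-16:
// every high surrogate is immediately followed by a low surrogate, and
// no low surrogate appears on its own.
constexpr bool is_well_formed_utf16(const char16_t* s, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        if (is_high_surrogate(s[i])) {
            if (i + 1 >= n || !is_low_surrogate(s[i + 1])) {
                return false;  // high surrogate with no partner
            }
            ++i;  // skip the low surrogate we just consumed
        } else if (is_low_surrogate(s[i])) {
            return false;  // low surrogate without a preceding high one
        }
    }
    return true;
}

// Illustrative file names as raw code units (no terminating null).
constexpr char16_t paired[]   = {0xD801, 0xDC01, u'.', u't', u'x', u't'};
constexpr char16_t reversed[] = {0xDC01, 0xD801, u'.', u't', u'x', u't'};
```

Here is_well_formed_utf16(paired, 6) is true, while is_well_formed_utf16(reversed, 6) is false: the reversed sequence corresponds to the [U+DC01][U+D801].txt name above, which a code-unit string stores without complaint.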
Windows used to use UCS-2 but adopted UTF-16 with Windows 2000. Windows wchar_t APIs now produce and consume UTF-16.
Not all third-party programs handle this correctly, so they may be buggy with data outside the BMP.
Also, note that UTF-16, being a variable-length encoding, does not conform to the C or C++ requirements for an encoding used with wchar_t. This causes some problems: standard functions that take a single wchar_t, such as wctomb, can't handle characters beyond the BMP on Windows, and Windows defines some additional functions that use a wider type in order to handle single characters outside the BMP. I forget which function it was, but I ran into a Windows function that returned int instead of wchar_t (and it wasn't one where EOF was a possible result).
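A small sketch of why a wider type is needed: a character beyond the BMP only exists once two 16-bit units are combined, so the result has to go into something larger, such as char32_t. The names combine_surrogates, count_code_points and sample are made up for this example; this is not the Windows function mentioned above.

```cpp
#include <cstddef>

// Combine a high/low surrogate pair into the code point it encodes.
// Assumes high is in 0xD800..0xDBFF and low is in 0xDC00..0xDFFF.
constexpr char32_t combine_surrogates(char16_t high, char16_t low) {
    return 0x10000
         + ((static_cast<char32_t>(high) - 0xD800) << 10)
         + (static_cast<char32_t>(low) - 0xDC00);
}

// Count code points in a well-formed UTF-16 sequence of n code units.
// A low surrogate is the second half of a pair, so it is not counted.
constexpr std::size_t count_code_points(const char16_t* s, std::size_t n) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (s[i] < 0xDC00 || s[i] > 0xDFFF) {
            ++count;
        }
    }
    return count;
}

// U+2008A followed by 'a': three code units, two code points.
constexpr char16_t sample[] = {0xD840, 0xDC8A, u'a'};
```

With these definitions, combine_surrogates(0xD840, 0xDC8A) is 0x2008A, a value that does not fit in a 16-bit wchar_t, and count_code_points(sample, 3) is 2 even though the array has three elements.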