How does Windows wchar_t handle Unicode characters outside the Basic Multilingual Plane?

Posted 2024-12-11 22:10:10

I've looked at a number of other posts here and elsewhere (see below), but I still don't have a clear answer to this question: how does Windows wchar_t handle Unicode characters outside the Basic Multilingual Plane?

That is:

So what does Windows do when you want to encode something like the Han character 𠂊 (U+2008A)?

Comments (2)

晨与橙与城 2024-12-18 22:10:10

The implementation of wchar_t under the Windows stdlib is UTF-16-oblivious: it knows only about 16-bit code units.

So you can put a UTF-16 surrogate sequence in a string, and you can choose to treat that as a single character using higher level processing. The string implementation won't do anything to help you, nor to hinder you; it will let you include any sequence of code units in your string, even ones that would be invalid when interpreted as UTF-16.
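To make the "surrogate sequence" concrete, here is a minimal sketch of how a code point such as U+2008A ends up as two 16-bit code units. The helper names `append_utf16` and `encode_utf16` are made up for illustration and are not Windows or standard-library APIs, and `char16_t` is used as a stand-in for the 16-bit Windows `wchar_t` so the sketch behaves the same on other platforms.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not a Windows or standard-library API): append the
// UTF-16 encoding of one Unicode scalar value to a string of 16-bit units.
// char16_t stands in for the 16-bit wchar_t used on Windows.
inline void append_utf16(std::u16string& out, char32_t cp) {
    // Surrogate code points and values above U+10FFFF cannot be encoded.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x10000) {
        // Inside the BMP: one code unit, numerically equal to the code point.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Outside the BMP: split the 20-bit offset into two 10-bit halves.
        const char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
    }
}

// Hypothetical convenience wrapper: encode a single code point.
inline std::u16string encode_utf16(char32_t cp) {
    std::u16string s;
    append_utf16(s, cp);
    return s;
}
```

With this sketch, `encode_utf16(0x2008A)` yields the two code units 0xD840 0xDC8A, so the string's `size()` reports 2 even though a reader sees one character. That is the sense in which the string type knows only about code units.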

Many of the higher-level features of Windows do support characters made out of UTF-16 surrogates, which is why you can name a file 𐐀.txt (U+10400) and see it both render correctly and edit correctly (taking a single keypress, not two, to move past the character) in programs like Explorer that support complex text layout (typically using Windows's Uniscribe library).

But there are still places where you can see the UTF-16-obliviousness shining through, such as the fact that you can create a file called 𐐨.txt (U+10428) in the same folder as 𐐀.txt (U+10400), where case-insensitivity would otherwise disallow it, or the fact that you can create [U+DC01][U+D801].txt, a name made of two surrogates in the wrong order, programmatically.
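If you want to reject names like [U+DC01][U+D801] yourself, you have to check well-formedness in your own code, since the string type will not. A minimal sketch, assuming `char16_t` as a stand-in for the 16-bit Windows `wchar_t`; the function names are hypothetical, not part of any Windows API:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helpers: classify a single 16-bit code unit.
inline bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool is_low_surrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Returns true only if every high surrogate is immediately followed by a
// low surrogate and no low surrogate appears on its own.
inline bool is_well_formed_utf16(const std::u16string& s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (is_high_surrogate(s[i])) {
            // A high surrogate must have a low surrogate right after it.
            if (i + 1 >= s.size() || !is_low_surrogate(s[i + 1])) return false;
            ++i;  // skip the low half of the pair
        } else if (is_low_surrogate(s[i])) {
            return false;  // low surrogate without a preceding high surrogate
        }
    }
    return true;
}
```

Under this check the pair 0xD801 0xDC01 passes, while the reversed sequence 0xDC01 0xD801 from the file-name example is rejected.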

This is how pedants can have a nice long and basically meaningless argument about whether Windows “supports” UTF-16 strings or only UCS-2.

听风念你 2024-12-18 22:10:10

Windows used to use UCS-2 but adopted UTF-16 with Windows 2000. Windows wchar_t APIs now produce and consume UTF-16.

Not all third party programs handle this correctly and so may be buggy with data outside the BMP.

Also note that UTF-16, being a variable-length encoding, does not conform to the C or C++ requirements for an encoding used with wchar_t. This causes some problems: standard functions that take a single wchar_t, such as wctomb, can't handle characters beyond the BMP on Windows, and Windows defines some additional functions that use a wider type in order to be able to handle single characters outside the BMP. I forget which function it was, but I ran into a Windows function that returned int instead of wchar_t (and it wasn't one where EOF was a possible result).
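The need for a wider type is easy to see if you decode by hand. This is only a sketch: `next_code_point` and `count_code_points` are hypothetical names, `char16_t` stands in for the 16-bit Windows `wchar_t`, and returning U+FFFD for an unpaired surrogate is one possible policy, not something Windows prescribes.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: read one code point starting at index i (which must
// be less than s.size()) and advance i past it. The result is a char32_t
// because a value such as U+2008A does not fit in one 16-bit unit.
inline char32_t next_code_point(const std::u16string& s, std::size_t& i) {
    const char16_t u = s[i++];
    if (u >= 0xD800 && u <= 0xDBFF && i < s.size()) {
        const char16_t lo = s[i];
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            ++i;  // consume the low surrogate as well
            return 0x10000 + ((char32_t(u) - 0xD800) << 10) + (char32_t(lo) - 0xDC00);
        }
    }
    // Unpaired surrogate: substitute U+FFFD (an assumed policy).
    if (u >= 0xD800 && u <= 0xDFFF) return 0xFFFD;
    return u;  // ordinary BMP code unit
}

// Counts code points rather than code units.
inline std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++n) next_code_point(s, i);
    return n;
}
```

For the three code units 0x0041 0xD840 0xDC8A, `size()` is 3 but `count_code_points` returns 2, and the second call to `next_code_point` returns 0x2008A, a value no single 16-bit `wchar_t` can hold.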
