wchar_t is 2 bytes in Visual Studio and stores UTF-16. How do Unicode-aware applications handle characters above U+FFFF?

Posted on 2024-10-06 14:01:39

At our company, we are planning to make our application Unicode-aware, and we are analyzing what problems we are going to encounter.

In particular, our application will rely heavily on the lengths of strings, and we would like to use wchar_t as the base character type.

The problem arises when dealing with characters that must be stored as two 16-bit units in UTF-16, namely characters at U+10000 and above.

Simple example:

I have the UTF-8 string "蟂" (Unicode character U+87C2, in UTF-8: E8 9F 82)

So, I set the following code:

const unsigned char my_utf8_string[] = { 0xe8, 0x9f, 0x82, 0x00 };

// compute size of wchar_t buffer (with -1 as input length, the count includes the terminating null).
int nb_chars = ::MultiByteToWideChar(CP_UTF8,                                        // input is UTF8
                                     0,                                              // no flags
                                     reinterpret_cast<const char *>(my_utf8_string), // input string (no worries about signedness)
                                     -1,                                             // input is zero-terminated
                                     NULL,                                           // no output this time
                                     0);                                             // need the necessary buffer size

// allocate
wchar_t *my_utf16_string = new wchar_t[nb_chars];

// convert
nb_chars = ::MultiByteToWideChar(CP_UTF8,
                                 0,
                                 reinterpret_cast<const char *>(my_utf8_string),
                                 -1,
                                 my_utf16_string, // output buffer
                                 nb_chars);       // allocated size

Okay, this works: it allocates two 16-bit units, and my wchar_t buffer contains { 0x87c2, 0x0000 }. If I store it in a std::wstring and compute the size, I get 1.

Now, let us take the character 𐒢 (U+104A2) as input, in UTF-8: F0 90 92 A2.

This time, it allocates space for three wchar_t and std::wstring::size returns 2 even though I consider that I only have one character.
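
To see where the second code unit comes from, here is a minimal sketch of the UTF-16 encoding rule. The helper name append_utf16 is hypothetical (it is not from the post), and it uses std::u16string rather than std::wstring so that the code units are 16 bits on every platform; with Visual Studio's 2-byte wchar_t the arithmetic would be the same.

```cpp
#include <string>

// Hypothetical helper (not part of the original post): appends the UTF-16
// encoding of one Unicode code point to `out`.
// Assumes `cp` is a valid scalar value (at most 0x10FFFF, not a surrogate).
void append_utf16(std::u16string &out, char32_t cp)
{
    if (cp < 0x10000) {
        // BMP character: a single 16-bit code unit.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Supplementary character: subtract 0x10000, then split the
        // remaining 20 bits into two 10-bit halves.
        const char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));   // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF))); // low surrogate
    }
}
```

For U+87C2 this appends the single unit 0x87C2. For U+104A2 it appends the surrogate pair 0xD801 0xDCA2, which is why size() reports 2 and the buffer needs three units once the terminator is included.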

This is problematic. Let us assume that we receive data in UTF-8. We can count Unicode characters simply by not counting the continuation bytes, those of the form 10xxxxxx. We would like to import that data into an array of wchar_t to work with it. If we just allocate the number of characters plus one, it might seem safe... until someone uses a character above U+FFFF. Then our buffer will be too short and our application will crash.

So, with the same string, encoded in different ways, functions that count characters in a string will return different values?

How are applications that work with Unicode strings designed in order to avoid this sort of annoyance?

Thank you for your replies.

迎风吟唱 2024-10-13 14:01:39

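You have to accept that std::wstring::size does not give the number of characters. Instead, it gives you the number of code units. If you have 16-bit code units, it tells you how many of them the string contains. Computing the number of Unicode characters requires looping over the string. Once you have accepted that, it stops being annoying.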
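
The loop can be quite short. What follows is only a sketch under stated assumptions: count_code_points is a hypothetical name, the input is assumed to be well-formed UTF-16, and std::u16string stands in for std::wstring so that the example behaves the same on platforms where wchar_t is wider than 16 bits.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a UTF-16 string.
// Assumes well-formed input (every low surrogate follows a high surrogate).
std::size_t count_code_points(const std::u16string &s)
{
    std::size_t count = 0;
    for (char16_t unit : s) {
        // A low surrogate (0xDC00..0xDFFF) is the second half of a pair,
        // so it does not start a new code point.
        if (unit < 0xDC00 || unit > 0xDFFF) {
            ++count;
        }
    }
    return count;
}
```

With the string from the question, std::u16string(u"\U000104A2").size() is 2 while count_code_points(u"\U000104A2") is 1. Note that this counts code points, which is still not the same thing as user-perceived characters when combining marks are involved.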

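As for counting characters in UTF-8: don't. The code you posted is fine as it is: calling MultiByteToWideChar once tells you how many code units you need, and you then allocate the right number, whether the characters are in the BMP or in the supplementary planes. If you absolutely want to write your own counting routines, write two of them: one that counts characters, and one that counts 16-bit code units. If the lead byte is 11110xxx, you need to count two code units.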
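
If you do go the hand-written route, the two routines could look roughly like this. Both function names are hypothetical, and both assume the input is well-formed UTF-8; they do no validation, which real code receiving external data would need.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in well-formed UTF-8.
// Every byte that is not a continuation byte (10xxxxxx) starts a code point.
std::size_t utf8_code_points(const std::string &utf8)
{
    std::size_t n = 0;
    for (unsigned char b : utf8) {
        if ((b & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

// Hypothetical helper: counts the 16-bit code units needed to hold the same
// text in UTF-16, not including a terminating null.
std::size_t utf8_to_utf16_units(const std::string &utf8)
{
    std::size_t n = 0;
    for (unsigned char b : utf8) {
        if ((b & 0xC0) == 0x80) {
            continue; // continuation byte, already accounted for by its lead byte
        }
        // A 11110xxx lead byte starts a 4-byte sequence, i.e. a code point
        // above U+FFFF, which needs a surrogate pair in UTF-16.
        n += ((b & 0xF8) == 0xF0) ? 2 : 1;
    }
    return n;
}
```

For the bytes F0 90 92 A2, utf8_code_points returns 1 and utf8_to_utf16_units returns 2, so a buffer of utf8_to_utf16_units(...) + 1 code units has room for the terminator. That matches the three wchar_t reported in the question.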
