wchar_t is 2 bytes in Visual Studio and stores UTF-16. How do Unicode-aware applications handle characters above U+FFFF?
Our company is planning to make our application Unicode-aware, and we are analyzing the problems we are likely to encounter.
In particular, our application relies heavily on string lengths, and we would like to use wchar_t as the base character type.
The problem arises with characters that must be stored as two 16-bit units in UTF-16, namely characters at U+10000 and above.
A simple example:
I have the UTF-8 string "蟂" (Unicode character U+87C2, in UTF-8: E8 9F 82).
So I wrote the following code:
#include <windows.h> // MultiByteToWideChar, CP_UTF8

const unsigned char my_utf8_string[] = { 0xe8, 0x9f, 0x82, 0x00 };
// compute size of wchar_t buffer (includes the terminating zero because the input length is -1).
int nb_chars = ::MultiByteToWideChar(CP_UTF8, // input is UTF8
    0, // no flags
    reinterpret_cast<const char *>(my_utf8_string), // input string (no worries about signedness)
    -1, // input is zero-terminated
    NULL, // no output this time
    0); // need the necessary buffer size
// allocate
wchar_t *my_utf16_string = new wchar_t[nb_chars];
// convert
nb_chars = ::MultiByteToWideChar(CP_UTF8,
    0,
    reinterpret_cast<const char *>(my_utf8_string),
    -1,
    my_utf16_string, // output buffer
    nb_chars); // allocated size
Okay, this works: it allocates two 16-bit units, and my wchar_t buffer contains { 0x87c2, 0x0000 }. If I store it in a std::wstring and compute the size, I get 1.
Now, let us take the character 𐒢 (U+104A2) as input, in UTF-8: F0 90 92 A2.
This time, it allocates space for three wchar_t, and std::wstring::size returns 2 even though I consider that I have only one character.
This is problematic. Let us assume that we receive data in UTF-8. We can count Unicode characters simply by not counting bytes of the form 10xxxxxx. We would like to import that data into an array of wchar_t to work with it. If we just allocate the number of characters plus one, it might seem safe... until someone uses a character above U+FFFF. Then our buffer will be too short and our application will crash.
So, for the same string encoded in different ways, functions that count characters in a string will return different values?
How are applications that work with Unicode strings designed to avoid this sort of annoyance?
Thank you for your replies.
Comments (2)
You have to accept that std::wstring::size does not give the number of characters. Instead, it gives you the number of code units. If you have 16-bit code units, it tells you how many of them are in the string. Computing the number of Unicode characters requires looping over the string. It won't be annoying anymore once you accept it.
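The loop mentioned above could look roughly like the following sketch. It is only an illustration, not a library routine: the function name count_code_points_utf16 is made up here, it assumes the input is well-formed UTF-16, and it uses std::u16string so that it behaves the same on platforms where wchar_t is not 16 bits wide. With Visual Studio's 16-bit wchar_t the same loop should apply to std::wstring.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a UTF-16 string.
// A code point above U+FFFF is stored as a surrogate pair (high + low),
// so every code unit is counted except the low surrogate of a pair.
// Assumes well-formed UTF-16 (no unpaired surrogates).
std::size_t count_code_points_utf16(const std::u16string& s) {
    std::size_t count = 0;
    for (char16_t unit : s) {
        // Low surrogates occupy 0xDC00..0xDFFF and never start a code point.
        if (unit < 0xDC00 || unit > 0xDFFF) {
            ++count;
        }
    }
    return count;
}
```

For the string holding only U+104A2, size() reports 2 code units while this helper reports 1 code point. Note that it still counts code points, not user-perceived characters.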
As for counting characters in UTF-8: don't. Instead, the code you posted is fine: calling MultiByteToWideChar once will tell you how many code units you need, and you then allocate the right number, whether it is for BMP characters or characters in the supplementary planes. If you absolutely want to write your own counting routines, have two of them: one that counts characters, and one that counts 16-bit code units. If the lead byte is 11110xxx, you need to count two code units for that character.
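As a rough sketch of the two counting routines described above, assuming well-formed UTF-8 input and no validation (the names utf8_code_points and utf8_utf16_units are invented for this example):

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a UTF-8 string.
// Every byte that is not a continuation byte (10xxxxxx) starts a code point.
std::size_t utf8_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char b : s) {
        if ((b & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

// Hypothetical helper: counts the UTF-16 code units needed to hold the
// same text, not including any terminating zero.
std::size_t utf8_utf16_units(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char b : s) {
        if ((b & 0xC0) == 0x80) {
            continue; // continuation byte, already accounted for
        }
        // A 11110xxx lead byte starts a 4-byte sequence, i.e. a code point
        // above U+FFFF, which needs a surrogate pair in UTF-16.
        n += ((b & 0xF8) == 0xF0) ? 2 : 1;
    }
    return n;
}
```

For E8 9F 82 both functions return 1. For F0 90 92 A2 the first returns 1 and the second returns 2, so a buffer sized from the second count plus one for the terminator would be large enough.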
I suggest you read the following FAQ from the official Unicode web site: http://www.unicode.org/faq//utf_bom.html
Basically, it is important to distinguish between code units, code points and characters.
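To make the distinction concrete, here is a small illustrative sketch (the variable names are invented for this example). It stores U+104A2 in three encoding forms, and adds a sequence in which two code points form what a reader would see as one character:

```cpp
#include <string>

// U+104A2 in three encoding forms: always one code point,
// but a different number of code units each time.
const std::string    sample_utf8  = "\xF0\x90\x92\xA2"; // 4 code units (bytes)
const std::u16string sample_utf16 = u"\U000104A2";      // 2 code units (surrogate pair)
const std::u32string sample_utf32 = U"\U000104A2";      // 1 code unit, equal to the code point count

// "e" followed by U+0301 COMBINING ACUTE ACCENT: two code points
// that are displayed as the single character "é".
const std::u32string e_acute = U"e\u0301";
```

Calling size() on these strings gives 4, 2, 1 and 2 respectively, so none of the encodings gives a "character count" for free.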