C++ UTF-8 output with ICU
I'm struggling to get started with the C++ ICU library. I have tried to get the simplest example to work, but even that has failed. I would just like to output a UTF-8 string and then go from there.
Here is what I have:
#include <unicode/unistr.h>
#include <unicode/ustream.h>
#include <iostream>

int main()
{
    UnicodeString s = UNICODE_STRING_SIMPLE("привет");
    std::cout << s << std::endl;
    return 0;
}
Here is the output:
$ g++ -I/sw/include -licucore -Wall -Werror -o icu_test main.cpp
$ ./icu_test
пÑивеÑ
My terminal and font support UTF-8 and I regularly use the terminal with UTF-8. My source code is in UTF-8.
I think that perhaps I somehow need to set the output stream to UTF-8 because ICU stores strings as UTF-16, but I'm really not sure and I would have thought that the operators provided by ustream.h would do that anyway.
Any help would be appreciated, thank you.
Comments (3)
Your program will work if you just change the initializer to:
The macro you were using is only for strings that contain "invariant characters", i.e., only latin letters, digits, and some punctuation.
As was said before, input/output codepages are tricky. You said:
That may be true, but ICU doesn't know that's true. The process codepage might be different (let's say iso-8859-1), and the output codepage may be different (let's say shift-jis). Then, the program wouldn't work. But, the invariant characters using the API UNICODE_STRING_SIMPLE would still work.
Hope this helps.
srl, icu dev
What happens if you write the output to a file (either by redirecting it from the terminal or by opening a file stream in the program itself)?
That would determine whether or not it is the terminal that fails to handle the output correctly.
What happens if you inspect the output string in the debugger? Does it contain the correct values? Find out what the UTF-8 encoding of your string should look like, and compare it against what you get in the debugger. Or print out the integral value of each byte, and verify that those are correct.
When working with encoding it is always tricky (but essential) to determine whether the problem lies in your program itself or in the conversion that happens when the text is output to the system. Take the terminal out of the equation and verify that your program generates the correct output.
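The byte-by-byte check suggested above needs nothing beyond standard C++. A minimal sketch, assuming a hypothetical helper named hex_dump (it is not part of ICU or of the original posts):

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: render each byte of `bytes` as two uppercase
// hex digits, separated by single spaces.
std::string hex_dump(const std::string& bytes) {
    std::string out;
    char buf[3];  // two hex digits plus the terminating NUL
    for (unsigned char c : bytes) {
        std::snprintf(buf, sizeof buf, "%02X", static_cast<unsigned>(c));
        if (!out.empty()) {
            out += ' ';
        }
        out += buf;
    }
    return out;
}
```

For a correctly encoded "привет", the UTF-8 bytes are D0 BF D1 80 D0 B8 D0 B2 D0 B5 D1 82. If the dump of the program's output differs from that, the problem is in the program rather than in the terminal.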
operator<<(ostream, UnicodeString) converts between UTF-16 and chars by using ICU's "default converter". AFAIU, the "default converter" (if you don't set it explicitly with ucnv_setDefaultName()) depends on the platform and the way ICU was compiled. What do you get from ucnv_getDefaultName()?