C++ UTF-8 output with ICU

Posted 2024-08-30 09:43:54

I'm struggling to get started with the C++ ICU library. I have tried to get the simplest example to work, but even that has failed. I would just like to output a UTF-8 string and then go from there.

Here is what I have:

#include <unicode/unistr.h>
#include <unicode/ustream.h>

#include <iostream>

int main()
{
    UnicodeString s = UNICODE_STRING_SIMPLE("привет");

    std::cout << s << std::endl;

    return 0;
}

Here is the output:

$ g++ -I/sw/include -licucore -Wall -Werror -o icu_test main.cpp 
$ ./icu_test 
Ð¿ÑÐ¸Ð²ÐµÑ

My terminal and font support UTF-8 and I regularly use the terminal with UTF-8. My source code is in UTF-8.

I think that perhaps I somehow need to set the output stream to UTF-8 because ICU stores strings as UTF-16, but I'm really not sure and I would have thought that the operators provided by ustream.h would do that anyway.

Any help would be appreciated, thank you.

魂牵梦绕锁你心扉 2024-09-06 09:43:54

Your program will work if you just change the initializer to:

UnicodeString s("привет");

The macro you were using is only for strings that contain "invariant characters", i.e., only Latin letters, digits, and some punctuation.

As was said before, input/output codepages are tricky. You said:

"My terminal and font support UTF-8 and I regularly use the terminal with UTF-8. My source code is in UTF-8."

That may be true, but ICU doesn't know that's true. The process codepage might be different (let's say ISO-8859-1), and the output codepage might be different again (let's say Shift-JIS). In that case the program wouldn't work. Invariant characters passed through UNICODE_STRING_SIMPLE would still work, though.

Hope this helps.

srl, icu dev

烟凡古楼 2024-09-06 09:43:54

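What happens if you write the output to a file (either by redirecting with a pipe in the terminal, or by opening a file stream in the program itself)?

That would determine whether it is the terminal that fails to handle the output correctly.

What happens if you inspect the output string in the debugger? Does it contain the correct values? Find out what the UTF-8 encoding of your string should look like, and compare it against what you see in the debugger. Or print out the integer value of each byte and verify that those values are correct.

When working with encodings, it is always tricky (but essential) to determine whether the problem lies in the program itself or in the conversion that happens when the text is output to the system. Take the terminal out of the equation and verify that your program generates the correct output.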
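The byte-level check suggested above can be sketched in plain C++ with no ICU dependency. The helper name hexDump is made up for illustration. For reference, the UTF-8 encoding of "привет" is D0 BF D1 80 D0 B8 D0 B2 D0 B5 D1 82. If a dump of the program's output starts with C3 90 C2 BF instead, the correct UTF-8 bytes were most likely treated as ISO-8859-1 and encoded to UTF-8 a second time.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: formats each byte of `bytes` as two uppercase hex
// digits, separated by single spaces.
std::string hexDump(const std::string& bytes)
{
    std::string out;
    char buf[4];
    for (unsigned char c : bytes) {
        // %02X prints the byte value as exactly two hex digits.
        std::snprintf(buf, sizeof buf, "%02X", static_cast<unsigned>(c));
        if (!out.empty()) {
            out += ' ';
        }
        out += buf;
    }
    return out;
}
```

Passing the bytes that the program is about to write (for example, a std::string filled by an explicit conversion) to hexDump and printing the result to std::cerr keeps the check independent of how the terminal renders non-ASCII text.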

情归归情 2024-09-06 09:43:54

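operator<<(ostream, UnicodeString) converts between UTF-16 and chars using ICU's "default converter". AFAIU, the "default converter" (if you haven't set it explicitly with ucnv_setDefaultName()) depends on the platform and on how ICU was compiled. What do you get from ucnv_getDefaultName()?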
