Handling UTF-8 in C++

Posted on 2024-12-21 07:11:59

To find out whether C++ is the right language for a project of mine, I want to test its UTF-8 capabilities. Following some references, I built this example:

#include <string>
#include <iostream>

using namespace std;

int main() {
    wstring str;
    while(getline(wcin, str)) {
        wcout << str << endl;
        if(str.empty()) break;
    }

    return 0;
}

But when I type in a non-ASCII UTF-8 character, it misbehaves:

$ > ./utf8 
Hello
Hello
für
f
$ >

Not only does it fail to print the ü, it also quits immediately. gdb told me there was no crash, just a normal exit, yet I find that hard to believe.

谈情不如逗狗 2024-12-28 07:11:59

Don't use wstring on Linux.

std::wstring VS std::string

Take a look at the first answer there. I'm sure it answers your question.

When should I use std::wstring over std::string?
On Linux? Almost never (§).
On Windows? Almost always (§).

绝對不後悔。 2024-12-28 07:11:59

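The language itself is largely agnostic about Unicode or any other character encoding; the choice is driven by the operating system. Windows uses UTF-16 for its Unicode support, which means using wide characters (wchar_t is 16 bits wide there), i.e. wchar_t or std::wstring. The Unicode ("W") variants of the Win API functions that take strings expect wide-character input.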

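UNIX-based systems such as Mac OS X or Linux, on the other hand, use UTF-8. Of course, it is only a matter of how you interpret the bytes in the array, so you could even store a UTF-16 string in a plain C array or a std::string container. This is why you rarely see wstrings in cross-platform code; instead, all strings are handled as UTF-8 and re-encoded to UTF-16 (on Windows) when necessary.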
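
As a small illustration of "only a matter of how you interpret the bytes", the sketch below counts code points in a UTF-8 std::string. The name utf8_code_points is invented for this example, and it assumes the input is already valid UTF-8:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a UTF-8 encoded
// std::string by counting every byte that is NOT a continuation byte
// (continuation bytes have the bit pattern 10xxxxxx).
// Assumption: `s` is valid UTF-8; no validation is performed.
std::size_t utf8_code_points(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) ++count;
    }
    return count;
}
```

For the string "für" from the question, size() reports 4 bytes while this function reports 3 code points, because "ü" is encoded as the two bytes 0xC3 0xBC.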

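There are several ways to handle this somewhat confusing situation. Personally, I do it as described above: use UTF-8 strictly throughout the application, re-encode strings when interacting with the Windows API, and use them directly on Mac OS X. For the re-encoding on Windows I used these great conversion helpers:

C++ UTF-8 Conversion Helpers (on MSDN, available under the Apache License, Version 2.0).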
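
For readers who just want to see what such a re-encoding step involves, here is a hand-rolled, portable sketch. It is not the code from the helpers mentioned above; the function name utf8_to_utf16 and the choice to throw std::invalid_argument on malformed input are assumptions made for this example:

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes UTF-8 bytes into code points and re-encodes
// them as UTF-16. Throws std::invalid_argument on malformed input.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;        // code point being decoded
        std::size_t extra = 0;  // number of continuation bytes expected
        char32_t min_cp = 0;    // smallest value allowed for this length
        if (lead < 0x80)                { cp = lead;        extra = 0; min_cp = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; min_cp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; min_cp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; min_cp = 0x10000; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra > in.size() - i - 1)
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        // Reject overlong forms, out-of-range values and encoded surrogates.
        if (cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point in UTF-8 input");

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

On Windows, where wchar_t is 16 bits wide, the resulting code units could be copied into a std::wstring before calling a wide-character API. In real code, a tested library or the platform's own conversion routine is probably the safer choice.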

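You can also use Qt's cross-platform QString, which provides conversion functions between UTF-8, UTF-16 and other encodings (ANSI, Latin...).

So the answer above is correct: always use UTF-8 (std::string, char) on Unix, and UTF-16 (std::wstring, wchar_t) on Windows.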
