C++ & Boost: encode/decode UTF-8

Posted on 2024-11-09 18:07:14

I'm trying to do a very simple task: take a unicode-aware wstring and convert it to a string, encoded as UTF8 bytes, and then the opposite way around: take a string containing UTF8 bytes and convert it to unicode-aware wstring.

The problem is, I need it to be cross-platform and I need it to work with Boost... and I just can't seem to figure out a way to make it work. I've been toying with

Trying to convert the code to use stringstream/wstringstream instead of files or whatever, but nothing seems to work.

For instance, in Python it would look like so:

>>> u"שלום"
u'\u05e9\u05dc\u05d5\u05dd'
>>> u"שלום".encode("utf8")
'\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d'
>>> '\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d'.decode("utf8")
u'\u05e9\u05dc\u05d5\u05dd'

What I'm ultimately after is this:

wchar_t uchars[] = {0x5e9, 0x5dc, 0x5d5, 0x5dd, 0};
wstring ws(uchars);
string s = encode_utf8(ws); 
// s now holds "\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d"
wstring ws2 = decode_utf8(s);
// ws2 now holds {0x5e9, 0x5dc, 0x5d5, 0x5dd}

I really don't want to add another dependency on the ICU or something in that spirit... but to my understanding, it should be possible with Boost.

Some sample code would be greatly appreciated! Thanks

Comments (4)

独夜无伴 2024-11-16 18:07:14

Thanks everyone, but ultimately I resorted to http://utfcpp.sourceforge.net/; it's a header-only library that's very lightweight and easy to use. I'm sharing some demo code here, should anyone find it useful:

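#include <iterator>   // std::back_inserter
#include <string>
#include "utf8.h"     // utfcpp

// Appends the code points decoded from the UTF-8 `bytes` to `wstr`.
inline void decode_utf8(const std::string& bytes, std::wstring& wstr)
{
    utf8::utf8to32(bytes.begin(), bytes.end(), std::back_inserter(wstr));
}
// Appends the UTF-8 encoding of the code points in `wstr` to `bytes`.
inline void encode_utf8(const std::wstring& wstr, std::string& bytes)
{
    utf8::utf32to8(wstr.begin(), wstr.end(), std::back_inserter(bytes));
}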

Usage:

wstring ws(L"\u05e9\u05dc\u05d5\u05dd");
string s;
encode_utf8(ws, s);
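
One portability caveat on the above: utf8::utf32to8 and utf8::utf8to32 treat every element as a full 32-bit code point. That matches wchar_t where it is 32 bits wide (typically Linux and macOS), but on Windows wchar_t is 16 bits and holds UTF-16, so utfcpp's utf8::utf16to8 and utf8::utf8to16 would be the matching calls there. To make the transformation itself concrete, here is a minimal dependency-free sketch that works on std::u32string to sidestep the wchar_t width question. The helper names (append_utf8, encode_code_points, decode_code_points) are made up for illustration, and this is not how utfcpp is implemented internally.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: append one Unicode code point to `out` as UTF-8.
inline void append_utf8(std::uint32_t cp, std::string& out)
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("not a Unicode scalar value");
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx then three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: encode a sequence of code points as UTF-8 bytes.
inline std::string encode_code_points(const std::u32string& cps)
{
    std::string out;
    for (char32_t cp : cps)
        append_utf8(static_cast<std::uint32_t>(cp), out);
    return out;
}

// Hypothetical helper: decode UTF-8 bytes into code points.
// Throws std::invalid_argument on malformed, truncated or overlong input.
inline std::u32string decode_code_points(const std::string& bytes)
{
    // Smallest code point that legitimately needs 1, 2, 3 or 4 bytes.
    static const std::uint32_t min_cp[] = {0x0, 0x80, 0x800, 0x10000};
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;             // number of continuation bytes
        if (b0 < 0x80)                { cp = b0;        extra = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; extra = 1; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; extra = 2; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        if (bytes.size() - i <= extra)
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min_cp[extra] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("overlong or out-of-range UTF-8 sequence");
        out += static_cast<char32_t>(cp);
        i += extra + 1;
    }
    return out;
}
```

With these, encode_code_points(U"\u05e9\u05dc\u05d5\u05dd") yields the eight bytes from the question and decode_code_points reverses it. For real code a maintained library is the safer choice; the sketch is only meant to show what the conversion does.
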
无人问我粥可暖 2024-11-16 18:07:14

There's already a Boost link in the comments, but in the almost-standard C++0x (since standardized as C++11), there is std::wstring_convert, which does this:

#include <iostream>
#include <string>
#include <locale>
#include <codecvt>
int main()
{
    wchar_t uchars[] = {0x5e9, 0x5dc, 0x5d5, 0x5dd, 0};
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    std::string s = conv.to_bytes(uchars);
    std::wstring ws2 = conv.from_bytes(s);
    std::cout << std::boolalpha
              << (s == "\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d" ) << '\n'
              << (ws2 == uchars ) << '\n';
}

Output when compiled with MS Visual Studio 2010 EE SP1 or with clang++ 2.9:

true 
true
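
For anyone who wants the exact encode_utf8/decode_utf8 shape from the question, the same facility can be wrapped in two small functions. This is only a sketch under a few assumptions: the function names are taken from the question rather than from any library; std::codecvt_utf8<wchar_t> treats each wchar_t as one code point, which is fine where wchar_t is 32 bits but limits you to the BMP where it is 16 bits (std::codecvt_utf8_utf16<wchar_t> is the UTF-16 variant); and both <codecvt> and std::wstring_convert were deprecated in C++17, so newer compilers may warn.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical wrapper: wide string -> UTF-8 bytes.
inline std::string encode_utf8(const std::wstring& ws)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.to_bytes(ws);
}

// Hypothetical wrapper: UTF-8 bytes -> wide string.
// from_bytes throws std::range_error if the input cannot be converted.
inline std::wstring decode_utf8(const std::string& bytes)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.from_bytes(bytes);
}
```

Returning by value keeps call sites as short as the Python version: string s = encode_utf8(ws); wstring ws2 = decode_utf8(s);
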
少女净妖师 2024-11-16 18:07:14

Boost.Locale was released in Boost 1.48 (November 15th, 2011), making it easier to convert to and from UTF-8/UTF-16.

Here are some convenient examples from the docs:

// Requires <boost/locale.hpp>; to_utf, from_utf and utf_to_utf live in boost::locale::conv.
string utf8_string = to_utf<char>(latin1_string,"Latin1");
wstring wide_string = to_utf<wchar_t>(latin1_string,"Latin1");
string latin1_string = from_utf(wide_string,"Latin1");
string utf8_string2 = utf_to_utf<char>(wide_string);

Almost as easy as Python encoding/decoding :)

Note that Boost.Locale is not a header-only library.

野の 2024-11-16 18:07:14

For a drop-in replacement for std::string/std::wstring that handles utf8, see TINYUTF8.

In combination with <codecvt> you can convert between UTF-8 and pretty much every other encoding, and then handle the UTF-8 data through the above library.
