C++ & Boost: encode/decode UTF-8
I'm trying to do a very simple task: take a Unicode-aware wstring and convert it to a string encoded as UTF-8 bytes, and then the opposite way around: take a string containing UTF-8 bytes and convert it to a Unicode-aware wstring.
The problem is, I need it to be cross-platform and I need it to work with Boost... and I just can't seem to figure out a way to make it work. I've been toying with
- http://www.edobashira.com/2010/03/using-boost-code-facet-for-reading-utf8.html and
- http://www.boost.org/doc/libs/1_46_0/libs/serialization/doc/codecvt.html
trying to convert the code to use stringstream/wstringstream instead of files or whatever, but nothing seems to work.
For instance, in Python it would look like so:
>>> u"שלום"
u'\u05e9\u05dc\u05d5\u05dd'
>>> u"שלום".encode("utf8")
'\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d'
>>> '\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d'.decode("utf8")
u'\u05e9\u05dc\u05d5\u05dd'
What I'm ultimately after is this:
wchar_t uchars[] = {0x5e9, 0x5dc, 0x5d5, 0x5dd, 0};
wstring ws(uchars);
string s = encode_utf8(ws);
// s now holds "\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d"
wstring ws2 = decode_utf8(s);
// ws2 now holds {0x5e9, 0x5dc, 0x5d5, 0x5dd}
I really don't want to add another dependency on ICU or something in that spirit... but to my understanding, it should be possible with Boost.
Some sample code would be greatly appreciated! Thanks
Comments (4)
Thanks everyone, but ultimately I resorted to http://utfcpp.sourceforge.net/. It's a header-only library that's very lightweight and easy to use. I'm sharing some demo code here, should anyone find it useful:
Usage:
There's already a Boost link in the comments, but in the almost-standard C++0x there is wstring_convert, which does this (tested with MS Visual Studio 2010 EE SP1 and with Clang++ 2.9).
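A minimal sketch of how the encode_utf8/decode_utf8 pair from the question might be written on top of std::wstring_convert. The choice of the std::codecvt_utf8<wchar_t> facet is an assumption here, not necessarily what the answer originally used. Note also that these facilities were deprecated in C++17, and that on platforms with a 16-bit wchar_t (such as Windows) this facet only covers the Basic Multilingual Plane.

```cpp
#include <codecvt>  // std::codecvt_utf8 (assumed facet; deprecated since C++17)
#include <locale>   // std::wstring_convert
#include <string>

// Encode a wide string as UTF-8 bytes.
std::string encode_utf8(const std::wstring& ws) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.to_bytes(ws);
}

// Decode UTF-8 bytes into a wide string.
// Throws std::range_error if the input is not valid UTF-8.
std::wstring decode_utf8(const std::string& s) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.from_bytes(s);
}
```

With these two helpers, the snippet from the question (building ws from uchars, calling encode_utf8, then decode_utf8) should round-trip the Hebrew string unchanged.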
Boost.Locale was released in Boost 1.48 (November 15th, 2011), making it easier to convert to and from UTF-8/UTF-16.
Here are some convenient examples from the docs:
Almost as easy as Python encoding/decoding :)
Note that Boost.Locale is not a header-only library.
For a drop-in replacement for std::string/std::wstring that handles UTF-8, see TINYUTF8. In combination with <codecvt> you can convert between UTF-8 and the other Unicode encodings (UTF-16, UCS-2/UCS-4), and then handle the UTF-8 result through the above library.
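As a rough illustration of the <codecvt> part only, here is a sketch that converts between UTF-8 and UTF-16 using the standard std::codecvt_utf8_utf16 facet. The helper names utf8_to_utf16 and utf16_to_utf8 are hypothetical, TINYUTF8 itself is not used, and the facet is deprecated since C++17.

```cpp
#include <codecvt>  // std::codecvt_utf8_utf16 (deprecated since C++17)
#include <locale>   // std::wstring_convert
#include <string>

// Hypothetical helper: UTF-8 bytes -> UTF-16 code units.
// Code points above U+FFFF become surrogate pairs.
std::u16string utf8_to_utf16(const std::string& s) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(s);
}

// Hypothetical helper: UTF-16 code units -> UTF-8 bytes.
std::string utf16_to_utf8(const std::u16string& s) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.to_bytes(s);
}
```

For example, the four UTF-8 bytes of U+1F600 (F0 9F 98 80) would come back from utf8_to_utf16 as the surrogate pair D83D DE00.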