Convert wstring to string encoded in UTF-8
I need to convert between wstring and string. I figured that using a codecvt facet should do the trick, but it doesn't seem to work for a UTF-8 locale.
My idea is that when I read a UTF-8 encoded file into chars, one UTF-8 character is read into two normal characters (which is how UTF-8 works). I'd like to create this UTF-8 string from the wstring representation, for a library I use in my code.
Does anybody know how to do it?
I already tried this:
    locale mylocale("cs_CZ.utf-8");
    mbstate_t mystate;
    wstring mywstring = L"čřžýáí";
    const codecvt<wchar_t,char,mbstate_t>& myfacet =
        use_facet<codecvt<wchar_t,char,mbstate_t> >(mylocale);
    codecvt<wchar_t,char,mbstate_t>::result myresult;
    size_t length = mywstring.length();
    char* pstr = new char[length+1];
    const wchar_t* pwc;
    char* pc;
    // translate characters:
    myresult = myfacet.out(mystate,
        mywstring.c_str(), mywstring.c_str()+length+1, pwc,
        pstr, pstr+length+1, pc);
    if (myresult == codecvt<wchar_t,char,mbstate_t>::ok)
        cout << "Translation successful: " << pstr << endl;
    else cout << "failed" << endl;
    return 0;
This prints "failed" for the cs_CZ.utf-8 locale and works correctly for the cs_CZ.iso8859-2 locale.
Comments (8)
The code below might help you :)
What's your platform? Note that Windows does not support UTF-8 locales, so this may explain why you're failing.
To get this done in a platform-dependent way, you can use MultiByteToWideChar/WideCharToMultiByte on Windows and iconv on Linux. You may be able to use some Boost magic to get this done in a platform-independent way, but I haven't tried it myself, so I can't say more about that option.
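For the Linux side, a minimal sketch of the iconv route might look like the following. It assumes a libc whose iconv accepts the "WCHAR_T" encoding name (glibc does), and the helper name wide_to_utf8_iconv is made up for illustration.

```cpp
#include <iconv.h>

#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts a wide string to UTF-8 through iconv.
// Assumes iconv understands "WCHAR_T" as the platform's wchar_t encoding.
std::string wide_to_utf8_iconv(const std::wstring& in) {
    iconv_t cd = iconv_open("UTF-8", "WCHAR_T");
    if (cd == (iconv_t)-1) {
        throw std::runtime_error("iconv_open failed");
    }

    // A single wchar_t never needs more than 4 bytes of UTF-8.
    std::string out(in.size() * 4, '\0');

    // iconv works on raw bytes and advances both pointers as it goes.
    char* src = reinterpret_cast<char*>(const_cast<wchar_t*>(in.data()));
    std::size_t src_left = in.size() * sizeof(wchar_t);
    char* dst = &out[0];
    std::size_t dst_left = out.size();

    std::size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);
    if (rc == static_cast<std::size_t>(-1)) {
        throw std::runtime_error("iconv conversion failed");
    }

    // Trim the unused tail of the output buffer.
    out.resize(out.size() - dst_left);
    return out;
}
```

Note that the output buffer is sized in bytes, not characters; a buffer of length+1 bytes is too small as soon as the text contains non-ASCII characters.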
On Windows you have to use std::codecvt_utf8_utf16<wchar_t>! Otherwise your conversion will fail on Unicode code points that need two 16-bit code units, such as 😉 (U+1F609).
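A minimal sketch of that approach, assuming std::wstring_convert from <codecvt> is available (it is deprecated since C++17 but still shipped by the major standard libraries); the helper name utf16_wide_to_utf8 is made up for illustration:

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: treats the wide string as UTF-16 code units, so a
// surrogate pair is combined into one code point before encoding as UTF-8.
std::string utf16_wide_to_utf8(const std::wstring& wide) {
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
    return conv.to_bytes(wide);  // throws std::range_error on invalid input
}
```

With this facet, the surrogate pair 0xD83D 0xDE09 comes out as the single four-byte UTF-8 sequence for U+1F609 rather than causing a conversion error.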
You can use Boost's utf_to_utf converter to get char output that can be stored in a std::string.
The currently most upvoted answer is not platform-independent. It breaks on non-BMP characters (e.g. emoji). JWiesemann already pointed this out in their answer, but their code will only work on Windows.
So here's a correct platform-independent version:
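The listing for this answer is not reproduced here. A sketch of what such a version might look like, assuming the facet is chosen by the width of wchar_t (the alias WideFacet and both function names are made up for illustration):

```cpp
#include <codecvt>
#include <locale>
#include <string>
#include <type_traits>

// Pick the facet by wchar_t width: UTF-32 where wchar_t is 4 bytes (typical
// on Linux and macOS), UTF-16 where it is 2 bytes (Windows).
using WideFacet = std::conditional<
    sizeof(wchar_t) == 4,
    std::codecvt_utf8<wchar_t>,
    std::codecvt_utf8_utf16<wchar_t>>::type;

// Hypothetical helper: wide string to UTF-8 bytes.
std::string wstring_to_utf8(const std::wstring& wide) {
    std::wstring_convert<WideFacet> conv;
    return conv.to_bytes(wide);
}

// Hypothetical helper: UTF-8 bytes to wide string.
std::wstring utf8_to_wstring(const std::string& utf8) {
    std::wstring_convert<WideFacet> conv;
    return conv.from_bytes(utf8);
}
```

Both std::wstring_convert and the <codecvt> facets are deprecated since C++17, so treat this as a stopgap rather than a long-term solution.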
On MSVC this might generate some deprecation warnings. You can disable these by wrapping the functions in a pragma block that suppresses the warning.
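The wrapper itself is not shown here. A sketch of how it might look, assuming the diagnostic in question is MSVC's general deprecation warning C4996 (the function to_utf8_quiet is a made-up stand-in for the conversion functions being wrapped):

```cpp
#include <codecvt>
#include <locale>
#include <string>
#include <type_traits>

#ifdef _MSC_VER
#pragma warning(push)
#pragma warning(disable : 4996)  // assumed: deprecation warning for <codecvt>
#endif

// Hypothetical stand-in for the conversion functions being wrapped.
std::string to_utf8_quiet(const std::wstring& wide) {
    using Facet = std::conditional<sizeof(wchar_t) == 4,
                                   std::codecvt_utf8<wchar_t>,
                                   std::codecvt_utf8_utf16<wchar_t>>::type;
    std::wstring_convert<Facet> conv;
    return conv.to_bytes(wide);
}

#ifdef _MSC_VER
#pragma warning(pop)  // restore the previous warning state
#endif
```

The push/pop pair keeps the suppression local to these functions, so deprecation warnings elsewhere in the translation unit are still reported.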
See this answer to another question as to why it's ok to disable that warning.
What a locale does is give the program information about the external encoding, on the assumption that the internal encoding does not change. If you want to output UTF-8, you need to do it from wchar_t, not from char*.
What you could do is output it as raw data (not as a string); it should then be interpreted correctly if the system's locale is UTF-8.
Also, when using (w)cout/(w)cerr/(w)cin, you need to imbue the locale on the stream.
The Lexertl library has an iterator that lets you do this:
C++ has no idea of Unicode. Use an external library such as ICU (UnicodeString class) or Qt (QString class); both support Unicode, including UTF-8.