Reading a Unicode UTF-8 file into wstring
How can I read a Unicode (UTF-8) file into wstring(s) on the Windows platform?
Comments (7)
With C++11 support, you can use the std::codecvt_utf8 facet, which encapsulates conversion between a UTF-8 encoded byte string and a UCS-2 or UCS-4 character string, and which can be used to read and write UTF-8 files, both text and binary.
To use a facet, you usually create a locale object, which encapsulates culture-specific information as a set of facets that collectively define a specific localized environment. Once you have a locale object, you can imbue your stream buffer with it:
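A minimal sketch of that imbue step, assuming a standard library that still ships <codecvt> (the header is deprecated since C++17). The helper name open_utf8 is made up for illustration. It imbues the stream before opening the file, so every read goes through the UTF-8 facet.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

// Hypothetical helper (not from the answer): opens `path` for wide-character
// reading with a UTF-8 decoding facet in place.
void open_utf8(std::wifstream& in, const std::string& path) {
    // The locale takes ownership of the facet, so it is never deleted by hand.
    const std::locale utf8_locale(std::locale(), new std::codecvt_utf8<wchar_t>);
    in.imbue(utf8_locale);  // imbue first, so no bytes are read with the old facet
    in.open(path);
}
```

On Windows, where wchar_t is 16 bits wide, std::codecvt_utf8_utf16<wchar_t> may be the better fit if the file can contain characters outside the Basic Multilingual Plane.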
which can be used like this:
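For instance, a whole file could be pulled into a std::wstring roughly as follows. read_utf8_file and count_characters are illustrative names, not the answer's own, and the same <codecvt> assumption applies.

```cpp
#include <codecvt>
#include <cstddef>
#include <fstream>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical helper: decodes the whole UTF-8 file at `path` into a wstring.
std::wstring read_utf8_file(const std::string& path) {
    std::wifstream in;
    in.imbue(std::locale(std::locale(), new std::codecvt_utf8<wchar_t>));
    in.open(path);
    std::wstringstream buffer;
    buffer << in.rdbuf();  // drains the file through the UTF-8 facet
    return buffer.str();   // empty if the file is missing or empty
}

// Example call site: number of wchar_t units decoded from the file.
std::size_t count_characters(const std::string& path) {
    const std::wstring text = read_utf8_file(path);
    return text.size();
}
```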
Alternatively, you can set the global C++ locale before you work with string streams, which causes all future calls to the std::locale default constructor to return a copy of the global C++ locale (you then don't need to explicitly imbue stream buffers with it):
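A sketch of the global-locale variant, again assuming <codecvt> is available. Both function names are hypothetical. The first returns the previous global locale so that a caller can restore it, since changing the global locale affects every stream created afterwards.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical helper: installs a UTF-8 decoding locale as the global C++
// locale and returns the locale that was global before.
std::locale install_global_utf8_locale() {
    return std::locale::global(
        std::locale(std::locale(), new std::codecvt_utf8<wchar_t>));
}

// Hypothetical helper: no imbue() call here. A stream constructed after the
// global locale was replaced starts out with that locale.
std::wstring read_file_with_global_locale(const std::string& path) {
    std::wifstream in(path);
    std::wstringstream buffer;
    buffer << in.rdbuf();
    return buffer.str();
}
```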
According to a comment by @Hans Passant, the simplest way is to use _wfopen_s. Open the file with mode "rt, ccs=UTF-8".
Here is another pure C++ solution that works at least with VC++ 2010:
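A portable sketch in the same spirit, reading the file line by line. It substitutes the default-constructed std::locale() and a narrow-character path for the two non-standard pieces this answer mentions, so it is not the VC++ 2010 listing itself. read_utf8_lines is a hypothetical name.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>
#include <vector>

// Hypothetical helper: returns the lines of a UTF-8 file as wide strings.
std::vector<std::wstring> read_utf8_lines(const std::string& path) {
    // Copy of the global locale with its codecvt facet replaced by a UTF-8 one.
    const std::locale utf8_locale(std::locale(), new std::codecvt_utf8<wchar_t>);
    std::wifstream stream;
    stream.imbue(utf8_locale);
    stream.open(path);

    std::vector<std::wstring> lines;
    std::wstring line;
    while (std::getline(stream, line)) {
        lines.push_back(line);  // the line terminator is not stored
    }
    return lines;
}
```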
Except for locale::empty() (here locale::global() might work as well) and the wchar_t* overload of the basic_ifstream constructor, this should even be pretty standard-compliant (where “standard” means C++0x, of course).
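Going back to the _wfopen_s suggestion at the start of this answer, a sketch of that route might look as follows. It is guarded with _WIN32 because _wfopen_s and the ccs= mode flag are Microsoft CRT extensions. read_utf8_crt and the buffer size are arbitrary choices for illustration.

```cpp
#ifdef _WIN32  // _wfopen_s and "ccs=UTF-8" exist only in the Microsoft CRT
#include <cstdio>
#include <cwchar>
#include <string>

// Hypothetical helper: reads a UTF-8 text file into a wstring via the CRT.
std::wstring read_utf8_crt(const wchar_t* path) {
    std::wstring text;
    FILE* file = nullptr;
    if (_wfopen_s(&file, path, L"rt, ccs=UTF-8") != 0 || file == nullptr) {
        return text;  // could not open the file: return an empty string
    }
    wchar_t buffer[1024];
    // With ccs=UTF-8 the CRT converts the bytes to wide characters on read,
    // so the wide-character functions must be used on this stream.
    while (std::fgetws(buffer, 1024, file) != nullptr) {
        text += buffer;
    }
    std::fclose(file);
    return text;
}
#endif
```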
Here's a platform-specific function for Windows only:
Use like so:
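As a rough idea of what such a Windows-only function and a call to it could look like, the sketch below loads the file into memory and converts it with the Win32 MultiByteToWideChar API. Whether the answer used this API is an assumption; read_utf8_file_win, notes_length and notes.txt are placeholder names.

```cpp
#ifdef _WIN32  // MultiByteToWideChar is a Win32 API
#include <windows.h>
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: loads the whole file, then converts UTF-8 to UTF-16.
std::wstring read_utf8_file_win(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    const std::string bytes((std::istreambuf_iterator<char>(in)),
                            std::istreambuf_iterator<char>());
    if (bytes.empty()) return std::wstring();

    const int byte_count = static_cast<int>(bytes.size());
    // First call: ask how many wchar_t units the converted text needs.
    const int length = MultiByteToWideChar(CP_UTF8, 0, bytes.data(),
                                           byte_count, nullptr, 0);
    if (length <= 0) return std::wstring();

    std::wstring text(static_cast<std::size_t>(length), L'\0');
    // Second call: convert into the buffer that was just sized.
    MultiByteToWideChar(CP_UTF8, 0, bytes.data(), byte_count, &text[0], length);
    return text;
}

// Example call site; "notes.txt" is a placeholder file name.
std::size_t notes_length() {
    const std::wstring text = read_utf8_file_win("notes.txt");
    return text.size();  // number of UTF-16 code units
}
#endif
```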
Note that the entire file is loaded into memory, so you might not want to use it for very large files.
I recently had to deal with all of these encodings and solved it this way. It is better to use std::u32string, as it has a stable size on all platforms, and most fonts work with the UTF-32 format. (The file should still be in UTF-8.)
Feel free to use standard functions other than gcount, and save the result of tellg to pos_type only. Also, be sure to pass a separator to std::getline (if you don't, the function throws std::bad_cast).
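A sketch of the std::u32string idea, using a swapped-in technique: instead of a stream facet, it reads the raw bytes and decodes them by hand, which sidesteps the gcount, tellg and std::getline caveats above. decode_utf8 and read_utf8_file_u32 are hypothetical names, and the decoder only checks the byte structure (overlong forms and surrogate code points are not rejected).

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes UTF-8 bytes into UTF-32 code points.
std::u32string decode_utf8(const std::string& bytes) {
    std::u32string out;
    for (std::size_t i = 0; i < bytes.size();) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // continuation bytes expected after the lead
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (extra > bytes.size() - i - 1)
            throw std::runtime_error("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);  // append six payload bits
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}

// Hypothetical helper: reads the file as bytes and decodes it in one go.
std::u32string read_utf8_file_u32(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open file");
    const std::string bytes((std::istreambuf_iterator<char>(in)),
                            std::istreambuf_iterator<char>());
    return decode_utf8(bytes);
}
```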
This question was addressed in "Confused about C++'s std::wstring, UTF-16, UTF-8 and displaying strings in a Windows GUI". In sum, wstring is based upon the UCS-2 standard, which is the predecessor of UTF-16. This is a strictly two-byte standard. I believe this covers Arabic.
This is a bit raw, but how about reading the file as plain old bytes and then casting the byte buffer to wchar_t*?
Something like:
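A sketch along these lines, with an important assumption: no decoding takes place, so this only yields sensible text when the file already stores wchar_t-sized code units in native byte order (for example UTF-16LE on Windows). It does not convert UTF-8. The helper name is made up, and std::memcpy stands in for the raw pointer cast to stay clear of aliasing problems.

```cpp
#include <cstring>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper: reinterprets a file's raw bytes as wchar_t units.
// Assumes the file holds native-endian wchar_t data, not UTF-8.
std::wstring read_raw_as_wstring(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    const std::vector<char> bytes((std::istreambuf_iterator<char>(in)),
                                  std::istreambuf_iterator<char>());
    // Any trailing bytes that do not fill a whole wchar_t are dropped.
    std::wstring text(bytes.size() / sizeof(wchar_t), L'\0');
    if (!text.empty()) {
        std::memcpy(&text[0], bytes.data(), text.size() * sizeof(wchar_t));
    }
    return text;
}
```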