Does (w)ifstream support different encodings?
When I read a text file into a wide character string (std::wstring) using a wifstream, does the stream implementation support different encodings, i.e. can it be used to read e.g. ASCII, UTF-8, and UTF-16 files?
If not, what would I have to do?
(I need to read the entire file, if that makes a difference.)
Comments (3)
C++ supports character encodings by means of std::locale and the facet std::codecvt. The general idea is that a locale object describes the aspects of the system that might vary from culture to culture and from (human) language to language. These aspects are broken down into facets, which are template arguments that define how localization-dependent objects (including I/O streams) are constructed. When you read from an istream or write to an ostream, the actual reading or writing of each character is filtered through the locale's facets. The facets cover not only the encoding of Unicode types but also such varied features as how large numbers are written (e.g. with commas or periods), currency, time, capitalization, and a slew of other details.
However, just because the facilities exist to do encodings doesn't mean the standard library actually handles all encodings, nor does it make such code simple to get right. Even such basic things as the size of the character type you should be reading into (let alone the encoding part) are difficult, as wchar_t can be too small (mangling your data) or too large (wasting space), and the most common compilers differ on how big their implementation is (16 bits in Visual C++, typically 32 bits in GNU C++ on Linux). So you generally need to find external libraries to do the actual encoding.
The most straightforward example I can find that covers all the bases is from Boost's UTF-8 codecvt facet, with an example that specifically tries to encode UTF-8 (UCS-4) for use by I/O streams. I don't suggest just copying that example verbatim; it takes a little more digging in the source to understand it (and I don't claim to).
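As a rough illustration of the locale and facet mechanism described in this answer (this is not the Boost example, just a sketch using the standard library), the code below imbues a wifstream with the std::codecvt_utf8 facet before opening a file, then reads the whole file into a std::wstring. The helper name read_utf8_file is made up for illustration. std::codecvt_utf8 comes from <codecvt>, which has been deprecated since C++17, so treat this as a sketch under that assumption rather than a recommendation.

```cpp
#include <codecvt>   // std::codecvt_utf8 (C++11; deprecated since C++17)
#include <fstream>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical helper: read an entire UTF-8 encoded file into a std::wstring.
std::wstring read_utf8_file(const std::string& path) {
    std::wifstream in;
    // Imbue before opening, so the facet is in place before any bytes are read.
    // The std::locale object takes ownership of the facet pointer.
    in.imbue(std::locale(in.getloc(), new std::codecvt_utf8<wchar_t>));
    in.open(path, std::ios::binary);

    // Copy the whole stream buffer; the facet decodes UTF-8 on the way in.
    std::wstringstream buffer;
    buffer << in.rdbuf();
    return buffer.str();
}
```

Calling read_utf8_file with the path of a UTF-8 file would return its decoded contents. Where wchar_t is 16 bits, std::codecvt_utf8<wchar_t> only covers UCS-2, so characters outside the Basic Multilingual Plane will not convert; that is one concrete form of the "too small" problem mentioned above.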
To understand more about locales, and how they use facets (including codecvt), take a look at the following:
ifstream does not care about the encoding of the file. It just reads chars (bytes) from the file. wifstream reads wide characters (wchar_t), but it still doesn't know anything about the file's encoding. wifstream is good enough for UCS-2, a fixed-length character encoding for Unicode (each character is represented with two bytes). You could use the IBM ICU library to deal with Unicode files.
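Since ifstream just hands back bytes, one possible first step (an assumption on my part, not something this answer prescribes) is to read the whole file as raw bytes and look for a byte order mark before deciding how to decode it. The names Encoding, read_bytes, and detect_bom below are hypothetical, and a file without a BOM simply comes back as Unknown.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical names, introduced for illustration only.
enum class Encoding { Utf8Bom, Utf16LE, Utf16BE, Unknown };

// Read the entire file as raw bytes. Binary mode keeps the bytes untranslated.
std::string read_bytes(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Guess the encoding from a leading byte order mark, if there is one.
Encoding detect_bom(const std::string& bytes) {
    if (bytes.compare(0, 3, "\xEF\xBB\xBF") == 0) return Encoding::Utf8Bom;
    if (bytes.compare(0, 2, "\xFF\xFE") == 0) return Encoding::Utf16LE;
    if (bytes.compare(0, 2, "\xFE\xFF") == 0) return Encoding::Utf16BE;
    return Encoding::Unknown;  // no BOM: plain ASCII, BOM-less UTF-8, or something else
}
```

A call such as detect_bom(read_bytes(path)) gives a hint about which decoder to use next. This sketch does not distinguish UTF-32 (whose little-endian BOM starts with the same two bytes as UTF-16LE), and a library such as ICU is the more robust route for real detection and conversion.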
The design of wide-character strings and wide-character streams predates UTF-8, UTF-16 and Unicode. If you want to get technical, the standard string and the standard stream don't necessarily operate on ASCII (it's just that basically all computers out there use ASCII; you could potentially have an EBCDIC machine).
Raymond Chen once wrote a series illustrating how to work with different wide character stream/string types.