Does (w)ifstream support different encodings?

Posted on 2024-08-02 01:12:15

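When I use wifstream to read a text file into a wide string (std::wstring), does the stream implementation support different encodings? That is, can it be used to read ASCII, UTF-8, and UTF-16 files?

If not, what should I do instead?

(I need to read the entire file, if that makes a difference.)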

夜光 2024-08-09 01:12:16

C++ supports character encodings through std::locale and the std::codecvt facet. The general idea is that a locale object describes the aspects of the system that vary from culture to culture and from (human) language to language. These aspects are broken down into facets: objects held by a locale that define how locale-dependent operations (including I/O stream conversions) behave. When you read from an istream or write to an ostream, each character passes through the facets of the stream's locale. The facets cover not only character encoding but also such varied features as how large numbers are written (e.g. grouped with commas or periods), currency, time, capitalization, and a slew of other details.
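To make the facet idea concrete, here is a minimal sketch that installs a custom std::numpunct facet so that a stream groups digits with commas. The names comma_grouping and format_with_commas are made up for illustration; only the standard-library calls are real.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet: groups digits in threes, separated by commas.
struct comma_grouping : std::numpunct<char> {
protected:
    char do_thousands_sep() const override { return ','; }
    std::string do_grouping() const override { return "\3"; }
};

// Formats a number through a stream whose locale carries the facet above.
std::string format_with_commas(long value) {
    std::ostringstream out;
    // The locale takes ownership of the facet and deletes it when done.
    out.imbue(std::locale(std::locale::classic(), new comma_grouping));
    out << value;
    return out.str();
}
```

Calling format_with_commas(1234567) should yield "1,234,567". A codecvt facet is installed the same way, but it affects how a file stream converts bytes rather than how numbers are formatted.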

However, just because the facilities for encodings exist doesn't mean the standard library actually handles all encodings, nor does it make such code simple to get right. Even something as basic as the size of the character type you should be reading into (let alone the encoding part) is difficult: wchar_t can be too small (mangling your data) or too large (wasting space), and the most common compilers differ on its size (16 bits in Visual C++, typically 32 bits in GNU C++ on Linux). So you generally need to find an external library to do the actual encoding conversion.
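A quick way to see which situation you are in is to check the width of wchar_t on your platform. This is only a sketch; the helper names wchar_bits and wchar_holds_any_code_point are made up for illustration.

```cpp
#include <climits>
#include <cstddef>

// Width of wchar_t in bits on the current platform.
constexpr std::size_t wchar_bits() { return sizeof(wchar_t) * CHAR_BIT; }

// Unicode code points go up to U+10FFFF, which needs 21 bits.
// A 16-bit wchar_t therefore cannot hold every code point in one unit.
constexpr bool wchar_holds_any_code_point() { return wchar_bits() >= 21; }
```

If wchar_holds_any_code_point() is false, characters outside the Basic Multilingual Plane need surrogate pairs (UTF-16) or get mangled, depending on the conversion used.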

iconv is generally acknowledged to be correct, but examples of how to bind it to the C++ mechanism are hard to find.
jla3ep mentions libICU, which is very thorough, but its C++ API does not try to play nicely with the standard library (as far as I can tell; you can scan the examples to see if you can do better).

The most straightforward example I can find that covers all the bases is Boost's UTF-8 codecvt facet, whose documentation includes an example that converts between UTF-8 and UCS-4 for use by I/O streams. It looks like this, though I don't suggest copying it verbatim. It takes a little more digging in the source to understand it (and I don't claim to):

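typedef wchar_t ucs4_t;

// Copy the current global locale, adding the UTF-8 conversion facet to it.
std::locale old_locale;
std::locale utf8_locale(old_locale, new utf8_codecvt_facet<ucs4_t>);

...

std::wifstream input_file("data.utf8");
input_file.imbue(utf8_locale);  // bytes from the file are now decoded as UTF-8
ucs4_t item = 0;
while (input_file >> item) { ... }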
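If you would rather not depend on Boost, the standard library has a similar facet, std::codecvt_utf8 from <codecvt>. Be aware that it was added in C++11 and deprecated in C++17, so treat the following as a sketch under that assumption rather than a recommended long-term solution. The function name read_utf8_file is made up for illustration; it reads the whole file, as the question asks.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <sstream>
#include <string>

// Reads an entire UTF-8 encoded file into a wide string.
// Assumes std::codecvt_utf8 is available (C++11; deprecated since C++17).
std::wstring read_utf8_file(const std::string& path) {
    std::wifstream input(path, std::ios::binary);
    // Install the UTF-8 facet before any character is read.
    input.imbue(std::locale(input.getloc(), new std::codecvt_utf8<wchar_t>));
    std::wstringstream buffer;
    buffer << input.rdbuf();  // copy everything up to end of file
    return buffer.str();
}
```

On platforms where wchar_t is 16 bits, this facet can only represent characters in the Basic Multilingual Plane, which is the size problem described above.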

To understand more about locales, and how they use facets (including codecvt), take a look at the following:

Nathan Myers has a thorough explanation of locales and facets. Myers was one of the designers of the locale concept. He has more formal documentation if you want to wade through it.
Apache's Standard Library implementation (formerly RogueWave's) has a full list of facets.
Chapter 14 of Nicolai Josuttis' The C++ Standard Library is devoted to the subject.
Angelika Langer and Klaus Kreft's Standard C++ IOStreams and Locales is an entire book on the subject.

帅气称霸 2024-08-09 01:12:16

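ifstream does not care about the encoding of the file. It just reads chars (bytes) from the file. wifstream reads wide characters (wchar_t), but it still knows nothing about the file's encoding. wifstream is good enough for UCS-2, a fixed-length character encoding for Unicode in which each character is represented by two bytes.

You can use the IBM ICU library to deal with Unicode files.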
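Since ifstream only hands you bytes, one common first step is to read the whole file in binary mode and look for a byte order mark (BOM) before deciding how to decode it. The sketch below assumes the file starts with a BOM, which many UTF-8 files do not; the names Encoding, read_bytes, and detect_bom are made up for illustration.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

enum class Encoding { Utf8Bom, Utf16LE, Utf16BE, Unknown };

// Reads the whole file as raw bytes, with no encoding conversion.
std::string read_bytes(const std::string& path) {
    std::ifstream input(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(input),
                       std::istreambuf_iterator<char>());
}

// Guesses the encoding from a leading byte order mark, if there is one.
// Note: a UTF-32LE BOM also starts with FF FE and is not distinguished here.
Encoding detect_bom(const std::string& bytes) {
    auto byte_at = [&bytes](std::size_t i) {
        return static_cast<unsigned char>(bytes[i]);
    };
    if (bytes.size() >= 3 && byte_at(0) == 0xEF && byte_at(1) == 0xBB &&
        byte_at(2) == 0xBF)
        return Encoding::Utf8Bom;
    if (bytes.size() >= 2 && byte_at(0) == 0xFF && byte_at(1) == 0xFE)
        return Encoding::Utf16LE;
    if (bytes.size() >= 2 && byte_at(0) == 0xFE && byte_at(1) == 0xFF)
        return Encoding::Utf16BE;
    return Encoding::Unknown;  // plain ASCII and BOM-less UTF-8 end up here
}
```

Typical use would be detect_bom(read_bytes(path)), followed by handing the bytes to whichever decoder matches (ICU, iconv, or a codecvt facet).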