Does (w)ifstream support different encodings?

Posted on 2024-08-02 01:12:15

When I read a text file into a wide character string (std::wstring) using a wifstream, does the stream implementation support different encodings, i.e. can it be used to read e.g. ASCII, UTF-8, and UTF-16 files?

If not, what would I have to do?

(I need to read the entire file, if that makes a difference)

Comments (3)

夜光 2024-08-09 01:12:16

C++ supports character encodings by means of std::locale and the facet std::codecvt. The general idea is that a locale object describes the aspects of the system that might vary from culture to culture and from (human) language to language. These aspects are broken down into facets: classes, stored in the locale and looked up by type, that localization-dependent objects (including I/O streams) consult to do their work. When you read from an istream or write to an ostream, each character that is read or written is filtered through the locale's facets. The facets cover not only the encoding of Unicode types but also such varied features as how large numbers are written (e.g. with commas or periods), currency, time, capitalization, and a slew of other details.
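To make the facet mechanism concrete, here is a minimal sketch using a numeric-punctuation facet rather than an encoding one, since it is easier to observe. The names dot_grouping and format_with_dots are made up for illustration; std::numpunct, std::locale and imbue are the standard pieces.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet for illustration: group digits in threes, separated by '.'.
class dot_grouping : public std::numpunct<char> {
protected:
    char do_thousands_sep() const override { return '.'; }
    std::string do_grouping() const override { return "\3"; }
};

// Formats a number through a stream whose locale carries the facet above.
std::string format_with_dots(long value) {
    std::ostringstream out;
    // The locale takes ownership of the facet and deletes it when done.
    out.imbue(std::locale(std::locale::classic(), new dot_grouping));
    out << value;
    return out.str();
}
```

With this facet imbued, format_with_dots(1234567) yields "1.234.567". A std::codecvt facet is plugged into a file stream in the same way, except that it governs how bytes in the file map to characters in memory.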

However, just because the facilities exist to do encodings doesn't mean the standard library actually handles all encodings, nor does it make such code simple to get right. Even something as basic as the size of the character type you should be reading into (let alone the encoding part) is difficult, as wchar_t can be too small (mangling your data) or too large (wasting space), and the most common compilers differ on its size: wchar_t is 16 bits in Visual C++ and typically 32 bits in GNU C++ on Linux. So you generally need to find external libraries to do the actual encoding.
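A small sketch of how one might check this at compile time. The function name wchar_holds_any_code_point is an assumption made for this example; it only tests whether a single wchar_t can represent every Unicode code point (up to U+10FFFF).

```cpp
#include <limits>

// True when one wchar_t can hold any Unicode code point (U+0000..U+10FFFF).
// Expected to be false where wchar_t is 16 bits, true where it is 32 bits.
constexpr bool wchar_holds_any_code_point() {
    return static_cast<unsigned long long>(
               std::numeric_limits<wchar_t>::max()) >= 0x10FFFFull;
}
```

Code that assumes one wchar_t per code point could guard itself with static_assert(wchar_holds_any_code_point(), "...") and fall back to char32_t or a library type otherwise.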

  • iconv is generally acknowledged to be correct, but examples of how to bind it to the C++ mechanism are hard to find.
  • jla3ep mentions libICU, which is very thorough, but its C++ API does not try to play nicely with the standard library (as far as I can tell; you can scan the examples to see if you can do better).

The most straightforward example I can find that covers all the bases is from Boost's UTF-8 codecvt facet, with an example that specifically converts between UTF-8 and UCS-4 for use by I/O streams. It looks like this, though I don't suggest just copying it verbatim. It takes a little more digging in the source to understand it (and I don't claim to):

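typedef wchar_t ucs4_t;

// Copy the current global locale and add the UTF-8 conversion facet to it.
std::locale old_locale;
std::locale utf8_locale(old_locale, new utf8_codecvt_facet<ucs4_t>);

...

// Imbue the stream so that it decodes UTF-8 into wide characters as it reads.
std::wifstream input_file("data.utf8");
input_file.imbue(utf8_locale);
ucs4_t item = 0;
while (input_file >> item) { ... }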
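For comparison, a rough sketch of the same idea using only the standard library, reading the entire file as the question asks. It relies on std::codecvt_utf8 from <codecvt>, which has been deprecated since C++17 and is removed in C++26, so treat it as an illustration under that assumption rather than a recommendation. read_utf8_file is a hypothetical helper name.

```cpp
#include <codecvt>   // std::codecvt_utf8 (deprecated since C++17)
#include <fstream>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical helper: reads a whole UTF-8 encoded file into a std::wstring.
// Where wchar_t is 16 bits, code points above U+FFFF cannot be represented.
std::wstring read_utf8_file(const std::string& path) {
    std::wifstream in;
    // Imbue before opening so no data is read under the default conversion.
    // The locale takes ownership of the facet.
    in.imbue(std::locale(in.getloc(), new std::codecvt_utf8<wchar_t>));
    in.open(path.c_str(), std::ios::in | std::ios::binary);

    std::wstringstream buffer;
    buffer << in.rdbuf();   // copy the whole decoded stream
    return buffer.str();
}
```

A file in a different encoding needs a different facet; the stream does not detect the encoding by itself.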

To understand more about locales, and how they use facets (including codecvt), take a look at the following:

帅气称霸 2024-08-09 01:12:16

ifstream does not care about the file's encoding. It just reads chars (bytes) from the file. wifstream reads wide characters (wchar_t), but it still doesn't know anything about the file's encoding. wifstream is good enough for UCS-2, a fixed-length character encoding for Unicode in which each character is represented by two bytes.
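Since the stream only hands over bytes, one option is to read the whole file raw and look for a byte order mark before choosing a decoder. The sketch below assumes the file may start with a BOM; bom_kind, read_all_bytes and detect_bom are names invented for this example, and a file without a BOM still needs its encoding to be known some other way.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

enum class bom_kind { none, utf8, utf16_le, utf16_be };

// Reads the entire file as raw bytes; no decoding takes place.
std::string read_all_bytes(const std::string& path) {
    std::ifstream in(path.c_str(), std::ios::in | std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Inspects the first bytes for a byte order mark.
// Note: FF FE followed by 00 00 is also the UTF-32 LE mark; not handled here.
bom_kind detect_bom(const std::string& bytes) {
    auto b = [&bytes](std::size_t i) {
        return static_cast<unsigned char>(bytes[i]);
    };
    if (bytes.size() >= 3 && b(0) == 0xEF && b(1) == 0xBB && b(2) == 0xBF)
        return bom_kind::utf8;
    if (bytes.size() >= 2 && b(0) == 0xFF && b(1) == 0xFE)
        return bom_kind::utf16_le;
    if (bytes.size() >= 2 && b(0) == 0xFE && b(1) == 0xFF)
        return bom_kind::utf16_be;
    return bom_kind::none;
}
```

The raw bytes can then be passed to whichever converter matches the detected encoding, such as ICU or iconv.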

You could use the IBM ICU library to deal with Unicode files.

The International Components for Unicode (ICU) is a mature, portable set of C/C++ and Java libraries for Unicode support, software internationalization (I18N) and globalization (G11N), giving applications the same results on all platforms.

ICU is released under a nonrestrictive open source license that is suitable for use with both commercial software and with other open source or free software.

他不在意 2024-08-09 01:12:16

The design of wide character strings and wide character streams predates UTF-8, UTF-16 and Unicode. If you want to get technical, the standard string and the standard stream don't necessarily operate on ASCII (it's just that basically all computers out there use ASCII; you could potentially have an EBCDIC machine).

Raymond Chen once wrote a series illustrating how to work with different wide character stream/string types.
