Why does wide file-stream in C++ narrow written data by default?



Honestly, I just don't get the following design decision in the C++ Standard library. When writing wide characters to a file, the wofstream converts wchar_t into char characters:

#include <fstream>
#include <string>

int main()
{
    using namespace std;

    wstring someString = L"Hello StackOverflow!";
    wofstream file(L"Test.txt");

    file << someString; // the output file will consist of ASCII characters!
}

I am aware that this has to do with the standard codecvt. There is a codecvt for UTF-8 in Boost. Also, there is a codecvt for UTF-16 by Martin York here on SO. The question is: why does the standard codecvt convert wide characters? Why not write the characters as they are?

Also, are we gonna get real unicode streams with C++0x or am I missing something here?


5 Answers

老娘不死你永远是小三 2024-08-13 00:14:33


A very partial answer to the first question: a file is a sequence of bytes, so when dealing with wchar_t, at least some conversion between wchar_t and char must occur. Making this conversion "intelligently" requires knowledge of the character encodings, which is why the conversion is allowed to be locale-dependent, by virtue of using a facet in the stream's locale.

Then, the question is how that conversion should be made in the only locale required by the standard: the "classic" one. There is no "right" answer for that, and the standard is thus very vague about it. I understand from your question that you assume that blindly casting (or memcpy()-ing) between wchar_t[] and char[] would have been a good way. This is not unreasonable, and is in fact what is (or at least was) done in some implementations.

Another POV would be that, since a codecvt is a locale facet, it is reasonable to expect the conversion to be made using the "locale's encoding" (I'm being handwavy here, as the concept is pretty fuzzy). For example, one would expect a Turkish locale to use ISO-8859-9, or a Japanese one to use Shift JIS. By analogy, the "classic" locale would convert to this "locale's encoding". Apparently, Microsoft chose to simply trim (which leads to ISO-8859-1 if we assume that wchar_t represents UTF-16 and that we stay in the basic multilingual plane), while the Linux implementation I know about decided to stick to ASCII.
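For what it's worth, here is a minimal sketch (not part of the question's code) of how one can replace that facet, assuming the codecvt_utf8 facet that was standardized in C++0x/C++11 (and later deprecated in C++17): imbuing it into the stream swaps the "narrow to char" behavior for real UTF-8 serialization.

#include <fstream>
#include <locale>
#include <codecvt>  // std::codecvt_utf8 (C++11, deprecated in C++17)
#include <string>

int main()
{
    std::wstring someString = L"Hello StackOverflow!";
    std::wofstream file("Test.txt");

    // Replace the codecvt facet of the stream's locale with a UTF-8 one,
    // so wide characters are serialized as UTF-8 instead of being narrowed.
    // This must be done before any output.
    file.imbue(std::locale(file.getloc(), new std::codecvt_utf8<wchar_t>));

    file << someString; // Test.txt now contains UTF-8 bytes
}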

For your second question:

Also, are we gonna get real unicode streams with C++0x or am I missing something here?

In the [locale.codecvt] section of n2857 (the latest C++0x draft I have at hand), one can read:

The specialization codecvt<char16_t, char, mbstate_t> converts between the UTF-16 and UTF-8 encoding schemes, and the specialization codecvt<char32_t, char, mbstate_t> converts between the UTF-32 and UTF-8 encoding schemes. codecvt<wchar_t, char, mbstate_t> converts between the native character sets for narrow and wide characters.

In the [locale.stdcvt] section, we find:

For the facet codecvt_utf8:
— The facet shall convert between UTF-8 multibyte sequences and UCS2 or UCS4 (depending on the size of Elem) within the program.
[...]

For the facet codecvt_utf16:
— The facet shall convert between UTF-16 multibyte sequences and UCS2 or UCS4 (depending on the size of Elem) within the program.
[...]

For the facet codecvt_utf8_utf16:
— The facet shall convert between UTF-8 multibyte sequences and UTF-16 (one or two 16-bit codes) within the program.

So I guess that this means "yes", but you'd have to be more precise about what you mean by "real unicode streams" to be sure.
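To illustrate what these facets buy you in practice, here is a minimal sketch using wstring_convert together with the codecvt_utf8_utf16 facet quoted above (assuming the facilities as they were eventually standardized in C++11; both were later deprecated in C++17):

#include <codecvt>  // std::codecvt_utf8_utf16
#include <locale>   // std::wstring_convert
#include <string>
#include <iostream>

int main()
{
    // A converter that round-trips between UTF-16 (char16_t) and UTF-8 bytes.
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;

    std::u16string utf16 = u"Hello StackOverflow!";
    std::string utf8 = conv.to_bytes(utf16);     // UTF-16 -> UTF-8
    std::u16string back = conv.from_bytes(utf8); // UTF-8 -> UTF-16

    std::cout << utf8 << '\n';
}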

娇柔作态 2024-08-13 00:14:33


The model used by C++ for charsets is inherited from C, and so dates back to at least 1989.

Two main points:

  • IO is done in terms of char.
  • It is the job of the locale to determine how wide characters are serialized.
  • The default locale (named "C") is very minimal (I don't remember the exact constraints from the standard; here it is able to handle only 7-bit ASCII as both the narrow and the wide character set).
  • There is an environment-determined locale named "".

So to get anything, you have to set the locale.

If I use the simple program

#include <locale>
#include <fstream>
#include <ostream>
#include <iostream>

int main()
{
    wchar_t c = 0x00FF;
    std::locale::global(std::locale(""));
    std::wofstream os("test.dat");
    os << c << std::endl;
    if (!os) {
        std::cout << "Output failed\n";
    }
}

which uses the environment locale and outputs the wide character with code 0x00FF to a file. If I ask to use the "C" locale, I get

$ env LC_ALL=C ./a.out
Output failed

The locale was unable to handle the wide character, and we are notified of the problem because the IO failed. If I instead ask for a UTF-8 locale, I get

$ env LC_ALL=en_US.utf8 ./a.out
$ od -t x1 test.dat
0000000 c3 bf 0a
0000003

(od -t x1 just dumps the file in hex), which is exactly what I expect for a UTF-8 encoded file.

み零 2024-08-13 00:14:33


I don't know about wofstream. But C++0x will include new distinct character types (char16_t, char32_t) of guaranteed width and signedness (unsigned) which can be portably used for UTF-8, UTF-16 and UTF-32. In addition, there will be new string literals (u"Hello!" for a UTF-16 encoded string literal, for example).

Check out the most recent C++0x draft (N2960).
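For illustration, a short sketch of those new types and literals as they ended up in C++11:

#include <string>

int main()
{
    const char16_t* s16 = u"Hello!";  // UTF-16 encoded string literal
    const char32_t* s32 = U"Hello!";  // UTF-32 encoded string literal
    const char*     s8  = u8"Hello!"; // UTF-8 encoded (type char in C++11)

    std::u16string str16 = s16;       // new string typedefs to go with them
    std::u32string str32 = s32;
    (void)s8;
}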

朕就是辣么酷 2024-08-13 00:14:33


For your first question, this is my guess.

The IOStreams library was constructed under a couple of premises regarding encodings. For converting between Unicode and other not-so-usual encodings, for example, it's assumed that:

  • Inside your program, you should use a (fixed-width) wide-character encoding.
  • Only external storage should use (variable-width) multibyte encodings.

I believe that is the reason for the existence of the two template specializations of std::codecvt: one that maps between char types (maybe you're simply working with ASCII) and another that maps between wchar_t (internal to your program) and char (external devices). So whenever you need to perform a conversion to a multibyte encoding, you should do it byte-by-byte. Notice that you can write a facet that handles the encoding state as you read/write each byte from/to the multibyte encoding.

Thinking this way, the behavior of the C++ standard is understandable. After all, you're using wide-character ASCII-encoded strings (assuming this is the default on your platform and you did not switch locales). The "natural" conversion would be to convert each wide-character ASCII character to an ordinary (in this case, one-char) ASCII character. (The conversion exists and is straightforward.)

By the way, I'm not sure if you know, but you can avoid this by creating a facet that returns noconv for the conversions. Then, you would have your file with wide characters.
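A minimal sketch of that idea (NoConv is a name made up for this example; whether a noconv wide facet really yields a byte-for-byte copy of the wchar_t data is up to the implementation, though at least libstdc++ writes the bytes directly in that case):

#include <fstream>
#include <locale>

// A codecvt facet that claims no conversion is needed, so the filebuf
// may write the internal wchar_t representation to the file as raw bytes.
class NoConv : public std::codecvt<wchar_t, char, std::mbstate_t> {
protected:
    virtual bool do_always_noconv() const throw() { return true; }
};

int main()
{
    std::wofstream file("Test.txt", std::ios_base::binary);
    file.imbue(std::locale(file.getloc(), new NoConv));
    file << L"Hello StackOverflow!"; // raw wchar_t bytes end up on disk
}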

橘香 2024-08-13 00:14:33


Check this out:
Class basic_filebuf

You can alter the default behavior by setting a wide character buffer, using pubsetbuf. Once you do that, the output will be wchar_t and not char.

In other words, for your example you will have:

wofstream file(L"Test.txt", ios_base::binary); //binary is important to set!  
wchar_t buffer[128];  
file.rdbuf()->pubsetbuf(buffer, 128);  
file.put(0xFEFF); //this is the BOM flag, UTF16 needs this, but mirosoft's UNICODE doesn't, so you can skip this line, if any.  
file << someString; // the output file will consist of unicode characters! without the call to pubsetbuf, the out file will be ANSI (current regional settings)  