Why does the wide file stream in C++ narrow written data by default?
Honestly, I just don't get the following design decision in the C++ standard library. When writing wide characters to a file, wofstream converts the wchar_t characters into char characters:
#include <fstream>
#include <string>
int main()
{
using namespace std;
wstring someString = L"Hello StackOverflow!";
wofstream file(L"Test.txt");
file << someString; // the output file will consist of ASCII characters!
}
I'm aware that this has to do with the standard codecvt. There is one in Boost, and there is also Martin York's one here. The question is: why does the standard codecvt convert wide characters? Why not write the characters as they are?
Also, will we get real unicode streams with C++0x, or am I missing something here?
A very partial answer for the first question: a file is a sequence of bytes, so when dealing with wchar_t, at least some conversion between wchar_t and char must occur. Making this conversion "intelligently" requires knowledge of the character encodings, which is why the conversion is allowed to be locale-dependent, by virtue of using a facet in the stream's locale.
Then the question is how that conversion should be made in the only locale required by the standard: the "classic" one. There is no "right" answer for that, and the standard is thus very vague about it. I understand from your question that you assume blindly casting (or memcpy()-ing) between wchar_t[] and char[] would have been a good way. This is not unreasonable, and is in fact what is (or at least was) done in some implementations.
Another POV would be that, since a codecvt is a locale facet, it is reasonable to expect the conversion to be made using the "locale's encoding" (I'm being handwavy here, as the concept is pretty fuzzy). For example, one would expect a Turkish locale to use ISO-8859-9, or a Japanese one to use Shift JIS. By similarity, the "classic" locale would convert to this "locale's encoding". Apparently, Microsoft chose to simply trim (which leads to ISO-8859-1 if we assume that wchar_t represents UTF-16 and that we stay in the basic multilingual plane), while the Linux implementation I know about decided to stick to ASCII.
For your second question:
In the [locale.codecvt] section of n2857 (the latest C++0x draft I have at hand), one can read:
In the [locale.stdcvt] section, we find:
So I guess that this means "yes", but you'd have to be more precise about what you mean by "real unicode streams" to be sure.
The model used by C++ for charsets is inherited from C, and so dates back to at least 1989.
Two main points:
So to get anything, you have to set the locale.
Take a simple program which uses the environment locale and outputs the wide character of code 0x00FF to a file. If I ask to use the "C" locale, I get
the locale has been unable to handle the wide character, and we get notified of the problem because the IO fails. If I run asking for a UTF-8 locale, I get
(od -t x1 just dumps the file in hex), exactly what I expect for a UTF-8 encoded file.
I don't know about wofstream. But C++0x will include new distinct character types (char16_t, char32_t) of guaranteed width and signedness (unsigned) which can be portably used for UTF-8, UTF-16 and UTF-32. In addition, there will be new string literals (u"Hello!" for a UTF-16 encoded string literal, for example).
Check out the most recent C++0x draft (N2960).
For your first question, this is my guess.
The IOStreams library was constructed under a couple of premises regarding encodings. For converting between Unicode and other not so usual encodings, for example, it's assumed that.
I believe that is the reason for the existence of the two template specializations of std::codecvt. One that maps between char types (maybe you're simply working with ASCII) and another that maps between wchar_t (internal to your program) and char (external devices). So whenever you need to perform a conversion to a multibyte encoding you should do it byte-by-byte. Notice that you can write a facet that handles encoding state when you read/write each byte from/to the multibyte encoding.
Thinking this way, the behavior of the C++ standard is understandable. After all, you're using wide-character ASCII encoded strings (assuming this is the default on your platform and you did not switch locales). The "natural" conversion would be to convert each wide-character ASCII character to an ordinary (in this case, one char) ASCII character. (The conversion exists and is straightforward.)
By the way, I'm not sure if you know, but you can avoid this by creating a facet that returns noconv for the conversions. Then, you would have your file with wide-characters.
Check this out:
Class basic_filebuf
You can alter the default behavior by setting a wide char buffer, using pubsetbuf.
Once you do that, the output will be wchar_t and not char.
In other words, for your example you will have: