Why does wide file-stream in C++ narrow written data by default?



Honestly, I just don't get the following design decision in the C++ Standard library. When writing wide characters to a file, the wofstream converts wchar_t into char characters:

#include <fstream>
#include <string>

int main()
{
    using namespace std;

    wstring someString = L"Hello StackOverflow!";
    wofstream file(L"Test.txt");

    file << someString; // the output file will consist of ASCII characters!
}

I am aware that this has to do with the standard codecvt. There is a codecvt for UTF-8 in Boost. Also, there is a codecvt for UTF-16 by Martin York here on SO. The question is: why does the standard codecvt convert wide characters? Why not write the characters as they are?

Also, are we gonna get real unicode streams with C++0x or am I missing something here?


5 Answers

老娘不死你永远是小三 2024-08-13 00:14:33


A very partial answer to the first question: a file is a sequence of bytes, so when dealing with wchar_t, at least some conversion between wchar_t and char must occur. Making this conversion "intelligently" requires knowledge of the character encodings, which is why the conversion is allowed to be locale-dependent, by virtue of using a facet in the stream's locale.

Then, the question is how that conversion should be made in the only locale required by the standard: the "classic" one. There is no "right" answer for that, and the standard is thus very vague about it. I understand from your question that you assume that blindly casting (or memcpy()-ing) between wchar_t[] and char[] would have been a good way. This is not unreasonable, and is in fact what is (or at least was) done in some implementations.

Another POV would be that, since a codecvt is a locale facet, it is reasonable to expect the conversion to be made using the "locale's encoding" (I'm being handwavy here, as the concept is pretty fuzzy). For example, one would expect a Turkish locale to use ISO-8859-9, or a Japanese one to use Shift JIS. By analogy, the "classic" locale would convert to this "locale's encoding". Apparently, Microsoft chose to simply trim (which leads to ISO-8859-1 if we assume that wchar_t represents UTF-16 and that we stay in the basic multilingual plane), while the Linux implementation I know about decided to stick to ASCII.
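For what it's worth, here is a minimal sketch (not part of the question's code) of how one can replace that facet, assuming the codecvt_utf8 facet that was standardized in C++0x/C++11 (and later deprecated in C++17): imbuing it into the stream swaps the "narrow to char" behavior for real UTF-8 serialization.

#include <fstream>
#include <locale>
#include <codecvt>  // std::codecvt_utf8 (C++11, deprecated in C++17)
#include <string>

int main()
{
    std::wstring someString = L"Hello StackOverflow!";
    std::wofstream file("Test.txt");

    // Replace the codecvt facet of the stream's locale with a UTF-8 one,
    // so wide characters are serialized as UTF-8 instead of being narrowed.
    // This must be done before any output.
    file.imbue(std::locale(file.getloc(), new std::codecvt_utf8<wchar_t>));

    file << someString; // Test.txt now contains UTF-8 bytes
}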

For your second question:

Also, are we gonna get real unicode streams with C++0x or am I missing something here?

In the [locale.codecvt] section of n2857 (the latest C++0x draft I have at hand), one can read:

The specialization codecvt<char16_t, char, mbstate_t> converts between the UTF-16 and UTF-8 encoding schemes, and the specialization codecvt<char32_t, char, mbstate_t> converts between the UTF-32 and UTF-8 encoding schemes. codecvt<wchar_t, char, mbstate_t> converts between the native character sets for narrow and wide characters.

In the [locale.stdcvt] section, we find:

For the facet codecvt_utf8:
— The facet shall convert between UTF-8 multibyte sequences and UCS2 or UCS4 (depending on the size of Elem) within the program.
[...]

For the facet codecvt_utf16:
— The facet shall convert between UTF-16 multibyte sequences and UCS2 or UCS4 (depending on the size of Elem) within the program.
[...]

For the facet codecvt_utf8_utf16:
— The facet shall convert between UTF-8 multibyte sequences and UTF-16 (one or two 16-bit codes) within the program.

So I guess that this means "yes", but you'd have to be more precise about what you mean by "real unicode streams" to be sure.
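To illustrate what these facets buy you in practice, here is a minimal sketch using wstring_convert together with the codecvt_utf8_utf16 facet quoted above (assuming the facilities as they were eventually standardized in C++11; both were later deprecated in C++17):

#include <codecvt>  // std::codecvt_utf8_utf16
#include <locale>   // std::wstring_convert
#include <string>
#include <iostream>

int main()
{
    // A converter that round-trips between UTF-16 (char16_t) and UTF-8 bytes.
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;

    std::u16string utf16 = u"Hello StackOverflow!";
    std::string utf8 = conv.to_bytes(utf16);     // UTF-16 -> UTF-8
    std::u16string back = conv.from_bytes(utf8); // UTF-8 -> UTF-16

    std::cout << utf8 << '\n';
}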

娇柔作态 2024-08-13 00:14:33


The model used by C++ for charsets is inherited from C, and so dates back to at least 1989.

Two main points:

  • IO is done in terms of char.
  • It is the job of the locale to determine how wide characters are serialized.
  • The default locale (named "C") is very minimal (I don't remember the exact constraints from the standard; here it is able to handle only 7-bit ASCII as both the narrow and the wide character set).
  • There is an environment-determined locale named "".

So to get anything, you have to set the locale.

If I use the simple program

#include <locale>
#include <fstream>
#include <ostream>
#include <iostream>

int main()
{
    wchar_t c = 0x00FF;
    std::locale::global(std::locale(""));
    std::wofstream os("test.dat");
    os << c << std::endl;
    if (!os) {
        std::cout << "Output failed\n";
    }
}

which uses the environment locale and outputs the wide character with code 0x00FF to a file. If I ask to use the "C" locale, I get

$ env LC_ALL=C ./a.out
Output failed

The locale was unable to handle the wide character, and we are notified of the problem because the IO failed. If I instead ask for a UTF-8 locale, I get

$ env LC_ALL=en_US.utf8 ./a.out
$ od -t x1 test.dat
0000000 c3 bf 0a
0000003

(od -t x1 just dumps the file in hex), which is exactly what I expect for a UTF-8 encoded file.

み零 2024-08-13 00:14:33


I don't know about wofstream. But C++0x will include new distinct character types (char16_t, char32_t) of guaranteed width and signedness (unsigned) which can be portably used for UTF-8, UTF-16 and UTF-32. In addition, there will be new string literals (u"Hello!" for a UTF-16 encoded string literal, for example).

Check out the most recent C++0x draft (N2960).
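For illustration, a short sketch of those new types and literals as they ended up in C++11:

#include <string>

int main()
{
    const char16_t* s16 = u"Hello!";  // UTF-16 encoded string literal
    const char32_t* s32 = U"Hello!";  // UTF-32 encoded string literal
    const char*     s8  = u8"Hello!"; // UTF-8 encoded (type char in C++11)

    std::u16string str16 = s16;       // new string typedefs to go with them
    std::u32string str32 = s32;
    (void)s8;
}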

朕就是辣么酷 2024-08-13 00:14:33


For your first question, this is my guess.

The IOStreams library was constructed under a couple of premises regarding encodings. For converting between Unicode and other not-so-usual encodings, for example, it's assumed that:

  • Inside your program, you should use a (fixed-width) wide-character encoding.
  • Only external storage should use (variable-width) multibyte encodings.

I believe that is the reason for the existence of the two template specializations of std::codecvt: one that maps between char types (maybe you're simply working with ASCII) and another that maps between wchar_t (internal to your program) and char (external devices). So whenever you need to perform a conversion to a multibyte encoding, you should do it byte-by-byte. Notice that you can write a facet that handles the encoding state as you read/write each byte from/to the multibyte encoding.

Thinking this way, the behavior of the C++ standard is understandable. After all, you're using wide-character ASCII-encoded strings (assuming this is the default on your platform and you did not switch locales). The "natural" conversion would be to convert each wide-character ASCII character to an ordinary (in this case, one-char) ASCII character. (The conversion exists and is straightforward.)

By the way, I'm not sure if you know, but you can avoid this by creating a facet that returns noconv for the conversions. Then, you would have your file with wide characters.
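A minimal sketch of that idea (NoConv is a name made up for this example; whether a noconv wide facet really yields a byte-for-byte copy of the wchar_t data is up to the implementation, though at least libstdc++ writes the bytes directly in that case):

#include <fstream>
#include <locale>

// A codecvt facet that claims no conversion is needed, so the filebuf
// may write the internal wchar_t representation to the file as raw bytes.
class NoConv : public std::codecvt<wchar_t, char, std::mbstate_t> {
protected:
    virtual bool do_always_noconv() const throw() { return true; }
};

int main()
{
    std::wofstream file("Test.txt", std::ios_base::binary);
    file.imbue(std::locale(file.getloc(), new NoConv));
    file << L"Hello StackOverflow!"; // raw wchar_t bytes end up on disk
}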

橘香 2024-08-13 00:14:33


Check this out:
Class basic_filebuf

You can alter the default behavior by setting a wide character buffer, using pubsetbuf. Once you do that, the output will be wchar_t and not char.

In other words, for your example you will have:

wofstream file(L"Test.txt", ios_base::binary); //binary is important to set!  
wchar_t buffer[128];  
file.rdbuf()->pubsetbuf(buffer, 128);  
file.put(0xFEFF); //this is the BOM flag, UTF16 needs this, but mirosoft's UNICODE doesn't, so you can skip this line, if any.  
file << someString; // the output file will consist of unicode characters! without the call to pubsetbuf, the out file will be ANSI (current regional settings)  