Is it possible to put two character sets in the same file?
I am asking this purely out of curiosity. As far as I know, a file is generally stored in a single character set. But where is the character set type recorded? And is it possible to put two kinds of strings (such as std::string and std::wstring) in the same file?
Comments (4)
Character sets were introduced so that different programs could interpret the same byte values (namely single-byte characters whose decimal value is over 127 or, to put it another way, whose high-order bit is set) in different ways. If you want to switch character sets partway through a file or stream, you have to signal that to your program in some way, either in the file or out of band.
As for mixing std::string and std::wstring: it is possible, but it would be confusing at best. A std::string is (generally) ASCII or some other byte-oriented encoding, and a std::wstring is (generally) Unicode, although strictly speaking neither type records which encoding it holds. When generating your file, you could put in a signal or marker that tells your program to switch when reading it back in.
Generally, if you need more than one character set, you should be using Unicode (which can be represented with std::wstring). In fact, if you're handling user input at all, you should be using Unicode.
Go read Joel Spolsky's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)". It should help make things clearer.
If your question is about the encoding of the source file itself, the answer is that the C++ standard requires an implementation to support source files written in the basic source character set. Compiler implementations may support additional character sets; consult your compiler's manual for details.
As for using std::string and std::wstring variables in the same file: yes, you can use them together.
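As a small illustration of std::string and std::wstring living side by side in one source file, here is a minimal sketch. The function names are hypothetical, and widen_ascii assumes its input is pure ASCII (values 0 to 127); real text in another encoding would need a proper conversion rather than this element-by-element widening.

```cpp
#include <string>

// A narrow string and a wide string defined in the same translation unit.
std::string greeting() { return "hello"; }
std::wstring wide_greeting() { return L"hello"; }

// Widen a pure-ASCII narrow string by converting each char to a wchar_t.
// Assumption: every char is in the range 0..127.
std::wstring widen_ascii(const std::string& narrow) {
    return std::wstring(narrow.begin(), narrow.end());
}
```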
You are entirely free in how you encode characters. Files are containers of bytes. You can encode text as a mixture of ASCII, UTF-8, Big5, ... encoded characters, but it is up to you to indicate how each part is to be interpreted.
The convention, though, is to put a marker at the very start of the file that denotes the encoding (cf. Byte Order Mark on Wikipedia).
With XML this has become far more explicit (yet not completely covered): the encoding is named in the XML declaration on the first line, and if there is no declaration and no byte order mark, the file is taken to be UTF-8.
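To make the byte order mark convention concrete, here is a minimal sketch that inspects the first bytes of a buffer and reports which Unicode BOM, if any, it starts with. The names Bom, starts_with and detect_bom are hypothetical. A missing BOM says nothing about the actual encoding, and the bytes FF FE 00 00 are inherently ambiguous (a UTF-32LE BOM, or a UTF-16LE BOM followed by U+0000); this sketch resolves that in favour of UTF-32.

```cpp
#include <cstddef>
#include <string>

enum class Bom { None, Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

// True if `bytes` begins with the `n` bytes pointed to by `mark`.
static bool starts_with(const std::string& bytes, const char* mark, std::size_t n) {
    return bytes.size() >= n && bytes.compare(0, n, mark, n) == 0;
}

Bom detect_bom(const std::string& bytes) {
    // UTF-32 is tested before UTF-16 because FF FE 00 00 begins with FF FE.
    if (starts_with(bytes, "\x00\x00\xFE\xFF", 4)) return Bom::Utf32BE;
    if (starts_with(bytes, "\xFF\xFE\x00\x00", 4)) return Bom::Utf32LE;
    if (starts_with(bytes, "\xEF\xBB\xBF", 3)) return Bom::Utf8;
    if (starts_with(bytes, "\xFE\xFF", 2)) return Bom::Utf16BE;
    if (starts_with(bytes, "\xFF\xFE", 2)) return Bom::Utf16LE;
    return Bom::None;
}
```

In practice one would read the first four bytes of the file in binary mode, pass them to detect_bom, and fall back to a default or an out-of-band hint when the result is Bom::None.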
A file is just a sequence of bytes. A byte is just an 8-bit (on any modern hardware) binary number, ranging from 0 to 255 if interpreted as unsigned, or from -128 to 127 if interpreted as signed.
What these bytes mean is up to whoever designed that particular file format. The file could contain a sequence of characters in a single encoding that is indicated in some way or specified in the format's documentation; it could contain an unholy mess of different encodings with no way to distinguish between them (I've seen such things in real, critical applications); it could contain a mix of binary and text data; or it could contain binary data that has nothing to do with any characters or character sets whatsoever.
However, if your file format isn't binary, that is, if it contains text and only text, then it's generally an extremely bad idea to mix character sets. Using something uniform and ASCII-compatible like UTF-8 is probably the best way. Even in a binary format, it is still a good idea to encode all the text data in the same encoding. UTF-8 or UTF-16 (or even UTF-32) seem to be good choices there. Sometimes there are different requirements that you have to deal with, though. For example, a binary format may have an "old" version of its header and a "new" one. The old one may use some legacy character set, and the new one may use some form of Unicode. That's fine. But when it comes to pure-text formats, I've yet to see a widely used one that allows mixing character sets. Some allow you to choose a single character set for each file and to put a marker somewhere (like XML, HTML, Python sources).
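What "ASCII-compatible" means for UTF-8 can be shown with a small encoder: code points below 128 come out as the same single byte they have in ASCII, while everything else becomes a multi-byte sequence whose bytes all have the high bit set. This is a minimal sketch; the name encode_utf8 is hypothetical, and returning an empty string for invalid input is an arbitrary choice made for the example.

```cpp
#include <string>

// Encode one Unicode code point (U+0000..U+10FFFF, excluding surrogates) as UTF-8.
// Returns an empty string for values that are not valid Unicode scalar values.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        // ASCII range: one byte, identical to the ASCII encoding.
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogates are not encodable
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Because ASCII bytes never appear inside a multi-byte sequence, tools that only understand ASCII delimiters (newlines, commas, quotes) can still process a UTF-8 file safely, which is one reason it works well as the single encoding for a text format.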