Using Unicode in C++ source code
What is the standard encoding of C++ source code? Does the C++ standard even say something about this? Can I write C++ source in Unicode?
For example, can I use non-ASCII characters such as Chinese characters in comments? If so, is full Unicode allowed or just a subset of Unicode? (e.g., that 16-bit first page or whatever it's called.)
Furthermore, can I use Unicode for strings? For example:
std::wstring str = L"Strange chars: â Țđ ě €€";
Encoding in C++ is quite complicated. Here is my understanding of it.
Every implementation has to support the characters of the basic source character set, which are listed in §2.2/1 (§2.3/1 in C++11). Each of these characters must fit into a single char. In addition, implementations have to support a way of naming other characters, called universal-character-names, which look like \uffff or \Uffffffff and refer to Unicode characters. A subset of them can be used in identifiers (listed in Annex E).
This is all nice, but the mapping from the characters in the file to source characters (used at compile time) is implementation-defined, and that mapping constitutes the encoding used. The C++98 standard says so in its description of the first translation phase (§2.1/1): physical source file characters are mapped, in an implementation-defined manner, to the basic source character set.
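As a small illustration of universal-character-names, the sketch below spells the same code point in the four-digit and the eight-digit form. The names euro_short, euro_long and same_code_point are made up for this example, and the code assumes a C++11 compiler so that U"..." literals are available.

```cpp
#include <string>

// Two spellings of U+20AC (EURO SIGN) as universal-character-names.
// U"..." is a UTF-32 literal (C++11), so each element is one code point.
const std::u32string euro_short = U"\u20AC";      // four hex digits
const std::u32string euro_long  = U"\U000020AC";  // eight hex digits

// Hypothetical helper: true when both spellings name the same code point.
bool same_code_point() {
    return euro_short == euro_long && euro_short[0] == 0x20AC;
}
```

Because the characters are named by code point, the result does not depend on how the source file itself is encoded.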
For gcc, you can change it with the option -finput-charset=charset. You can also change the execution character sets used to represent values at run time: -fexec-charset=charset for char (it defaults to UTF-8) and -fwide-exec-charset=charset for wchar_t (it defaults to either UTF-16 or UTF-32, depending on the size of wchar_t).
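To see what the execution character set does to a narrow literal, it can help to dump the bytes the compiler stored. The helper below is only a sketch; hex_bytes is a hypothetical name, not part of any library.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: formats every byte of s as two lowercase hex digits,
// separated by single spaces.
std::string hex_bytes(const std::string& s) {
    std::string out;
    char buf[4];
    for (unsigned char c : s) {
        std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned>(c));
        if (!out.empty()) out += ' ';
        out += buf;
    }
    return out;
}
```

With g++ and -fexec-charset=UTF-8, hex_bytes("\u20AC") would be expected to give e2 82 ac, while -fexec-charset=ISO-8859-15 should give the single byte a4, assuming the local iconv knows that charset name.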
In addition to litb's post, MSVC++ supports Unicode too. As I understand it, it gets the Unicode encoding from the BOM. It definitely supports code like int (*♫)(); or const std::set<int> ∅; if you're really into code obfuscation.
The C++ standard doesn't say anything about source-code file encoding, so far as I know.
The usual encoding is (or used to be) 7-bit ASCII -- some compilers (Borland's, for instance) would balk at ASCII characters that used the high-bit. There's no technical reason that Unicode characters can't be used, if your compiler and editor accept them -- most modern Linux-based tools, and many of the better Windows-based editors, handle UTF-8 encoding with no problem, though I'm not sure that Microsoft's compiler will.
EDIT: It looks like Microsoft's compilers will accept Unicode-encoded files, but will sometimes report errors on 8-bit (extended ASCII) input too.
There are two issues at play here. The first is what characters are allowed in C++ code (and comments), such as variable names. The second is what characters are allowed in strings and string literals.
As noted, C++ compilers must support a very restricted ASCII-based character set for the characters allowed in code and comments. In practice, this character set didn't work well with some European character sets (and especially with some European keyboards that lacked a few characters, such as square brackets), so digraphs and trigraphs were introduced. Many compilers now accept more than this character set, but there isn't any guarantee.
As for strings and string literals, C++ has the concept of a wide character and a wide character string. However, the encoding of that character set is implementation-defined. In practice it's almost always Unicode, but I don't think there's any guarantee here. Wide string literals look like L"string literal", and they can be assigned to a std::wstring.
C++11 added explicit support for Unicode string literals, encoded as UTF-8 (u8"..."), UTF-16 (u"...") or UTF-32 (U"..."). The UTF-16 and UTF-32 literals are stored as char16_t and char32_t code units in the platform's native byte order, so big-endian and little-endian are not separate literal kinds.
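The difference between the literal kinds shows up in how many code units the same text occupies. The following is a rough sketch, assuming a C++11 compiler; UnitCounts and count_units are names invented for this example.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type: code units used by each literal kind.
struct UnitCounts {
    std::size_t wide;   // L"..."  (wchar_t, implementation-defined width)
    std::size_t utf16;  // u"..."  (char16_t)
    std::size_t utf32;  // U"..."  (char32_t)
};

// Counts code units for U+20AC followed by U+1F600; the second
// code point lies outside the Basic Multilingual Plane.
UnitCounts count_units() {
    return { std::wstring(L"\u20AC\U0001F600").size(),
             std::u16string(u"\u20AC\U0001F600").size(),
             std::u32string(U"\u20AC\U0001F600").size() };
}
```

The UTF-16 string needs a surrogate pair for U+1F600, so it holds three code units, and the UTF-32 string holds two. The wide count depends on the platform: it should match the UTF-16 count where wchar_t is 16 bits and the UTF-32 count where it is 32 bits.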
For encoding in strings I think you are meant to use the \u notation, e.g.:
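One way the \u notation could be applied to the string from the question is sketched here. The function name strange_chars is made up, and the code point values are my own reading of the characters in the question (â U+00E2, Ț U+021A, đ U+0111, ě U+011B, € U+20AC).

```cpp
#include <string>

// The question's literal, written with universal-character-names only,
// so the source file can stay pure ASCII.
std::wstring strange_chars() {
    return L"Strange chars: \u00E2 \u021A\u0111 \u011B \u20AC\u20AC";
}
```

All five characters are in the Basic Multilingual Plane, so each one should take a single wchar_t whether wchar_t is 16 or 32 bits wide.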
It's also worth noting that wide characters in C++ aren't really Unicode strings as such. They are just strings of larger characters, usually 16 but sometimes 32 bits wide. The width is implementation-defined; IIRC you can even have an 8-bit wchar_t. You have no real guarantee about the encoding in them, so if you are trying to do something like text processing, you will probably want a typedef to the integer type that best suits your Unicode entity.
C++11 adds Unicode support in the form of UTF-8 encoded string literals (u8"text"), UTF-16 and UTF-32 data types (char16_t and char32_t), and the corresponding string literals (u"text" and U"text"). The encoding of characters written without \uxxxx or \Uxxxxxxxx escapes is still implementation-defined, though, and there is no encoding support for complex string types outside the literals.
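The typedef idea might look something like the sketch below, which stores code points in a fixed-width integer and encodes them to UTF-8 by hand. The names code_point and append_utf8 are hypothetical; this is not an API from the standard library.

```cpp
#include <cstdint>
#include <string>

// Hypothetical typedef: one Unicode code point, independent of wchar_t.
typedef std::uint32_t code_point;

// Appends the UTF-8 encoding of cp to out.
// Returns false (and appends nothing) for surrogates and values above U+10FFFF.
bool append_utf8(code_point cp, std::string& out) {
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate range
    if (cp > 0x10FFFF) return false;                  // beyond Unicode
    if (cp < 0x80) {                                  // 1 byte
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                          // 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {                        // 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                          // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return true;
}
```

Keeping the code point type separate from wchar_t means the text-processing code behaves the same on platforms with 16-bit and 32-bit wide characters.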
In this context, if you get MSVC++ warning C4819, just change the source file encoding to "UTF-8 with BOM".
GCC 4.1 doesn't support this, but GCC 4.4 does, and the latest Qt version uses GCC 4.4, so use "UTF-8 with BOM" as the source file encoding.
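To check whether a source file already carries the UTF-8 BOM, the first three bytes can be compared against EF BB BF. A minimal sketch, assuming the file contents have already been read into a std::string; has_utf8_bom is a made-up name.

```cpp
#include <string>

// Hypothetical helper: true when bytes starts with the UTF-8
// byte order mark (0xEF 0xBB 0xBF).
bool has_utf8_bom(const std::string& bytes) {
    return bytes.size() >= 3 &&
           static_cast<unsigned char>(bytes[0]) == 0xEF &&
           static_cast<unsigned char>(bytes[1]) == 0xBB &&
           static_cast<unsigned char>(bytes[2]) == 0xBF;
}
```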
AFAIK it's not standardized, as you can put any type of character in wide strings.
You just have to check that your compiler is set to treat the source code as Unicode to make it work right.