Using Unicode in C++ source code

Posted on 2024-07-09 07:56:03

What is the standard encoding of C++ source code? Does the C++ standard even say something about this? Can I write C++ source in Unicode?

For example, can I use non-ASCII characters such as Chinese characters in comments? If so, is full Unicode allowed or just a subset of Unicode? (e.g., that 16-bit first page or whatever it's called.)

Furthermore, can I use Unicode for strings? For example:

std::wstring str = L"Strange chars: â Țđ ě €€";

Comments (8)

伴我老 2024-07-16 07:56:03

Encoding in C++ is fairly complicated. Here is my understanding of it.

Every implementation has to support the characters of the basic source character set, which are listed in §2.2/1 (§2.3/1 in C++11). Each of these characters must fit into one char. In addition, implementations have to support a way to name other characters, called universal-character-names, which look like \uffff or \Uffffffff and refer to Unicode characters. A subset of them can be used in identifiers (listed in Annex E).

This is all nice, but the mapping from the characters in the file to source characters (used at compile time) is implementation-defined. This mapping constitutes the encoding used. Here is what the standard says literally (C++98 version):

Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. Trigraph sequences (2.3) are replaced by corresponding single-character internal representations. Any source file character not in the basic source character set (2.2) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e. using the \uXXXX notation), are handled equivalently.)

For gcc, you can change it using the option -finput-charset=charset. Additionally, you can change the execution character set used to represent values at runtime. The options for this are -fexec-charset=charset for char (it defaults to UTF-8) and -fwide-exec-charset=charset for wchar_t (it defaults to either UTF-16 or UTF-32, depending on the size of wchar_t).
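
As a rough sketch of the equivalence the quoted paragraph describes, the snippet below compares a character typed directly into the source with the same character written as a universal-character-name. The function names are made up for illustration, and the comparison only holds under the assumption that the file is saved as UTF-8 and the compiler also reads it as UTF-8 (for example g++ with -finput-charset=UTF-8).

```cpp
#include <cassert>
#include <string>

// Hypothetical helper names, for illustration only.

// The euro sign typed directly into the source file. How these bytes are
// decoded depends on the source encoding the compiler assumes.
inline std::u32string euro_typed() { return U"€"; }

// The same character spelled as a universal-character-name, which does not
// depend on the encoding of the source file.
inline std::u32string euro_ucn() { return U"\u20AC"; }

// Passes only if the compiler decoded the file the way it was saved
// (assumed here: UTF-8 on both sides).
inline void check_equivalence() {
    assert(euro_ucn().size() == 1);
    assert(euro_ucn()[0] == 0x20AC);
    assert(euro_typed() == euro_ucn());
}
```

If the compiler assumes a different input charset than the one the editor used, the typed variant is the one that goes wrong; the \u form stays the same.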

浴红衣 2024-07-16 07:56:03

In addition to litb's post, MSVC++ supports Unicode too. As I understand it, it detects the Unicode encoding from the BOM. It definitely supports code like int (*♫)(); or const std::set<int> ∅;
If you're really into code obfuscation:

typedef void ‼; // Also known as \u203C
class ooɟ {
    operator ‼() {}
};

就是爱搞怪 2024-07-16 07:56:03

The C++ standard doesn't say anything about source-code file encoding, so far as I know.

The usual encoding is (or used to be) 7-bit ASCII; some compilers (Borland's, for instance) would balk at characters with the high bit set. There's no technical reason that Unicode characters can't be used if your compiler and editor accept them. Most modern Linux-based tools, and many of the better Windows-based editors, handle UTF-8 encoding with no problem, though I'm not sure that Microsoft's compiler will.

EDIT: It looks like Microsoft's compilers will accept Unicode-encoded files, but will sometimes warn about files in an 8-bit encoding too:

warning C4819: The file contains a character that cannot be represented
in the current code page (932). Save the file in Unicode format to prevent
data loss.

紫轩蝶泪 2024-07-16 07:56:03

There are two issues at play here. The first is what characters are allowed in C++ code (and comments), such as variable names. The second is what characters are allowed in strings and string literals.

As noted, C++ compilers must support a very restricted ASCII-based character set for the characters allowed in code and comments. In practice, this character set didn't work very well with some European character sets (and especially with some European keyboards that didn't have a few characters, like square brackets, available), so the concepts of digraphs and trigraphs were introduced. Many compilers accept more than this character set today, but there isn't any guarantee.

As for strings and string literals, C++ has the concept of a wide character and a wide character string. However, the encoding of that character set is not specified by the standard. In practice it's almost always Unicode, but I don't think there's any guarantee here. Wide character string literals look like L"string literal", and they can be assigned to std::wstring objects.

C++11 added explicit support for Unicode strings and string literals, encoded as UTF-8 (u8"..."), UTF-16 (u"...") and UTF-32 (U"..."). The literals are sequences of code units, so the byte order in memory is simply that of the platform's char16_t and char32_t.
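
To make the three literal kinds concrete, here is a small sketch that counts the code units each encoding needs for one character inside and one outside the Basic Multilingual Plane. The helper code_units and the check function are illustrative names, not part of any library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Illustrative helper: number of code units in a string literal,
// excluding the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) { return N - 1; }

// U+20AC (euro sign) lies in the Basic Multilingual Plane.
static_assert(code_units(u8"\u20AC") == 3, "three UTF-8 code units");
static_assert(code_units(u"\u20AC") == 1, "one UTF-16 code unit");
static_assert(code_units(U"\u20AC") == 1, "one UTF-32 code unit");

// U+1F600 lies outside it, so UTF-16 needs a surrogate pair.
static_assert(code_units(u8"\U0001F600") == 4, "four UTF-8 code units");
static_assert(code_units(u"\U0001F600") == 2, "two UTF-16 code units");
static_assert(code_units(U"\U0001F600") == 1, "one UTF-32 code unit");

// Runtime view of the same surrogate pair.
inline void check_surrogates() {
    std::u16string s = u"\U0001F600";
    assert(s.size() == 2);
    assert(s[0] == 0xD83D && s[1] == 0xDE00);
}
```

Only universal-character-names are used inside the literals, so the result does not depend on how the source file itself is encoded.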

浪荡不羁 2024-07-16 07:56:03

For encoding in strings I think you are meant to use the \u notation, e.g.:

std::wstring str = L"\u20AC"; // Euro character

屋檐 2024-07-16 07:56:03

It's also worth noting that wide characters in C++ aren't really Unicode strings as such. They are just strings of larger characters, usually 16 but sometimes 32 bits wide. The width is implementation-defined, though; IIRC you can even have an 8-bit wchar_t. You have no real guarantee as to the encoding in them, so if you are trying to do something like text processing, you will probably want a typedef to the integer type most suitable for your Unicode entity.

C++11 has additional Unicode support in the form of UTF-8 string literals (u8"text"), and UTF-16 and UTF-32 data types (char16_t and char32_t, IIRC) as well as corresponding string constants (u"text" and U"text"). The encoding of characters specified without \uxxxx or \Uxxxxxxxx constants is still implementation-defined, though, and there is no encoding support for complex string types outside the literals.
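
One way to get a fixed-width unit for text processing is to convert wide strings to char32_t code points up front. The following is only a sketch: to_code_points is a made-up helper, and it assumes wchar_t holds UTF-16 when it is 16 bits wide and UTF-32 otherwise, which is common in practice but not guaranteed by the standard.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, not from the post. Assumes wchar_t holds UTF-16 when
// it is 16 bits wide and UTF-32 otherwise; the standard guarantees neither.
inline std::u32string to_code_points(const std::wstring& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t c = static_cast<char32_t>(in[i]);
        // With a 16-bit wchar_t, combine a high/low surrogate pair
        // into a single code point.
        if (sizeof(wchar_t) == 2 && c >= 0xD800 && c <= 0xDBFF &&
            i + 1 < in.size()) {
            char32_t low = static_cast<char32_t>(in[i + 1]);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                c = 0x10000 + ((c - 0xD800) << 10) + (low - 0xDC00);
                ++i;  // the low surrogate has been consumed
            }
        }
        out.push_back(c);
    }
    return out;
}

// Small self-check under the assumption stated above.
inline void check_to_code_points() {
    assert(to_code_points(L"a\u20AC") == U"a\u20AC");
    assert(to_code_points(L"\U0001F600") == U"\U0001F600");
}
```

Unpaired surrogates are passed through unchanged here; real text processing code would have to decide how to report them.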

廻憶裏菂餘溫 2024-07-16 07:56:03

In this context, if you get MSVC++ warning C4819, just change the source file encoding to "UTF-8 with BOM".

GCC 4.1 doesn't support this, but GCC 4.4 does, and the latest Qt version uses GCC 4.4, so use "UTF-8 with BOM" as the source file encoding.
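
If it is unclear whether an editor actually wrote the BOM, the first three bytes of the file can be checked. This is a minimal sketch with a made-up helper name; the only fact it relies on is that the UTF-8 BOM is the byte sequence EF BB BF.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: true if the stream starts with the UTF-8 BOM
// (bytes EF BB BF). Consumes up to three bytes from the stream.
inline bool has_utf8_bom(std::istream& in) {
    char head[3] = {0, 0, 0};
    in.read(head, 3);
    return in.gcount() == 3 &&
           static_cast<unsigned char>(head[0]) == 0xEF &&
           static_cast<unsigned char>(head[1]) == 0xBB &&
           static_cast<unsigned char>(head[2]) == 0xBF;
}

// Self-check on in-memory data standing in for source files.
inline void check_has_utf8_bom() {
    std::istringstream with_bom(std::string("\xEF\xBB\xBF" "int x;"));
    std::istringstream without_bom(std::string("int x;"));
    assert(has_utf8_bom(with_bom));
    assert(!has_utf8_bom(without_bom));
}
```

For a real file, the same function can be given a std::ifstream opened in binary mode.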

客…行舟 2024-07-16 07:56:03

AFAIK it's not standardized, as you can put any kind of character in wide strings.
You just have to check that your compiler is set to treat the source code as Unicode to make it work right.
