Using Unicode in C++ source code

Posted 2024-07-09 07:56:03

What is the standard encoding of C++ source code? Does the C++ standard even say something about this? Can I write C++ source in Unicode?

For example, can I use non-ASCII characters such as Chinese characters in comments? If so, is full Unicode allowed or just a subset of Unicode? (e.g., that 16-bit first page or whatever it's called.)

Furthermore, can I use Unicode for strings? For example:

std::wstring str = L"Strange chars: âÂ Čšđ ě €€";

伴我老 2024-07-16 07:56:03

Encoding in C++ is quite complicated. Here is my understanding of it.

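Every implementation has to support the characters of the basic source character set. These are the common characters listed in §2.2/1 (§2.3/1 in C++11), and each of them must fit into a single char. In addition, implementations have to support a way of naming other characters, called universal-character-names, which look like \uffff or \Uffffffff and can be used to refer to Unicode characters. A subset of them may be used in identifiers (listed in Annex E).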

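That is all well and good, but the mapping from the characters in the file to source characters (used at compile time) is implementation-defined. That mapping constitutes the encoding used. Here is what the standard says literally (C++98 version):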

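Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. Trigraph sequences (2.3) are replaced by corresponding single-character internal representations. Any source file character not in the basic source character set (2.2) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e. using the \uXXXX notation), are handled equivalently.)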

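For gcc, you can change it with the option -finput-charset=charset. You can also change the execution character set, which is used to represent values at run time. The relevant options are -fexec-charset=charset for char (defaults to UTF-8) and -fwide-exec-charset=charset for wchar_t (defaults to UTF-16 or UTF-32, depending on the size of wchar_t).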

浴红衣 2024-07-16 07:56:03

In addition to litb's post, MSVC++ supports Unicode too. I understand it gets the Unicode encoding from the BOM. It definitely supports code like int (*♫)(); or const std::set<int> ∅;
If you're really into code obfuscation:

typedef void ‼; // Also known as \u203C
class ooɟ {
    operator ‼() {}
};

就是爱搞怪 2024-07-16 07:56:03

The C++ standard doesn't say anything about source-code file encoding, so far as I know.

The usual encoding is (or used to be) 7-bit ASCII -- some compilers (Borland's, for instance) would balk at ASCII characters that used the high-bit. There's no technical reason that Unicode characters can't be used, if your compiler and editor accept them -- most modern Linux-based tools, and many of the better Windows-based editors, handle UTF-8 encoding with no problem, though I'm not sure that Microsoft's compiler will.

EDIT: It looks like Microsoft's compilers will accept Unicode-encoded files, but will sometimes emit warnings for files saved in an 8-bit encoding too:

warning C4819: The file contains a character that cannot be represented
in the current code page (932). Save the file in Unicode format to prevent
data loss.

紫轩蝶泪 2024-07-16 07:56:03

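There are two issues at play here. The first is which characters are allowed in C++ code (and comments), for example in variable names. The second is which characters are allowed in strings and string literals.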

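As noted, C++ compilers must support a very restricted, ASCII-based character set for the characters allowed in code and comments. In practice, this character set did not work well with some European character sets (and especially with some European keyboards that lacked a few characters, such as square brackets), so digraphs and trigraphs were introduced. Many compilers today accept more than this character set, but there is no guarantee.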

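As for strings and string literals, C++ has the concept of a wide character and a wide character string. However, the encoding of that character set is not defined. In practice it is almost always Unicode, but I don't think there is any guarantee here. Wide string literals look like L"string literal", and they can be assigned to a std::wstring.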

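C++11 added explicit support for Unicode strings and string literals, encoded as UTF-8 (u8"..."), UTF-16 (u"...") and UTF-32 (U"..."). The UTF-16 and UTF-32 code units are stored in the platform's native byte order; the language has no separate big-endian and little-endian literal forms.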

浪荡不羁 2024-07-16 07:56:03

For encoding in strings I think you are meant to use the \u notation, e.g.:

std::wstring str = L"\u20AC"; // Euro character

屋檐 2024-07-16 07:56:03

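It's also worth noting that wide characters in C++ aren't really Unicode strings as such. They are just strings of larger characters, usually 16 bits but sometimes 32. This is implementation-defined, though; IIRC you can even have an 8-bit wchar_t. You have no real guarantee about the encoding used in them, so if you are trying to do something like text processing, you will probably want a typedef to the integer type best suited to your Unicode entity.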

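C++11 (still called C++1x when this was written) has additional Unicode support in the form of UTF-8 encoded string literals (u8"text"), UTF-16 and UTF-32 data types (char16_t and char32_t, IIRC), and the corresponding string constants (u"text" and U"text"). However, the encoding of characters specified without \uxxxx or \Uxxxxxxxx escapes is still implementation-defined (and there is no encoding support for complex string types beyond the literals).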