How does file encoding affect C++11 string literals?

Posted on 2024-11-25


You can write UTF-8/16/32 string literals in C++11 by prefixing the string literal with u8/u/U respectively. How must the compiler interpret a UTF-8 file that has non-ASCII characters inside of these new types of string literals? I understand the standard does not specify file encodings, and that fact alone would make the interpretation of non-ASCII characters inside source code completely undefined behavior, making the feature just a tad less useful.

I understand you can still escape single Unicode characters with \uNNNN, but that is not very readable for, say, a full Russian or French sentence, which typically contains more than one such character.

What I understand from various sources is that u should become equivalent to L on current Windows implementations and U on e.g. Linux implementations. So with that in mind, I'm also wondering what the required behavior is for the old string literal modifiers...

For the code-sample monkeys:

std::string    a = u8"L'hôtel de ville doit être là-bas. Ça c'est un fait!";
std::u16string b = u"L'hôtel de ville doit être là-bas. Ça c'est un fait!";
std::u32string c = U"L'hôtel de ville doit être là-bas. Ça c'est un fait!";

In an ideal world, all of these strings produce the same content (as in: characters after conversion), but my experience with C++ has taught me that this is most definitely implementation defined and probably only the first will do what I want.

Comments (3)

执着的年纪 2024-12-02 13:23:59


In GCC, use -finput-charset=charset:

Set the input character set, used for translation from the character set of the input file to the source character set used by GCC. If the locale does not specify, or GCC cannot get this information from the locale, the default is UTF-8. This can be overridden by either the locale or this command line option. Currently the command line option takes precedence if there's a conflict. charset can be any encoding supported by the system's "iconv" library routine.

Also check out the options -fexec-charset and -fwide-exec-charset.
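For example (a sketch; the flag spellings come from the GCC manual, the file names are hypothetical), you could tell GCC explicitly that the source file is UTF-8 and choose the execution encodings for narrow and wide literals:

```shell
# Read main.cpp as UTF-8; store narrow literals as UTF-8
# and wide (L"...") literals as UTF-32 little-endian.
g++ -std=c++11 \
    -finput-charset=UTF-8 \
    -fexec-charset=UTF-8 \
    -fwide-exec-charset=UTF-32LE \
    main.cpp -o main
```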

Finally, about string literals:

char     a[] = "Hello";
wchar_t  b[] = L"Hello";
char16_t c[] = u"Hello";
char32_t d[] = U"Hello";

The size modifier of the string literal (L, u, U) merely determines the type of the literal.

甜`诱少女 2024-12-02 13:23:59


How must the compiler interpret a UTF-8 file that has non-ASCII characters inside of these new types of string literals. I understand the standard does not specify file encodings, and that fact alone would make the interpretation of non-ASCII characters inside source code completely undefined behavior, making the feature just a tad less useful.

From n3290, 2.2 Phases of translation [lex.phases]

Physical source file characters are mapped, in an
implementation-defined manner, to the basic source character set
(introducing new-line characters for end-of-line indicators) if
necessary. The set of physical source file characters accepted is
implementation-defined. [Here's a bit about trigraphs.] Any source
file character not in the basic source character set (2.3) is replaced
by the universal-character-name that designates that character. (An
implementation may use any internal encoding, so long as an actual
extended character encountered in the source file, and the same
extended character expressed in the source file as a
universal-character-name (i.e., using the \uXXXX notation), are
handled equivalently except where this replacement is reverted in a
raw string literal.)

There are a lot of Standard terms being used to describe how an implementation deals with encodings. Here's my attempt at a somewhat simpler, step-by-step description of what happens:

Physical source file characters are mapped, in an
implementation-defined manner, to the basic source character set [...]

The issue of file encodings is handwaved; the Standard only cares about the basic source character set and leaves room for the implementation to get there.

Any source
file character not in the basic source character set (2.3) is replaced
by the universal-character-name that designates that character.

The basic source set is a simple list of allowed characters. It is not ASCII (see further). Anything not in this list is 'transformed' (conceptually at least) to a \uXXXX form.

So no matter what kind of literal or file encoding is used, the source code is conceptually transformed into the basic character set plus a bunch of \uXXXX escapes. I say conceptually because what implementations actually do is usually simpler, e.g. because they can deal with Unicode directly. The important part is that what the Standard calls an extended character (i.e. one not from the basic source set) should be indistinguishable in use from its equivalent \uXXXX form. Note that C++03 is available on e.g. EBCDIC platforms, so reasoning purely in terms of ASCII is flawed from the get-go.

Finally, the process I described happens to (non-raw) string literals too. That means your code is equivalent to what you would get if you had written:

std::string    a = u8"L'h\u00F4tel de ville doit \u00EAtre l\u00E0-bas. \u00C7a c'est un fait!";
std::u16string b = u"L'h\u00F4tel de ville doit \u00EAtre l\u00E0-bas. \u00C7a c'est un fait!";
std::u32string c = U"L'h\u00F4tel de ville doit \u00EAtre l\u00E0-bas. \u00C7a c'est un fait!";
时光倒影 2024-12-02 13:23:59


In principle, questions of encoding only matter when you output your strings by making them visible to humans; that is not a question of how the programming language is defined, since its definition deals only with coding computation. So when you want to know whether what you see in your editor will be the same as what you see in the output (any kind of image, whether on the screen or in a PDF), you should ask which conventions about encoding your user-interaction library and your operating system assume. (Here, for example, is this kind of information for Qt5: with Qt5, what you see as a user of the application and what you see as its programmer coincide if the contents of the old-fashioned string literals for your QStrings are encoded as UTF-8 in your source files, unless you turn on another setting in the course of the application's execution.)

As a conclusion, I think Kerrek SB is right and Damon is wrong: indeed, the way a literal is specified in the code ought to determine its type, not the encoding used in the source file to fill in its contents, because the type of a literal is what matters for the computation done on it. Something like u"string" is just an array of "Unicode code units" (that is, values of type char16_t), whatever the operating system or any other service software later does with them, and however their job looks to you or to another user. You are merely left with the problem of adding another convention for yourself: a correspondence between the "meaning" of the numbers in the computation (namely, that they represent Unicode code points) and their representation on your screen as you work in your text editor. How and whether you as a programmer use that "meaning" is another question, and how you could enforce this other correspondence is naturally implementation-defined, because it has nothing to do with coding computation, only with the comfort of using a tool.
