How does file encoding affect C++11 string literals?
You can write UTF-8/16/32 string literals in C++11 by prefixing the string literal with u8/u/U respectively. How must the compiler interpret a UTF-8 file that has non-ASCII characters inside of these new types of string literals? I understand the standard does not specify file encodings, and that fact alone would make the interpretation of non-ASCII characters inside source code completely undefined behavior, making the feature just a tad less useful.
I understand you can still escape single unicode characters with \uNNNN, but that is not very readable for, say, a full Russian or French sentence, which typically contains more than one unicode character.
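(Just as an illustration, assuming é is spelled as its code point U+00E9: even a single short French word reads quite differently once escaped.)

#include <string>

std::string readable = u8"été";            // relies on the compiler decoding the source file correctly
std::string escaped  = u8"\u00E9t\u00E9";  // the same content spelled with universal-character-names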
What I understand from various sources is that u should become equivalent to L on current Windows implementations and U on e.g. Linux implementations. So with that in mind, I'm also wondering what the required behavior is for the old string literal modifiers...
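(A quick, purely illustrative check of that platform difference follows; none of the sizes it prints are guaranteed by the standard.)

#include <cstdio>

int main() {
    // Typically 2 bytes on Windows toolchains (so L"..." holds UTF-16 code units)
    // and 4 bytes on Linux toolchains (so L"..." holds UTF-32 code units).
    std::printf("wchar_t: %zu bytes, char16_t: %zu bytes, char32_t: %zu bytes\n",
                sizeof(wchar_t), sizeof(char16_t), sizeof(char32_t));
}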
For the code-sample monkeys:
std::string    utf8string  = u8"L'hôtel de ville doit être là-bas. Ça c'est un fait!";
std::u16string utf16string = u"L'hôtel de ville doit être là-bas. Ça c'est un fait!";
std::u32string utf32string = U"L'hôtel de ville doit être là-bas. Ça c'est un fait!";
In an ideal world, all of these strings produce the same content (as in: characters after conversion), but my experience with C++ has taught me that this is most definitely implementation defined and probably only the first will do what I want.
3 Answers
In GCC, use -finput-charset=charset to tell the compiler what encoding the source file is in. Also check out the options -fexec-charset and -fwide-exec-charset. Finally, about string literals: the size modifier of the string literal (L, u, U) merely determines the type of the literal.
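(A small compile-time sketch of that last point, assuming a C++11/14 compiler; in C++20 the u8 prefix switches to char8_t.)

#include <type_traits>

// The prefix only selects the element type of the array the literal denotes;
// "x" is one character plus the terminating NUL, hence the [2].
static_assert(std::is_same<decltype("x"),   const char(&)[2]>::value,     "narrow");
static_assert(std::is_same<decltype(u8"x"), const char(&)[2]>::value,     "u8 (char8_t in C++20)");
static_assert(std::is_same<decltype(u"x"),  const char16_t(&)[2]>::value, "u");
static_assert(std::is_same<decltype(U"x"),  const char32_t(&)[2]>::value, "U");
static_assert(std::is_same<decltype(L"x"),  const wchar_t(&)[2]>::value,  "L");

int main() {}  // nothing to run; the checks above are purely compile-time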
From n3290, 2.2 Phases of translation [lex.phases]
There are a lot of Standard terms being used to describe how an implementation deals with encodings. Here's my attempt at a somewhat simpler, step-by-step description of what happens:

The issue of file encodings is handwaved; the Standard only cares about the basic source character set and leaves room for the implementation to get there.

The basic source set is a simple list of allowed characters. It is not ASCII (see further). Anything not in this list is 'transformed' (conceptually at least) to a \uXXXX form.

So no matter what kind of literal or file encoding is used, the source code is conceptually transformed into the basic character set plus a bunch of \uXXXX. I say "conceptually" because what implementations actually do is usually simpler, e.g. because they can deal with Unicode directly. The important part is that what the Standard calls an extended character (i.e. one not from the basic source set) should be indistinguishable in use from its equivalent \uXXXX form.

Note that C++03 is available on e.g. EBCDIC platforms, so your reasoning in terms of ASCII is flawed from the get-go.

Finally, the process I described also happens to (non-raw) string literals. That means your code is equivalent to what you would have gotten had you written:
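(A sketch of that escaped form for the question's strings; the universal-character-names below are U+00F4, U+00EA, U+00E0 and U+00C7 for ô, ê, à and Ç respectively.)

#include <string>

std::string    utf8string  = u8"L'h\u00F4tel de ville doit \u00EAtre l\u00E0-bas. \u00C7a c'est un fait!";
std::u16string utf16string = u"L'h\u00F4tel de ville doit \u00EAtre l\u00E0-bas. \u00C7a c'est un fait!";
std::u32string utf32string = U"L'h\u00F4tel de ville doit \u00EAtre l\u00E0-bas. \u00C7a c'est un fait!";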
In principle, questions of encoding only matter when you output your strings by making them visible to humans, which is not a question of how the programming language is defined, since its definition deals only with coding computation. So, when you decide whether what you see in your editor is going to be the same as what you see in the output (any kind of image, be it on the screen or in a PDF), you should ask yourself which encoding conventions your user-interaction library and your operating system assume. (For example, here is how this plays out for Qt5: with Qt5, what you see as a user of the application and what you see as its programmer coincide if the contents of the old-fashioned string literals for your QStrings are encoded as UTF-8 in your source files, unless you turn on another setting in the course of the application's execution.)
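(A tiny illustration of that point, assuming a C++11/14 compiler since u8 literals change type in C++20: the program simply writes the literal's UTF-8 bytes to stdout, and whether they show up as the intended characters depends entirely on what the terminal assumes, not on C++.)

#include <cstdio>

int main() {
    // The literal is a fixed sequence of bytes; whether it renders correctly is
    // decided by the terminal's encoding, not by the language definition.
    std::puts(u8"Ça c'est un fait!");
}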
As a conclusion, I think Kerrek SB is right and Damon is wrong: indeed, the way a literal is specified in the code ought to determine its type, not the encoding used in the source file for filling its contents, because the type of a literal is what matters for the computation done on it. Something like u"string" is just an array of “Unicode code units” (that is, values of type char16_t), whatever the operating system or any other service software later does with them and however their work appears to you or to another user. You are simply left with the additional task of adopting another convention for yourself, one that establishes a correspondence between the “meaning” of the numbers under computation (namely, that they represent Unicode codes) and their representation on your screen as you work in your text editor. How and whether you, as a programmer, use that “meaning” is another question, and how you could enforce that other correspondence is naturally going to be implementation-defined, because it has nothing to do with coding computation, only with the comfort of using a tool.
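(A minimal sketch of that view: printing the char16_t code units as plain numbers involves no notion of display encoding at all. The expected first two values, U+00C7 and U+0061, assume the compiler decoded the source file as intended.)

#include <cstdio>

int main() {
    // u"..." is merely an array of char16_t code units terminated by 0;
    // what those numbers "look like" on a screen is a separate concern.
    for (const char16_t* p = u"Ça c'est un fait!"; *p != 0; ++p)
        std::printf("U+%04X ", static_cast<unsigned>(*p));
    std::printf("\n");
}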