Internal and external encoding vs. Unicode
Several posters spread a lot of misinformation in the comments on the question "C++ ABI issues list", so I have created this one to clarify:
- What are the encodings used for C style strings?
- Is Linux using UTF-8 to encode strings?
- How does external encoding relate to the encoding used by narrow and wide strings?
Comments (2)
What are the encodings used for C style strings?
Implementation defined. Or even application defined; the standard doesn't really put any restrictions on what an application does with them, and expects a lot of the behavior to depend on the locale. All that is really implementation defined is the encoding used in string literals.
Is Linux using UTF-8 to encode strings?
In what sense? Most of the OS ignores most of the encodings; you'll have problems if '\0' isn't a nul byte, but even EBCDIC meets that requirement. Otherwise, depending on the context, there will be a few additional characters which may be significant (a '/' in path names, for example); all of these are among the first 128 code points in Unicode, so they have a single-byte encoding in UTF-8. As an example, I've used both UTF-8 and ISO 8859-1 for filenames under Linux. The only real issue is displaying them: if you do ls in an xterm, for example, ls and the xterm will assume that the filenames are in the same encoding as the display font.
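The point about significant characters can be sketched in a few lines of C++. This is only an illustration, not anything the OS itself runs: the function names count_separators and count_high_bytes are hypothetical, and the file names in the usage note are made up. It relies on the fact that every byte of a multi-byte UTF-8 sequence has its high bit set, so a byte-wise scan for '/' gives the same answer whether the name is UTF-8 or ISO 8859-1.

```cpp
#include <cstddef>
#include <string>

// Counts path separators by scanning raw bytes, with no knowledge of the
// string's encoding. In UTF-8 every byte of a multi-byte sequence is >= 0x80,
// so a byte equal to '/' (0x2F) can only be a real '/'.
std::size_t count_separators(const std::string& path) {
    std::size_t n = 0;
    for (unsigned char byte : path) {
        if (byte == '/') ++n;
    }
    return n;
}

// Counts bytes with the high bit set: bytes that belong to multi-byte UTF-8
// sequences, or to non-ASCII characters in a single-byte encoding such as
// ISO 8859-1.
std::size_t count_high_bytes(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char byte : s) {
        if (byte >= 0x80) ++n;
    }
    return n;
}
```

For instance, the name "café/naïve.txt" written as UTF-8 bytes ("caf\xC3\xA9" "/" "na\xC3\xAF" "ve.txt") and as ISO 8859-1 bytes ("caf\xE9" "/" "na\xEF" "ve.txt") both contain exactly one separator; only the number of high bytes differs.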
How does external encoding relate to the encoding used by narrow and wide strings?
That mainly depends on the locale. Depending on the locale, it's quite possible for the internal encoding of a narrow character string not to correspond to that used for string literals. (But how could it be otherwise, since the encoding of a string literal must be determined at compile time, whereas the internal encoding for narrow character strings depends on the locale used to read it, and can vary from one string to the next?)
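To make this concrete, here is a small sketch showing that the same bytes in a narrow string mean different things depending on which encoding the program assumes. The function names are hypothetical, and the sketch deliberately avoids the real locale machinery (which depends on which locales are installed); it only counts characters under two fixed interpretations.

```cpp
#include <cstddef>
#include <string>

// Under ISO 8859-1 every byte is exactly one character.
std::size_t length_as_latin1(const std::string& bytes) {
    return bytes.size();
}

// Under UTF-8 a character is one lead byte followed by zero or more
// continuation bytes of the form 10xxxxxx. Counting the bytes that are NOT
// continuation bytes gives the number of characters, assuming the input is
// valid UTF-8 (no validation is done here).
std::size_t length_as_utf8(const std::string& bytes) {
    std::size_t n = 0;
    for (unsigned char b : bytes) {
        if ((b & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

The two bytes "\xC3\xA9" are one character (é) if read as UTF-8 but two characters (Ã and ©) if read as ISO 8859-1. Nothing in the bytes themselves says which reading is right; that comes from the locale or from a convention the application sets.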
If you're developing a new application in Linux, I would strongly recommend using Unicode for everything, with UTF-32 for wide character strings, and UTF-8 for narrow character strings. But don't count on anything outside the first 128 code points working in string literals.
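A minimal sketch of that recommendation, under some assumptions: it uses std::u32string (C++11) to stand in for a UTF-32 wide string rather than wchar_t, the function name utf8_to_utf32 is hypothetical, and the decoder is deliberately incomplete (it does not reject overlong forms or surrogate code points). The literal at the end spells a non-ASCII character with hex escapes so that the source file itself stays within the first 128 code points.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decodes UTF-8 into UTF-32 code points. Minimal sketch: it rejects bad lead
// bytes, truncated sequences and bad continuation bytes, but does not check
// for overlong forms or surrogate code points.
std::u32string utf8_to_utf32(const std::string& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size();) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes that follow
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra > in.size() - i - 1)
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);  // append the 6 payload bits
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}

// U+00E9 (e with acute accent) spelled with hex escapes, so the source file
// stays pure ASCII regardless of how the compiler reads it.
const std::string e_acute_utf8 = "\xC3\xA9";
```

Decoding e_acute_utf8 yields a single code point, 0xE9; the three bytes "\xE2\x82\xAC" yield 0x20AC (the euro sign).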
[...] UTF-32 for (wchar_t) and ASCII for (char). Note that ASCII is just a 7-bit encoding. Windows used UCS-2 until Windows 2000; later versions use the variable-width encoding UTF-16 (for wchar_t).
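The difference between a wchar_t wide enough for UTF-32 and a 16-bit wchar_t holding UTF-16 can be sketched as follows. The names wchar_holds_any_code_point and encode_utf16 are hypothetical, and the encoder assumes its argument is a valid code point (at most 0x10FFFF and not itself a surrogate).

```cpp
#include <cwchar>   // WCHAR_MAX
#include <string>

// True when a single wchar_t can hold any Unicode code point (for example a
// 32-bit wchar_t, as with glibc on Linux); false when wchar_t is 16 bits (as
// on Windows) and code points above U+FFFF need two units.
constexpr bool wchar_holds_any_code_point = WCHAR_MAX >= 0x10FFFF;

// Encodes one code point as UTF-16. Code points below U+10000 take a single
// 16-bit unit, which is all UCS-2 could express; the rest take a surrogate
// pair, which is what makes UTF-16 a variable-width encoding.
std::u16string encode_utf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        cp -= 0x10000;  // 20 bits remain, split 10/10 below
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));  // low surrogate
    }
    return out;
}
```

For example, U+00E9 encodes as the single unit 0x00E9, while U+1F600 encodes as the pair 0xD83D, 0xDE00.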