Internal and external encoding vs. Unicode

Posted on 2024-12-05 21:06:11

Since a lot of misinformation was spread by several posters in the comments on this question: C++ ABI issues list

I have created this one to clarify.

  1. What are the encodings used for C style strings?
  2. Is Linux using UTF-8 to encode strings?
  3. How does external encoding relate to the encoding used by narrow and wide strings?

Comments (2)

淡看悲欢离合 2024-12-12 21:06:11
  1. Implementation-defined, or even application-defined: the standard
    doesn't really put any restrictions on what an application does with
    them, and expects a lot of the behavior to depend on the locale. All
    that is really implementation-defined is the encoding used in string
    literals.

  2. In what sense? Most of the OS ignores most of the encodings; you'll
    have problems if '\0' isn't a NUL byte, but even EBCDIC meets that
    requirement. Otherwise, depending on the context, there will be a few
    additional characters which may be significant (a '/' in path names,
    for example); all of these are among the first 128 code points in
    Unicode, so they have a single-byte encoding in UTF-8. As an example,
    I've used both UTF-8 and ISO 8859-1 for filenames under Linux. The only
    real issue is displaying them: if you do ls in an xterm, for example,
    ls and the xterm will assume that the filenames are in the same
    encoding as the display font.

  3. That mainly depends on the locale. Depending on the locale, it's
    quite possible for the internal encoding of a narrow character string not to
    correspond to that used for string literals. (But how could it be
    otherwise? The encoding of a string literal must be determined at
    compile time, whereas the internal encoding of a narrow character
    string depends on the locale used to read it, and can vary from one
    string to the next.)
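
Point 2 can be illustrated with a minimal C++17 sketch. It assumes nothing beyond the UTF-8 byte layout: every byte of a multi-byte sequence has its high bit set, so a byte such as '/' or '\0' can only ever stand for itself. The names `Utf8Byte`, `classify`, and `count_separators` are hypothetical and not part of any library.

```cpp
#include <cstddef>
#include <string_view>

// Role of a single byte within a UTF-8 stream (illustrative names).
enum class Utf8Byte { Ascii, Lead, Continuation, Invalid };

constexpr Utf8Byte classify(unsigned char b) {
    if (b < 0x80) return Utf8Byte::Ascii;               // 0xxxxxxx: one of the first 128 code points
    if (b < 0xC0) return Utf8Byte::Continuation;        // 10xxxxxx: inside a multi-byte sequence
    if (b >= 0xC2 && b <= 0xF4) return Utf8Byte::Lead;  // starts a 2- to 4-byte sequence
    return Utf8Byte::Invalid;                           // 0xC0, 0xC1, 0xF5..0xFF never appear in UTF-8
}

// Counts path separators by scanning raw bytes, without decoding anything.
// This works for UTF-8 and ISO 8859-1 alike, because in both encodings the
// byte 0x2F can only mean '/'.
constexpr std::size_t count_separators(std::string_view path) {
    std::size_t n = 0;
    for (char c : path) {
        if (c == '/') ++n;
    }
    return n;
}
```

For example, `count_separators("/home/caf\xC3\xA9/x")` scans a UTF-8 path containing a two-byte character and still finds exactly the three '/' bytes, which is roughly what an encoding-agnostic kernel does with path names.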

If you're developing a new application in Linux, I would strongly
recommend using Unicode for everything, with UTF-32 for wide character
strings, and UTF-8 for narrow character strings. But don't count on
anything outside the first 128 code points working in string
literals.
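
As a rough sketch of that recommendation (C++17), the code below converts a UTF-8 narrow string into UTF-32. The helpers `Decoded`, `decode_one`, and `utf8_to_utf32` are hypothetical names introduced here for illustration; the checks follow the standard UTF-8 well-formedness rules (no overlong forms, no surrogates, nothing above U+10FFFF).

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <string_view>

// Result of decoding one code point; length == 0 signals malformed input.
struct Decoded {
    char32_t code_point;
    std::size_t length;
};

// Decodes the UTF-8 sequence at the start of `s` (hypothetical helper).
constexpr Decoded decode_one(std::string_view s) {
    if (s.empty()) return {0, 0};
    const auto b0 = static_cast<unsigned char>(s[0]);
    if (b0 < 0x80) return {b0, 1};  // first 128 code points: a single byte

    std::size_t len = 0;
    char32_t cp = 0;
    char32_t min = 0;  // smallest value that legitimately needs `len` bytes
    if (b0 >= 0xC2 && b0 <= 0xDF)      { len = 2; cp = b0 & 0x1F; min = 0x80; }
    else if (b0 >= 0xE0 && b0 <= 0xEF) { len = 3; cp = b0 & 0x0F; min = 0x800; }
    else if (b0 >= 0xF0 && b0 <= 0xF4) { len = 4; cp = b0 & 0x07; min = 0x10000; }
    else return {0, 0};                // stray continuation byte or invalid lead byte
    if (s.size() < len) return {0, 0}; // truncated sequence

    for (std::size_t i = 1; i < len; ++i) {
        const auto b = static_cast<unsigned char>(s[i]);
        if ((b & 0xC0) != 0x80) return {0, 0};  // not a continuation byte
        cp = (cp << 6) | (b & 0x3F);
    }
    // Reject overlong encodings, surrogates, and values beyond U+10FFFF.
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return {0, 0};
    return {cp, len};
}

// Converts a whole UTF-8 narrow string to UTF-32; throws on malformed input.
inline std::u32string utf8_to_utf32(std::string_view s) {
    std::u32string out;
    while (!s.empty()) {
        const Decoded d = decode_one(s);
        if (d.length == 0) throw std::invalid_argument("malformed UTF-8");
        out.push_back(d.code_point);
        s.remove_prefix(d.length);
    }
    return out;
}
```

A call such as `utf8_to_utf32("caf\xC3\xA9")` spells the non-ASCII character as explicit byte escapes, so the example does not depend on how the compiler encodes characters outside the first 128 code points in a string literal.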

天暗了我发光 2024-12-12 21:06:11
  1. This depends on the architecture. Most Unix architectures use UTF-32 for wide strings (wchar_t) and ASCII for narrow strings (char). Note that ASCII is only a 7-bit encoding. Windows used UCS-2 until Windows 2000; later versions use the variable-width encoding UTF-16 (for wchar_t).
  2. No. Most system calls on Linux are encoding-agnostic (they don't care what the encoding is, since they don't interpret it in any way). The external encoding is actually defined by your current locale.
  3. The internal encoding used by narrow and wide strings is fixed; it does not change when the locale changes. By changing the locale you change the translation functions that encode and decode data entering or leaving your program (assuming you stick with the standard C text functions).
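
One of the standard translation functions that point 3 refers to is `std::mbstowcs`, which interprets narrow bytes according to the `LC_CTYPE` category of the current C locale. The sketch below (C++17) wraps it in a hypothetical helper named `widen`; it is an illustration under that assumption, not the only way to do the conversion.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <stdexcept>
#include <string>

// Converts a narrow multibyte string to a wide string, interpreting the
// bytes according to the current C locale (hypothetical helper).
inline std::wstring widen(const std::string& narrow) {
    // A multibyte string never yields more wide characters than it has bytes.
    std::wstring wide(narrow.size(), L'\0');
    const std::size_t n = std::mbstowcs(wide.data(), narrow.c_str(), wide.size());
    if (n == static_cast<std::size_t>(-1)) {
        throw std::runtime_error("byte sequence is invalid in the current locale");
    }
    assert(n <= wide.size());  // mbstowcs writes at most wide.size() elements
    wide.resize(n);
    return wide;
}
```

A program starts in the "C" locale; calling `std::setlocale(LC_ALL, "")` (from `<clocale>`) once at startup adopts the user's environment instead. The same input bytes passed to `widen` may then convert differently, or be rejected, depending on which locale is active.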