Where can I find a table of all the characters for every C99 character set?

Posted on 2024-09-27 12:52:29

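I'm looking for a table (or a way to generate one) of every character in each of the following C character sets:

basic character set
basic execution character set
basic source character set
execution character set
extended character set
source character set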

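C99 mentions all six of these under section 5.2.1. However, I find it very cryptic to read and lacking in detail.

The only character sets it clearly defines are the basic execution character set and the basic source character set: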

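52 uppercase and lowercase letters of the Latin alphabet:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
a b c d e f g h i j k l m n o p q r s t u v w x y z
10 decimal digits:
0 1 2 3 4 5 6 7 8 9
29 graphic characters:
! " # % & ' ( ) * + , - . / : ; < = > ? [ \ ] ^ _ { | } ~
4 whitespace characters:
space, horizontal tab, vertical tab, form feed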

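I'm guessing these are the same as the basic character set, although C99 does not say so explicitly. The remaining character sets are a bit of a mystery to me.

Any help you can offer would be appreciated :)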


深海蓝天 2024-10-04 12:52:29

Except for the Basic Character Set as you mentioned, all of the rest of the character sets are implementation-defined. That means that they could be anything, but the implementation (that is, the C compiler/libraries/toolchain implementation) must document those decisions. The key paragraphs here are:

§3.4.1 implementation-defined behavior
unspecified behavior where each implementation documents how the choice is made
§3.4.2 locale-specific behavior
behavior that depends on local conventions of nationality, culture, and language that each implementation documents
§5.2.1 Character sets, paragraph 1
Two sets of characters and their associated collating sequences shall be defined: the set in which source files are written (the source character set), and the set interpreted in the execution environment (the execution character set). Each set is further divided into a basic character set, whose contents are given by this subclause, and a set of zero or more locale-specific members (which are not members of the basic character set) called extended characters. The combined set is also called the extended character set. The values of the members of the execution character set are implementation-defined.

So, look at your C compiler's documentation to find out what the other character sets are. For example, in my man page for gcc, some of the command line options state:

   -fexec-charset=charset
       Set the execution character set, used for string and character
       constants.  The default is UTF-8.  charset can be any encoding
       supported by the system's "iconv" library routine.

   -fwide-exec-charset=charset
       Set the wide execution character set, used for wide string and
       character constants.  The default is UTF-32 or UTF-16, whichever
       corresponds to the width of "wchar_t".  As with -fexec-charset,
       charset can be any encoding supported by the system's "iconv"
       library routine; however, you will have problems with encodings
       that do not fit exactly in "wchar_t".

   -finput-charset=charset
       Set the input character set, used for translation from the
       character set of the input file to the source character set used by
       GCC.  If the locale does not specify, or GCC cannot get this
       information from the locale, the default is UTF-8.  This can be
       overridden by either the locale or this command line option.
       Currently the command line option takes precedence if there's a
       conflict.  charset can be any encoding supported by the system's
       "iconv" library routine.

To get a list of the encodings supported by iconv, run iconv -l. My system has 143 different encodings to choose from.
