Where can I find a table of all the characters for every C99 character set?
I'm looking for a table (or a way to generate one) for every character in each of the following C Character Sets:
- Basic Character Set
- Basic Execution Character Set
- Basic Source Character Set
- Execution Character Set
- Extended Character Set
- Source Character Set
C99 mentions all six of these in section 5.2.1. However, I've found it extremely cryptic to read and lacking in detail.
The only character sets that it clearly defines are the Basic Execution Character Set and the Basic Source Character Set:
52 upper- and lower-case letters in the Latin alphabet:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
a b c d e f g h i j k l m n o p q r s t u v w x y z
Ten decimal digits:
0 1 2 3 4 5 6 7 8 9
29 graphic characters:
! " # % & ' ( ) * + , - . / : ; < = > ? [ \ ] ^ _ { | } ~
4 whitespace characters:
space, horizontal tab, vertical tab, form feed
I believe these are the same as the Basic Character Set, though I'm guessing, since C99 does not state this explicitly. The remaining character sets are a bit of a mystery to me.
Thanks for any help you can offer! :)
Except for the Basic Character Set as you mentioned, all of the rest of the character sets are implementation-defined. That means that they could be anything, but the implementation (that is, the C compiler/libraries/toolchain implementation) must document those decisions. The key paragraphs here are:
So, look at your C compiler's documentation to find out what the other character sets are. For example, in my man page for gcc, some of the command line options state:
To get a list of the encodings supported by iconv, run iconv -l. My system has 143 different encodings to choose from.
As far as I can see, the standard doesn't talk about a basic character set as something distinct from the source character set and the execution character set. The standard lays out that there are two character sets it's concerned with: the source character set and the execution character set. Each of these has a 'basic' and an 'extended' component (and the extended component of either can be the empty set).
You have a "source character set" that is composed of a "basic source character set" and zero or more "extended characters". The combination of the basic source character set and those extended characters is called the extended source character set.
The same goes for the execution character set (there's a basic execution character set that, combined with zero or more extended characters, makes up the extended execution character set).
The standard (and your question) enumerates the characters that must be in the basic character set - there can be other characters in the basic set.
As for the difference between the basic 'range' and the extended 'range' of each character set, the values of the members of the basic character set must fit within a byte - that restriction doesn't hold for the extended characters. Also note that this doesn't necessarily mean that the source file encoding must be a single-byte encoding.
The values of characters in the source character set do not need to agree with the values in the execution character set (for example, the source character set might be ASCII, while the execution character set might be EBCDIC).
You might have a look at GNU iconv. Among many others, it will print or convert both Java and C99 strings.
iconv is a command line interface to libiconv which, very likely, is what your C99 compiler is using internally for these character conversions.
Type iconv -l to see which encodings are available on your system. You will need to recompile from source to change that set.
On OS X, I have 141 character sets. On Ubuntu, I have 1,168 character sets (with most of those being aliases).