Where can I find a table of all the characters for every C99 Character Set?

Posted 2024-09-27 12:52:29

I'm looking for a table (or a way to generate one) for every character in each of the following C Character Sets:

  • Basic Character Set
  • Basic Execution Character Set
  • Basic Source Character Set
  • Execution Character Set
  • Extended Character Set
  • Source Character Set

C99 mentions all six of these under section 5.2.1. However, I've found it extremely cryptic to read and lacking in detail.

The only character sets that it clearly defines are the Basic Execution Character Set and the Basic Source Character Set:

52 upper- and lower-case letters in the Latin alphabet:

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

a b c d e f g h i j k l m n o p q r s t u v w x y z

Ten decimal digits:

0 1 2 3 4 5 6 7 8 9

29 graphic characters:

! " # % & ' ( ) * + , - . / : ; < = > ? [ \ ] ^ _ { | } ~

4 whitespace characters:

space, horizontal tab, vertical tab, form feed

I believe these are the same as the Basic Character Set, though that is a guess, since C99 does not explicitly state this. The remaining Character Sets are a bit of a mystery to me.

Thanks for any help you can offer! :)

Comments (3)

深海蓝天 2024-10-04 12:52:29

Except for the Basic Character Set as you mentioned, all of the rest of the character sets are implementation-defined. That means that they could be anything, but the implementation (that is, the C compiler/libraries/toolchain implementation) must document those decisions. The key paragraphs here are:

§3.4.1 implementation-defined behavior
unspecified behavior where each implementation documents how the choice is made

§3.4.2 locale-specific behavior
behavior that depends on local conventions of nationality, culture, and language that each implementation documents

§5.2.1 Character sets, paragraph 1
Two sets of characters and their associated collating sequences shall be defined: the set in which source files are written (the source character set), and the set interpreted in the execution environment (the execution character set). Each set is further divided into a basic character set, whose contents are given by this subclause, and a set of zero or more locale-specific members (which are not members of the basic character set) called extended characters. The combined set is also called the extended character set. The values of the members of the execution character set are implementation-defined.

So, look at your C compiler's documentation to find out what the other character sets are. For example, in my man page for gcc, some of the command line options state:

   -fexec-charset=charset
       Set the execution character set, used for string and character
       constants.  The default is UTF-8.  charset can be any encoding
       supported by the system's "iconv" library routine.

   -fwide-exec-charset=charset
       Set the wide execution character set, used for wide string and
       character constants.  The default is UTF-32 or UTF-16, whichever
       corresponds to the width of "wchar_t".  As with -fexec-charset,
       charset can be any encoding supported by the system's "iconv"
       library routine; however, you will have problems with encodings
       that do not fit exactly in "wchar_t".

   -finput-charset=charset
       Set the input character set, used for translation from the
       character set of the input file to the source character set used by
       GCC.  If the locale does not specify, or GCC cannot get this
       information from the locale, the default is UTF-8.  This can be
       overridden by either the locale or this command line option.
       Currently the command line option takes precedence if there's a
       conflict.  charset can be any encoding supported by the system's
       "iconv" library routine.

To get a list of the encodings supported by iconv, run iconv -l. My system has 143 different encodings to choose from.
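
Because the values are implementation-defined, one way to get the table the question asks for is to let the compiler report them. The following is a minimal sketch, not a definitive tool: the names `basic_execution`, `label_for`, and `print_basic_execution_table` are made up for this example, and the list includes the control characters (alert, backspace, carriage return, new-line, null) that C99 5.2.1 adds to the basic execution character set.

```c
#include <stdio.h>
#include <stddef.h>

/* Members of the basic execution character set named in C99 5.2.1:
   52 letters, 10 digits, 29 graphic characters, space, three whitespace
   controls, and four further controls. The string literal's terminating
   null character is the 100th member. */
static const char basic_execution[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789"
    "!\"#%&'()*+,-./:;<=>?[\\]^_{|}~"
    " \t\v\f"
    "\a\b\r\n";

/* Return a readable label for characters that do not print visibly,
   or NULL when the character can be printed as itself. */
static const char *label_for(char c)
{
    switch (c) {
    case '\0': return "null";
    case '\a': return "alert";
    case '\b': return "backspace";
    case '\t': return "horizontal tab";
    case '\n': return "new-line";
    case '\v': return "vertical tab";
    case '\f': return "form feed";
    case '\r': return "carriage return";
    case ' ':  return "space";
    default:   return NULL;
    }
}

/* Write one line per member: its value in this implementation's execution
   character set (decimal and hex), then the character or its label.
   Returns the number of members written. */
size_t print_basic_execution_table(FILE *out)
{
    size_t count = sizeof basic_execution; /* includes the null character */
    for (size_t i = 0; i < count; i++) {
        char c = basic_execution[i];
        unsigned value = (unsigned char)c;
        const char *label = label_for(c);
        if (label != NULL)
            fprintf(out, "%3u  0x%02X  %s\n", value, value, label);
        else
            fprintf(out, "%3u  0x%02X  %c\n", value, value, c);
    }
    return count;
}
```

Calling `print_basic_execution_table(stdout)` from `main` prints the values chosen by the compiler that built the program. On an ASCII-compatible implementation `A` appears as 65; on an EBCDIC implementation it would appear as 193.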

盗梦空间 2024-10-04 12:52:29

As far as I can see, the standard doesn't talk about a basic character set as something distinct from the source character set and the execution character set. The standard says it is concerned with two character sets: the source character set and the execution character set. Each of these has a 'basic' and an 'extended' component (and the extended component of either can be the empty set).

You have a "source character set" that is comprised of a "basic source character set" and zero or more "extended characters". The combination of the basic source character set and those extended characters is called the extended source character set.

Similarly for the execution character set: there's a basic execution character set that, combined with zero or more extended characters, makes up the extended execution character set.

The standard (and your question) enumerates characters that must be in the basic character set; there can be other characters in the basic set.

As for the difference between the basic 'range' and the extended 'range' of each character set, the values of the members of the basic character set must fit within a byte; that restriction doesn't hold for the extended characters. Also note that this doesn't necessarily mean that the source file encoding must be a single-byte encoding.

The values of characters in the source character sets do not need to agree with the values in the execution character sets (for example, the source character set might be comprised of ASCII, while the execution character set might be EBCDIC).
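
Since the execution values need not be ASCII, a program can check rather than assume. The sketch below compares each printable basic character, as the compiler encoded it, against its ASCII code. The names `ascii_runs` and `execution_charset_matches_ascii` are invented for this illustration, and the check covers only the printable members, not the control characters.

```c
#include <stdbool.h>
#include <stddef.h>

/* Printable basic characters grouped into runs that are contiguous in ASCII.
   The runs break at 0x24 '$', 0x40 '@' and 0x60 '`', which ASCII has but the
   basic character set does not require. */
static const struct {
    int ascii_start;   /* ASCII code of the first character in the run */
    const char *chars; /* the run, encoded in the execution character set */
} ascii_runs[] = {
    { 0x20, " !\"#" },
    { 0x25, "%&'()*+,-./0123456789:;<=>?" },
    { 0x41, "ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_" },
    { 0x61, "abcdefghijklmnopqrstuvwxyz{|}~" },
};

/* Return true if every printable basic character has its ASCII value in
   this implementation's execution character set. */
bool execution_charset_matches_ascii(void)
{
    for (size_t r = 0; r < sizeof ascii_runs / sizeof ascii_runs[0]; r++) {
        const char *run = ascii_runs[r].chars;
        for (size_t i = 0; run[i] != '\0'; i++) {
            if ((unsigned char)run[i] != ascii_runs[r].ascii_start + (int)i)
                return false;
        }
    }
    return true;
}
```

With an ASCII or UTF-8 execution character set the function returns true; an EBCDIC-based implementation would return false, because the same source characters are stored with different values.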

執念 2024-10-04 12:52:29

You might have a look at GNU iconv. Among many other encodings, it will print or convert both Java and C99 strings. iconv is a command-line interface to libiconv which, very likely, is what your C99 compiler is using internally for these character conversions.

Type iconv -l to see which encodings are available on your system. You will need to recompile from source to change that set.

On OS X, I have 141 character sets. On Ubuntu, I have 1,168 character sets (with most of those being aliases).
