Why are C character literals ints instead of chars?
In C++, sizeof('a') == sizeof(char) == 1. This makes intuitive sense, since 'a' is a character literal, and sizeof(char) == 1 as defined by the standard.
In C, however, sizeof('a') == sizeof(int). That is, it appears that C character literals are actually integers. Does anyone know why? I can find plenty of mentions of this C quirk but no explanation for why it exists.
I don't know the specific reason why a character literal in C is of type int. But in C++, there is a good reason not to go that way. Consider two overloads of print, the first taking an int and the second taking a char, and a call print('a').
You would expect the call to select the second version, the one taking a char. If a character literal were an int, that would be impossible. Note that in C++, literals with more than one character still have type int, although their value is implementation-defined. So 'ab' has type int, while 'a' has type char.
Using GCC on my MacBook, I checked this with sizeof. The results suggest that a char is 8 bits, as you suspect, but a character literal is an int.
Back when C was being written, the PDP-11's MACRO-11 assembly language had:
This kind of thing is quite common in assembly language: the low 8 bits hold the character code and the other bits are cleared to 0. The PDP-11 even had:
This provided a convenient way to load two characters into the low and high bytes of a 16-bit register. You might then write those elsewhere, updating some textual data or screen memory.
So, the idea of characters being promoted to register size is quite normal and desirable. But, let's say you need to get 'A' into a register not as part of the hard-coded opcode, but from somewhere in main memory containing:
If you want to read just an 'A' from this main memory into a register, which one would you read?
Some CPUs may only directly support reading a 16-bit value into a 16-bit register, which would mean that a read at 20 or 22 would then require the bits from 'X' to be cleared out, and depending on the endianness of the CPU one or the other would need shifting into the low-order byte.
Some CPUs may require a memory-aligned read, which means that the lowest address involved must be a multiple of the data size: you might be able to read from addresses 24 and 25, but not 27 and 28.
So, a compiler generating code to get an 'A' into the register may prefer to waste a little extra memory and encode the value as 0 'A' or 'A' 0, depending on endianness, while also ensuring it is aligned properly (i.e. not at an odd memory address).
My guess is that C simply carried this level of CPU-centric behaviour over, thinking of character constants as occupying register-sized units of memory, which bears out the common assessment of C as a "high-level assembler".
(See section 6.3.3 on page 6-25 of the PDP-11 MACRO-11 Language Reference Manual.)
I remember reading K&R and seeing a code snippet that read one character at a time until it hit EOF. Since every character value is valid in a file or input stream, EOF cannot be any char value. The code stored each character it read in an int, tested it for EOF, and converted it to a char if it wasn't.
I realize this doesn't exactly answer your question, but it would make some sense for the rest of the character literals to be sizeof(int) if the EOF literal was.
I haven't seen a rationale for it (C char literals being int types), but here's something Stroustrup had to say about it (from The Design and Evolution of C++, section 11.2.1, Fine-Grain Resolution):
So for the most part, it shouldn't cause any problems.
The historical reason for this is that C, and its predecessor B, were originally developed on various models of DEC PDP minicomputers with various word sizes, which supported 8-bit ASCII but could only perform arithmetic on registers. (Not the PDP-11, however; that came later.) Early versions of C defined int to be the native word size of the machine, and any value smaller than an int needed to be widened to int in order to be passed to or from a function, or used in a bitwise, logical or arithmetic expression, because that was how the underlying hardware worked.
That is also why the integer promotion rules still say that any data type smaller than an int is promoted to int. Until C23, C implementations were also allowed to use ones' complement math instead of two's complement for similar historical reasons. The reason that octal character escapes and octal constants are first-class citizens compared to hexadecimal is likewise that those early DEC minicomputers had word sizes divisible into three-bit chunks, but not four-bit nibbles.
This is only tangential to the language specification, but in hardware the CPU usually has only one register size (32 bits, let's say), so whenever it actually works on a char (by adding, subtracting, or comparing it) there is an implicit conversion to int when the value is loaded into the register.
The compiler takes care of properly masking and shifting the number after each operation so that if you add, say, 2 to (unsigned char) 254, it wraps around to 0 instead of 256, but inside the silicon it is really an int until you save it back to memory.
It's something of an academic point, because the language could have specified an 8-bit literal type anyway, but in this case the language specification happens to reflect more closely what the CPU is really doing.
(x86 wonks may note that there is, e.g., a native addh opcode that adds the short-wide registers in one step, but inside the RISC core this translates to two steps: add the numbers, and then extend the sign, like an add/extsh pair on the PowerPC.)
This is the correct behavior, called "integral promotion". It can happen in other cases too (mainly binary operators, if I remember correctly).
Just to be sure, I checked my copy of Expert C Programming: Deep Secrets, and I confirmed that a char literal does not start out with type int. It is initially of type char, but when it is used in an expression it is promoted to an int. The following is quoted from the book:
Discussion on same subject
The original question is "why?"
The reason is that the definition of a literal character has evolved and changed, while trying to remain backwards compatible with existing code.
In the dark days of early C there were no types at all. By the time I first learnt to program in C, types had been introduced, but functions didn't have prototypes to tell the caller what the argument types were. Instead it was standardised that everything passed as a parameter would either be the size of an int (this included all pointers) or it would be a double.
This meant that when you were writing the function, all the parameters that weren't double were stored on the stack as ints, no matter how you declared them, and the compiler put code in the function to handle this for you.
This made things somewhat inconsistent, so when K&R wrote their famous book, they put in the rule that a character literal would always be promoted to an int in any expression, not just a function parameter.
When the ANSI committee first standardised C, they changed this rule so that a character literal would simply be an int, since this seemed a simpler way of achieving the same thing.
When C++ was being designed, all functions were required to have full prototypes (this is still not required in C, although it is universally accepted as good practice). Because of this, it was decided that a character literal could be stored in a char. The advantage of this in C++ is that a function with a char parameter and a function with an int parameter have different signatures. This advantage does not apply in C, which has no function overloading.
This is why they are different. Evolution...