Why are C character literals ints instead of chars?

Posted 2024-07-10 19:52:22 · 264 words · 12 views · 0 comments

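In C++, sizeof('a') == sizeof(char) == 1. This makes intuitive sense, since 'a' is a character literal and sizeof(char) == 1 is defined by the standard.

In C, however, sizeof('a') == sizeof(int). That is, C character literals appear to actually be integers. Does anyone know why? I can find plenty of mentions of this C quirk, but no explanation of why it exists.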


梦旅人picnic 2024-07-17 19:52:23

I don't know the specific reasons why a character literal in C is of type int. But in C++, there is a good reason not to go that way. Consider this:

void print(int);
void print(char);

print('a');

You would expect the call to print to select the second version, the one taking a char. If a character literal were an int, that would be impossible. Note that in C++, literals with more than one character still have type int, although their value is implementation-defined. So 'ab' has type int, while 'a' has type char.


み零 2024-07-17 19:52:23

Using GCC on my MacBook, I try:

#include <stdio.h>

/* Print the size in bytes of a type or expression.
   sizeof yields a size_t, so the matching format is %zu, not %i. */
#define test(A) do{printf(#A":\t%zu\n",sizeof(A));}while(0)
int main(void){
  test('a');
  test("a");
  test("");
  test(char);
  test(short);
  test(int);
  test(long);
  test((char)0x0);
  test((short)0x0);
  test((int)0x0);
  test((long)0x0);
  return 0;
}

which when run gives:

'a':    4
"a":    2
"":     1
char:   1
short:  2
int:    4
long:   4
(char)0x0:      1
(short)0x0:     2
(int)0x0:       4
(long)0x0:      4

which suggests that a char occupies one byte, like you suspect, but a character literal is an int.


爱情眠于流年 2024-07-17 19:52:23

Back when C was being written, the PDP-11's MACRO-11 assembly language had:

MOV #'A, R0      ; 8-bit character encoding for 'A' into 16-bit register

This kind of thing's quite common in assembly language - the low 8 bits will hold the character code, other bits cleared to 0. PDP-11 even had:

MOV #"AB, R0     ; 16-bit character encoding for 'A' (low byte) and 'B' (high byte)

This provided a convenient way to load two characters into the low and high bytes of the 16 bit register. You might then write those elsewhere, updating some textual data or screen memory.

So, the idea of characters being promoted to register size is quite normal and desirable. But, let's say you need to get 'A' into a register not as part of the hard-coded opcode, but from somewhere in main memory containing:

address: value
20: 'X'
21: 'A'
22: 'A'
23: 'X'
24: 0
25: 'A'
26: 'A'
27: 0
28: 'A'

If you want to read just an 'A' from this main memory into a register, which one would you read?

Some CPUs may only directly support reading a 16 bit value into a 16-bit register, which would mean a read at 20 or 22 would then require the bits from 'X' be cleared out, and depending on the endianness of the CPU one or other would need shifting into the low order byte.
Some CPUs may require a memory-aligned read, which means that the lowest address involved must be a multiple of the data size: you might be able to read from addresses 24 and 25, but not 27 and 28.

So, a compiler generating code to get an 'A' into the register may prefer to waste a little extra memory and encode the value as 0 'A' or 'A' 0, depending on endianness, while also ensuring it is aligned properly (i.e. not at an odd memory address).

My guess is that C's simply carried this level of CPU-centric behaviour over, thinking of character constants occupying register sizes of memory, bearing out the common assessment of C as a "high level assembler".

(See 6.3.3 on page 6-25 of the PDP-11 MACRO-11 Language Reference Manual)


江南烟雨〆相思醉 2024-07-17 19:52:23

I remember reading K&R and seeing a code snippet that would read a character at a time until it hit EOF. Since all characters are valid characters to be in a file/input stream, this means that EOF cannot be any char value. The code put the read character into an int, tested for EOF, and converted to a char if it wasn't.

I realize this doesn't exactly answer your question, but it would make some sense for the rest of the character literals to be sizeof(int) if the EOF literal was.

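int r;                  /* int, not char, so EOF stays distinguishable */
char buffer[1024], *p;
p = buffer;

/* file is assumed to be an already opened FILE *.
   Stop one short of the end so the buffer cannot overflow. */
while (p < buffer + sizeof buffer - 1 && (r = getc(file)) != EOF)
{
  *(p++) = (char) r;
}
*p = '\0';              /* terminate the collected text */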


桃酥萝莉 2024-07-17 19:52:23

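I haven't seen a rationale for it (C char literals being int types), but here's something Stroustrup had to say about it (from Design and Evolution 11.2.1 - Fine-Grain Resolution):

In C, the type of a character literal such as 'a' is int. Surprisingly, giving 'a' type char in C++ doesn't cause any compatibility problems. Except for the pathological example sizeof('a'), every construct that can be expressed in both C and C++ gives the same result.

So for the most part, it should cause no problems.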


思念满溢 2024-07-17 19:52:23

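The historical reason for this is that C, and its predecessor B, were originally developed on various models of DEC PDP minicomputers with various word sizes, which supported 8-bit ASCII but could only perform arithmetic on registers. (Not the PDP-11, however; that came later.) Early versions of C defined int to be the native word size of the machine, and any value smaller than an int needed to be widened to int in order to be passed to or from a function, or used in a bitwise, logical or arithmetic expression, because that was how the underlying hardware worked.

That is also why the integer promotion rules still say that any data type smaller than an int is promoted to int. C implementations have also traditionally been allowed to use one's-complement math instead of two's-complement for similar historical reasons. The reason that octal character escapes and octal constants are first-class citizens compared to hex is likewise that those early DEC minicomputers had word sizes divisible into three-bit groups but not into four-bit nibbles.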


掐死时间 2024-07-17 19:52:23

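This is only tangential to the language spec, but in hardware the CPU usually has only one register size (32 bits, let's say), so whenever it actually works on a char (by adding, subtracting, or comparing it) there is an implicit conversion to int when the value is loaded into the register.

The compiler takes care of properly masking and shifting the number after each operation, so that if you add, say, 2 to (unsigned char) 254, it wraps around to 0 instead of 256, but inside the silicon it is really an int until you save it back to memory.

It's sort of an academic point, because the language could have specified an 8-bit literal type anyway, but in this case the language spec happens to more closely reflect what the CPU is really doing.

(x86 wonks may note that there is, e.g., a native addh opcode that adds the short-wide registers in one step, but inside the RISC core this translates to two steps: add the numbers, then extend the sign, like an add/extsh pair on the PowerPC.)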


你げ笑在眉眼 2024-07-17 19:52:23

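This is the correct behavior, called "integral promotion". It can happen in other cases too (mainly binary operators, if I remember correctly).

Just to be sure, I checked my copy of Expert C Programming: Deep Secrets, and I confirmed that a char literal does not start with a type int. It is initially of type char, but when it is used in an expression, it is promoted to an int. The following is quoted from the book:

Character literals have type int and they get there by following the rules for promotion from type char. This is too briefly covered in K&R 1, on page 39 where it says:

Every char in an expression is converted into an int.... Notice that all floats in an expression are converted to double.... Since a function argument is an expression, type conversions also take place when arguments are passed to functions: in particular, char and short become int, float becomes double.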