Why are C character literals ints instead of chars?

Posted on 2024-07-10 19:52:22

In C++, sizeof('a') == sizeof(char) == 1. This makes intuitive sense, since 'a' is a character literal, and sizeof(char) == 1 as defined by the standard.

In C however, sizeof('a') == sizeof(int). That is, it appears that C character literals are actually integers. Does anyone know why? I can find plenty of mentions of this C quirk but no explanation for why it exists.

Comments (10)

梦旅人picnic 2024-07-17 19:52:23

I don't know the specific reasons why a character literal in C is of type int. But in C++, there is a good reason not to go that way. Consider this:

void print(int);
void print(char);

print('a');

You would expect that the call to print selects the second version taking a char. Having a character literal being an int would make that impossible. Note that in C++ literals having more than one character still have type int, although their value is implementation defined. So, 'ab' has type int, while 'a' has type char.

み零 2024-07-17 19:52:23

Using GCC on my MacBook, I try:

#include <stdio.h>

/* sizeof yields a size_t, so print it with %zu rather than %i */
#define test(A) do { printf(#A ":\t%zu\n", sizeof(A)); } while (0)
int main(void){
  test('a');
  test("a");
  test("");
  test(char);
  test(short);
  test(int);
  test(long);
  test((char)0x0);
  test((short)0x0);
  test((int)0x0);
  test((long)0x0);
  return 0;
}

which when run gives:

'a':    4
"a":    2
"":     1
char:   1
short:  2
int:    4
long:   4
(char)0x0:      1
(short)0x0:     2
(int)0x0:       4
(long)0x0:      4

which shows that a char occupies one byte, as you suspected, but a character literal has the size of an int.

爱情眠于流年 2024-07-17 19:52:23

Back when C was being written, the PDP-11's MACRO-11 assembly language had:

MOV #'A, R0      // 8-bit character encoding for 'A' into 16 bit register

This kind of thing's quite common in assembly language - the low 8 bits will hold the character code, other bits cleared to 0. PDP-11 even had:

MOV #"AB, R0     // 16-bit character encoding for 'A' (low byte) and 'B'

This provided a convenient way to load two characters into the low and high bytes of the 16 bit register. You might then write those elsewhere, updating some textual data or screen memory.

So, the idea of characters being promoted to register size is quite normal and desirable. But, let's say you need to get 'A' into a register not as part of the hard-coded opcode, but from somewhere in main memory containing:

address: value
20: 'X'
21: 'A'
22: 'A'
23: 'X'
24: 0
25: 'A'
26: 'A'
27: 0
28: 'A'

If you want to read just an 'A' from this main memory into a register, which one would you read?

  • Some CPUs may only directly support reading a 16 bit value into a 16-bit register, which would mean a read at 20 or 22 would then require the bits from 'X' be cleared out, and depending on the endianness of the CPU one or other would need shifting into the low order byte.

  • Some CPUs may require a memory-aligned read, which means that the lowest address involved must be a multiple of the data size: you might be able to read from addresses 24 and 25, but not 27 and 28.

So, a compiler generating code to get an 'A' into the register may prefer to waste a little extra memory and encode the value as 0 'A' or 'A' 0—depending on endianness, and also ensuring it is aligned properly (i.e. not at an odd memory address).

My guess is that C simply carried this level of CPU-centric behaviour over, treating character constants as occupying a register-sized piece of memory, which bears out the common assessment of C as a "high level assembler".

(See section 6.3.3 on page 6-25 of the PDP-11 MACRO-11 Language Reference Manual.)

江南烟雨〆相思醉 2024-07-17 19:52:23

I remember reading K&R and seeing a code snippet that would read a character at a time until it hit EOF. Since all characters are valid characters to be in a file/input stream, this means that EOF cannot be any char value. The code put the read character into an int, tested for EOF, and converted to a char if it wasn't.

I realize this doesn't exactly answer your question, but it would make some sense for character literals to be sizeof(int) if EOF, which has to be distinguishable from every char value, is an int as well.

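int r;
char buffer[1024], *p = buffer;

/* getc() returns an int so that EOF can be told apart from every char value.
   The bounds check stops before the buffer overflows and leaves room for '\0'. */
while (p < buffer + sizeof buffer - 1 && (r = getc(file)) != EOF)
{
  *p++ = (char) r;
}
*p = '\0';

桃酥萝莉 2024-07-17 19:52:23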

I haven't seen a rationale for it (C char literals being int types), but here's something Stroustrup had to say about it (from Design and Evolution 11.2.1 - Fine-Grain Resolution):

In C, the type of a character literal such as 'a' is int.
Surprisingly, giving 'a' type char in C++ doesn't cause any compatibility problems.
Except for the pathological example sizeof('a'), every construct that can be expressed
in both C and C++ gives the same result.

So for the most part, it shouldn't cause any problems.

思念满溢 2024-07-17 19:52:23

The historical reason for this is that C, and its predecessor B, were originally developed on various models of DEC PDP minicomputers with various word sizes, which supported 8-bit ASCII but could only perform arithmetic on registers. (Not the PDP-11, however; that came later.) Early versions of C defined int to be the native word size of the machine, and any value smaller than an int needed to be widened to int in order to be passed to or from a function, or used in a bitwise, logical or arithmetic expression, because that was how the underlying hardware worked.

That is also why the integer promotion rules still say that any data type smaller than an int is promoted to int. C implementations are also allowed to use ones' complement math instead of two's complement for similar historical reasons. The reason that octal character escapes and octal constants are first-class citizens compared to hexadecimal is likewise that those early DEC minicomputers had word sizes divisible into three-bit chunks, but not four-bit nibbles.

掐死时间 2024-07-17 19:52:23

This is only tangential to the language specification, but in hardware the CPU usually only has one register size—32 bits, let's say—and so whenever it actually works on a char (by adding, subtracting, or comparing it) there is an implicit conversion to int when it is loaded into the register.

The compiler takes care of properly masking and shifting the number after each operation so that if you add, say, 2 to (unsigned char) 254 and store the result back in an unsigned char, it'll wrap around to 0 instead of 256, but inside the silicon it is really an int until you save it back to memory.

It's sort of an academic point, because the language could have specified an 8-bit literal type anyway, but in this case the language specification happens to reflect more closely what the CPU is really doing.

(x86 wonks may note that there is, e.g., a native addh opcode that adds the short-wide registers in one step, but inside the RISC core this translates to two steps: add the numbers, and then extend the sign, like an add/extsh pair on the PowerPC.)

你げ笑在眉眼 2024-07-17 19:52:23

This is the correct behavior, called "integral promotion". It can happen in other cases too (mainly binary operators, if I remember correctly).

Just to be sure, I checked my copy of Expert C Programming: Deep Secrets, and I confirmed that a char literal does not start with a type int. It is initially of type char but when it is used in an expression, it is promoted to an int. The following is quoted from the book:

Character literals have type int and
they get there by following the rules
for promotion from type char. This is
too briefly covered in K&R 1, on page
39 where it says:

Every char in an expression is
converted into an int....Notice that
all float's in an expression are
converted to double....Since a
function argument is an expression,
type conversions also take place when
arguments are passed to functions: in
particular, char and short become int,
float becomes double.

煮酒 2024-07-17 19:52:22

Discussion on same subject

"More specifically the integral promotions. In K&R C it was virtually (?)
impossible to use a character value without it being promoted to int first,
so making character constant int in the first place eliminated that step.
There were and still are multi character constants such as 'abcd' or however
many will fit in an int."

笑着哭最痛 2024-07-17 19:52:22

The original question is "why?"

The reason is that the definition of a literal character has evolved and changed, while trying to remain backwards compatible with existing code.

In the dark days of early C there were no types at all. By the time I first learnt to program in C, types had been introduced, but functions didn't have prototypes to tell the caller what the argument types were. Instead it was standardised that everything passed as a parameter would either be the size of an int (this included all pointers) or it would be a double.

This meant that when you were writing the function, all the parameters that weren't double were stored on the stack as ints, no matter how you declared them, and the compiler put code in the function to handle this for you.

This made things somewhat inconsistent, so when K&R wrote their famous book, they put in the rule that a character literal would always be promoted to an int in any expression, not just a function parameter.

When the ANSI committee first standardised C, they changed this rule so that a character literal would simply be an int, since this seemed a simpler way of achieving the same thing.

When C++ was being designed, all functions were required to have full prototypes (this is still not required in C, although it is universally accepted as good practice). Because of this, it was decided that a character literal could be stored in a char. The advantage of this in C++ is that a function with a char parameter and a function with an int parameter have different signatures. This advantage does not apply in C, which has no function overloading.

This is why they are different. Evolution...
