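Question about unions in C: storing as one type and reading as another. Is it implementation-defined?
I am reading about unions in K&R. As far as I understand, a single variable of a union type can hold any one of several types, and if something is stored as one type and extracted as another, the result is purely implementation-defined.
Now please look at this code snippet: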
#include <stdio.h>
int main(void)
{
    union a
    {
        int i;
        char ch[2];
    };
    union a u;
    u.ch[0] = 3;
    u.ch[1] = 2;
    printf("%d %d %d\n", u.ch[0], u.ch[1], u.i);
    return 0;
}
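Output:
3 2 515
Here I am assigning values through u.ch but retrieving them from both u.ch and u.i. Is this implementation-defined, or am I doing something really silly?
I know this may look very basic to most people, but I cannot figure out the reason behind that output.
Thanks.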
Comments (6)
This is undefined behaviour. u.i and u.ch are located at the same memory address, so the result of writing into one and reading from the other depends on the compiler, platform, architecture, and sometimes even the compiler's optimization level. Therefore the output for u.i may not always be 515.
Example
For example, gcc on my machine produces two different answers for -O0 and -O2.
Because my machine has a 32-bit little-endian architecture, with -O0 I end up with the two least significant bytes initialized to 3 and 2, while the two most significant bytes are uninitialized. So the union's memory looks like this: {3, 2, garbage, garbage}. Hence I get output similar to 3 2 -1216937469.
With -O2, I get the output 3 2 515 like you do, which means the union's memory is {3, 2, 0, 0}. What happens is that gcc optimizes the call to printf by substituting the actual values, so the assembly output looks like the equivalent of:
printf("%d %d %d\n", 3, 2, 515);
The value 515 can be obtained as explained in the other answers to this question. In essence, it means that when gcc optimized the call, it chose zeroes as the random value of the would-be uninitialized part of the union.
Writing to one union member and reading from another usually does not make much sense, but sometimes it may be useful for programs compiled with strict aliasing.
The answer to this question depends on the historical context, since the specification of the language changed over time. And this matter happens to be one of those affected by the changes.
You said that you were reading K&R. The latest edition of that book (as of now) describes the first standardized version of the C language, C89/90. In that version of C, writing one member of a union and reading another member is undefined behavior. Not implementation-defined (which is a different thing), but undefined behavior. The relevant portion of the language standard in this case is 6.5/7.
Now, at some later point in the evolution of C (the C99 version of the language specification with Technical Corrigendum 3 applied), it suddenly became legal to use a union for type punning, i.e. to write one member of the union and then read another.
Note that attempting to do that can still lead to undefined behavior. If the value you read happens to be invalid (a so-called "trap representation") for the type you read it through, then the behavior is still undefined. Otherwise, the value you read is implementation-defined.
Your specific example is relatively safe for type punning from int to a char[2] array. It is always legal in C to reinterpret the content of any object as a char array (again, 6.5/7).
However, the reverse is not true. Writing data into the char[2] array member of your union and then reading it as an int can potentially create a trap representation and lead to undefined behavior. The potential danger exists even if your char array has sufficient length to cover the entire int.
But in your specific case, if int happens to be larger than char[2], the int you read will cover an uninitialized area beyond the end of the array, which again leads to undefined behavior.
The reason behind the output is that on your machine integers are stored in little-endian format: the least significant byte is stored first. Hence the byte sequence [3,2,0,0] represents the integer 3+2*256=515.
This result depends on the specific implementation and the platform.
The output from such code will be dependent on your platform and C compiler implementation. Your output makes me think you're running this code on a little-endian system (probably x86). If you were to put 515 into i and look at it in a debugger, you would see that the lowest-order byte would be a 3 and the next byte in memory would be a 2, which maps exactly to what you put in ch.
If you did this on a big-endian system, you would have (probably) gotten 770 (assuming 16-bit ints) or 50462720 (assuming 32-bit ints).
It is implementation-dependent and the results might vary on a different platform or compiler, but this seems to be what is happening:
515 in binary is
1000000011
Padding with zeros to make it two bytes (assuming a 16-bit int):
0000001000000011
The two bytes are:
00000010 and 00000011
Which is 2 and 3.
I hope someone explains why they are reversed. My guess is that the chars are not reversed but the int is little-endian.
The amount of memory allocated to a union is large enough to hold its biggest member. In this case, you have an int and a char array of length 2. Assuming int is 16 bits and char is 8 bits, both require the same space and hence the union is allocated two bytes.
When you assign three (00000011) and two (00000010) to the char array, the state of the union is 0000001100000010. When you read the int from this union, it converts the whole thing into an integer. Assuming a little-endian representation, where the LSB is stored at the lowest address, the int read from the union would be 0000001000000011, which is the binary for 515.
NOTE: This holds true even if the int is 32 bits. Check Amnon's answer.
If you're on a 32-bit system, then an int is 4 bytes but you initialise only 2 bytes. Accessing uninitialised data is undefined behaviour.
Assuming you're on a system with 16-bit ints, then what you are doing is still implementation-defined. If your system is little-endian, then u.ch[0] will correspond to the least significant byte of u.i and u.ch[1] will be the most significant byte. On a big-endian system, it's the other way around. Also, the C standard does not force the implementation to use two's complement to represent signed integer values, though two's complement is the most common. Obviously, the size of an integer is also implementation-defined.
Hint: it's easier to see what's happening if you use hexadecimal values. On a little endian system, the result in hex would be 0x0203.