What does it mean for a char to be signed?
Given that signed and unsigned ints use the same registers, etc., and just interpret bit patterns differently, and C chars are basically just 8-bit ints, what's the difference between signed and unsigned chars in C? I understand that the signedness of char is implementation-defined, and I simply can't understand how it could ever make a difference, at least when char is used to hold strings rather than to do math.
Signedness works pretty much the same way in chars as it does in other integral types. As you've noted, chars are really just one-byte integers. (Not necessarily 8-bit, though! There's a difference; a byte might be bigger than 8 bits on some platforms, and chars are rather tied to bytes due to the definitions of char and sizeof(char). The CHAR_BIT macro, defined in <limits.h> or C++'s <climits>, will tell you how many bits are in a char.)
As for why you'd want a character with a sign: in C and C++, there is no standard type called byte. To the compiler, chars are bytes and vice versa, and it doesn't distinguish between them. Sometimes, though, you want that char to be a one-byte number, and in those cases (particularly given how small a range a byte can have), you also typically care whether the number is signed or not. I've personally used signedness (or unsignedness) to say that a certain char is a (numeric) "byte" rather than a character, and that it's going to be used numerically. Without a specified signedness, that char really is a character, and is intended to be used as text.
I used to do that, rather. Now the newer versions of C and C++ have (u?)int_least8_t (currently typedef'd in <stdint.h> or <cstdint>), which are more explicitly numeric (though they'll typically just be typedefs for signed and unsigned char types anyway).
The only situation I can imagine this being an issue is if you choose to do math on chars. It's perfectly legal to write the following code.
Depending on the signedness of char, c could be one of two values. If chars are unsigned, then c will be (char)162. If they are signed, the sum does not fit, as the maximum value of an 8-bit signed char is 127; the result of the conversion is implementation-defined, and most implementations simply wrap around to (char)-94.
One thing about signed chars is that you can test c >= ' ' (space) and be sure it's in the 7-bit ASCII range from space upward, which is the normal printable set plus DEL (127). Of course, it's not portable, so not very useful.
It won't make a difference for strings. But in C you can use a char to do math, and then it will make a difference.
In fact, when working in constrained memory environments, like embedded 8-bit applications, a char will often be used to do math, and then it makes a big difference. This is because there is no byte type by default in C.
In terms of the values they represent:
unsigned char:
0..255 (00000000..11111111)
values overflow around low edge as: 0 - 1 = 255 (00000000 - 00000001 = 11111111)
values overflow around high edge as: 255 + 1 = 0 (11111111 + 00000001 = 00000000)
bitwise right shift operator (>>) does a logical shift: 10000000 >> 1 = 01000000 (128 / 2 = 64)
signed char:
-128..127 (10000000..01111111)
values overflow around low edge as: -128 - 1 = 127 (10000000 - 00000001 = 01111111)
values overflow around high edge as: 127 + 1 = -128 (01111111 + 00000001 = 10000000)
bitwise right shift operator (>>) does an arithmetic shift: 10000000 >> 1 = 11000000 (-128 / 2 = -64)
I included the binary representations to show that the value wrapping behaviour is pure, consistent binary arithmetic and has nothing to do with a char being signed/unsigned (except for right shifts).
Update
Some implementation-specific behaviour was mentioned in the comments. In C, the result of converting an out-of-range value to signed char and the result of right-shifting a negative value are both implementation-defined, so the signed char figures above describe a typical 8-bit two's complement implementation rather than guaranteed behaviour.
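It's important when sorting strings: a comparison written in terms of plain char values orders bytes above 127 before ASCII text if char is signed and after it if char is unsigned.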
There are a couple of differences. Most importantly, if you exceed the valid range of a char by assigning it an integer that is too big or too small, and char is signed, the resulting value is implementation-defined, or (in C) a signal could even be raised, as for all signed types. Contrast that with the case where you assign something too big or too small to an unsigned char: the value wraps around, and you get precisely defined semantics. For example, assigning -1 to an unsigned char gives you UCHAR_MAX. So whenever you have a byte in the sense of a number from 0 to 2^CHAR_BIT-1, you should really use unsigned char to store it.
The sign also makes a difference when passing to vararg functions:
Assume the value assigned to c is too big for char to represent, and the machine uses two's complement. Many implementations handle the assignment of a too-large value to a char by leaving the bit pattern unchanged. If an int is able to represent all values of char (which it is on most implementations), then the char is promoted to int before being passed to printf. So, if char is signed, the value passed is negative, and promoting it to int retains that sign: you get a negative result. However, if char is unsigned, then the value is unsigned, and promoting it to an int yields a positive int. If you use unsigned char, you get precisely defined behavior both for the assignment to the variable and for passing it to printf, which will then print something positive.
Note that char, unsigned char and signed char are all at least 8 bits wide. There is no requirement that char be exactly 8 bits wide. That is true for most systems, but on some you will find 32-bit chars. A byte in C and C++ is defined to have the size of char, so a byte in C is not always exactly 8 bits either.
Another difference is that in C, an unsigned char must have no padding bits. That is, if you find CHAR_BIT is 8, then an unsigned char's values must range from 0 to 2^CHAR_BIT-1. The same is true for char if it's unsigned. For signed char, you can't assume anything about the range of values beyond the guaranteed minimum: even if you know how your compiler implements the sign (two's complement or the other options), there may be unused padding bits in it. In C++, none of the three character types has padding bits.
Traditionally, the ASCII character set consists of 7-bit character encodings. (As opposed to the 8-bit EBCDIC.)
When the C language was designed and implemented, this was a significant issue. (For various reasons, like data transmission over serial modem devices.) The extra bit has uses like parity.
A "signed character" happens to be perfect for this representation.
Binary data, OTOH, simply takes the value of each 8-bit "chunk" of data, so no sign is needed.
Arithmetic on bytes is important for computer graphics (where 8-bit values are often used to store colors). Aside from that, I can think of two main cases where char sign matters:
The nasty thing is, these won't bite you if all your string data is 7-bit. However, it promises to be an unending source of obscure bugs if you're trying to make your C/C++ program 8-bit clean.