Handling ctype.h integer overflow
What is the proper way to deal with character values that, when cast to unsigned char, fall in the range {INT_MAX + 1 ... UCHAR_MAX}, where UCHAR_MAX is greater than INT_MAX?
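    #include <ctype.h>
    #include <limits.h>

    /* Assumed definition; the original question does not show it. */
    enum { NO, YES, MAYBE };

    int is_digit(char c) {
        unsigned char uchar = c;
        if (uchar > INT_MAX)  /* only possible when UCHAR_MAX > INT_MAX */
            return MAYBE;
        return isdigit((int)uchar) ? YES : NO;
    }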
Comments (2)
The only way UCHAR_MAX can be greater than INT_MAX is on a machine where sizeof(int) == 1, i.e., where char has as many bits as int. On such machines, UCHAR_MAX = UINT_MAX ≥ INT_MAX.

On a 32-bit (or wider) machine, this is unlikely to be a problem. As long as the value in the variable c comes from a text source, no text encoding I know of will cause an overflow. Even UTF-32 uses only the low 21 bits. (Actually, since we're discussing odd systems, I should say that this holds for machines with sizeof(int) = 1 and CHAR_BIT ≥ 22. ☺)

If, on such a machine, is_digit() is nevertheless passed an argument c greater than INT_MAX, the value did not come from a text source. The undefined behavior is a consequence of putting non-character data into a char variable, and that is always something the programmer did, not something the implementation caused.

There is one kind of system where this can be a problem: one with 16-bit char and int that uses a 16-bit character code (e.g., UTF-16) in which the high bit can be set. In that case, it behooves the implementation to define plain char as signed, for exactly this reason. With char signed, it promotes to (signed) int and can safely be passed to the is*() family of functions; with char unsigned, it promotes to unsigned int, and the conversion to signed int is not guaranteed to preserve the value (for out-of-range values the result is implementation-defined).

On such a system, your code is indeed broken, but that would be your own fault, because of the completely unnecessary conversion to unsigned char and the cast (int)uchar, which is dangerous on this system.

To summarize: on systems with sizeof(int) == 1, it is the implementation's responsibility to ensure that every code point, when stored in a char variable, can safely be passed to the ctype.h functions (which expect int arguments). This can always be done. If you have stored something in a char variable that is not a code point and passed it to is*(), then the blame for the undefined behavior is yours and yours alone.
The Unicode character set (the largest in common use) has character codes from 0 to 0x10FFFF. So the only possibility for a character code to be larger than INT_MAX is if int is a 16-bit type (or, more precisely, narrower than 22 bits). If that is the case, then you simply cannot store a character code in an int.

If int is a 32-bit type (or at least 22 bits wide), then a character code will not overflow when converted to an int.
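The arithmetic behind the 22-bit figure can be checked with a short sketch. The helper names bits_needed and int_holds_any_code_point and the macro MAX_CODE_POINT are illustrative assumptions, not part of the original answer: the largest code point, 0x10FFFF, needs 21 value bits, and a signed int needs one more bit for the sign.

```c
#include <limits.h>

/* Largest Unicode code point; long so the constant fits even if int is 16 bits. */
#define MAX_CODE_POINT 0x10FFFFL

/* Number of bits needed to represent v (0 for v == 0). */
static int bits_needed(unsigned long v) {
    int n = 0;
    while (v != 0) {
        n++;
        v >>= 1;
    }
    return n;
}

/* Returns 1 if every Unicode code point is representable as an int. */
static int int_holds_any_code_point(void) {
    return INT_MAX >= MAX_CODE_POINT;
}
```

Here bits_needed(0x10FFFFUL) evaluates to 21, and int_holds_any_code_point() returns 1 whenever int is at least 22 bits wide, which includes the common 32-bit case.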