Handling ctype.h integer overflow

Posted on 2024-09-07 13:31:37

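What is the proper way to deal with character values which, when cast to unsigned char, fall in the range {INT_MAX + 1 ... UCHAR_MAX} on a system where UCHAR_MAX is greater than INT_MAX?

#include <ctype.h>
#include <limits.h>

enum { NO, YES, MAYBE };  /* result codes; assumed, not defined in the original post */

int is_digit(char c) {
    unsigned char uchar = c;
    if (uchar > INT_MAX)  /* only possible where UCHAR_MAX > INT_MAX */
        return MAYBE;
    return isdigit((int)uchar) ? YES : NO;
}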

Comments (2)

戈亓 2024-09-14 13:31:37

The only way UCHAR_MAX will be greater than INT_MAX is if you’re on a machine with sizeof(int) == 1; i.e., where char has as many bits as int. On these machines, UCHAR_MAX == UINT_MAX > INT_MAX.
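As a rough illustration of the condition described above, the following sketch checks whether the current platform is one of these unusual machines. The helper names uchar_can_exceed_int and print_limits are made up for this example and do not come from the original answer.

```c
#include <limits.h>
#include <stdio.h>

/* Hypothetical helper: returns 1 if some unsigned char value
   cannot be represented as an int on this platform. */
int uchar_can_exceed_int(void)
{
    /* UCHAR_MAX has type int when it fits in int and unsigned int
       otherwise, so the comparison is well defined either way. */
    return UCHAR_MAX > INT_MAX;
}

/* Hypothetical helper: prints the limits relevant to the question. */
void print_limits(void)
{
    printf("CHAR_BIT = %d, sizeof(int) = %lu\n",
           CHAR_BIT, (unsigned long)sizeof(int));
    printf("UCHAR_MAX > INT_MAX: %s\n",
           uchar_can_exceed_int() ? "yes" : "no");
}
```

On typical desktop and server platforms, where char is 8 bits and int is at least 16, uchar_can_exceed_int() should return 0, which means the MAYBE branch in the question can never be taken there.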

On a 32-bit (or greater) machine, this is unlikely to be a problem. So long as the value in the variable c comes from a text source, there is no textual encoding I know of that will cause an overflow. Even ‘UTF-32’ will only have the low 21 bits active. (Actually, since we’re discussing odd systems, I should say that this works for machines with sizeof(int) == 1 and CHAR_BIT ≥ 22. ☺)

If on such a machine is_digit() was nevertheless passed an argument c greater than INT_MAX, it did not come from a text source. The undefined behavior is a consequence of putting non-character data into a char variable, and that will always be something the programmer did, not something the implementation caused.

There is a system where this can be a problem: one with 16-bit char and int that uses a 16-bit character code (e.g., UTF-16) in which the high bit can be set. If such is the case, it behooves the implementation to define plain char as signed, exactly for this reason. With char signed, it will promote to (signed) int and can safely be passed to the is*() family of functions; with char unsigned, it will promote to unsigned int, and the cast to signed int is not guaranteed to preserve the value (for values above INT_MAX the result is implementation-defined).

On such a system, your code is indeed broken, but that would be your own fault for the completely unnecessary conversion to unsigned char and the dangerous (on this system) cast (int)uchar.

To summarize: On systems with sizeof(int) == 1, it is the implementation’s responsibility to ensure that every code point, when stored in a char variable, can safely be passed to the ctype.h functions (which expect int arguments). This can always be done. If you’ve stored something in a char variable that is not a code point and passed that to is*(), then the blame for the undefined behavior is yours and yours alone.
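For the specific case of decimal digits, one possible way to sidestep the question is to avoid ctype.h altogether. The sketch below is an alternative that is not part of the answer above; the name is_decimal_digit is hypothetical. It relies only on the C guarantee that the characters '0' through '9' have consecutive values, so no conversion to unsigned char or int cast is needed.

```c
/* Hypothetical alternative to the is_digit() in the question.
   The comparison is done on the char value directly (after the usual
   integer promotion), so no out-of-range value is ever passed to a
   ctype.h function. */
int is_decimal_digit(char c)
{
    return c >= '0' && c <= '9';
}
```

This only covers the ten decimal digits; other classifications such as isalpha() have no comparably simple portable range check.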

感性 2024-09-14 13:31:37

The Unicode character set (which is the largest commonly used) has character codes from 0 to 0x10ffff. So the only possibility for a character code to be larger than INT_MAX is if int is a 16-bit type (or, specifically, narrower than 22 bits). If that were the case, then you simply cannot store every character code in an int.

If int is a 32-bit type (or at least 22 bits wide), then a character code will not overflow when cast to an int.
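The arithmetic behind the 22-bit figure can be sketched as follows: 0x10ffff needs 21 value bits, and a signed int needs one more bit for the sign. The macro UNICODE_MAX and the helpers bits_needed and code_point_fits_in_int are hypothetical names introduced for this example.

```c
#include <limits.h>

#define UNICODE_MAX 0x10FFFFL  /* highest Unicode code point */

/* Hypothetical helper: number of bits needed to represent v
   (returns 0 for v == 0). */
int bits_needed(unsigned long v)
{
    int n = 0;
    while (v != 0) {
        n++;
        v >>= 1;
    }
    return n;
}

/* Hypothetical helper: returns 1 if cp is a valid code point value
   that an int on this platform can hold. long is used for the
   parameter because it is guaranteed to be at least 32 bits wide. */
int code_point_fits_in_int(long cp)
{
    return cp >= 0 && cp <= UNICODE_MAX && cp <= INT_MAX;
}
```

With a 32-bit int, code_point_fits_in_int(UNICODE_MAX) should return 1; with a 16-bit int, where INT_MAX may be as small as 32767, it would return 0 for most of the range.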
