Difference between scanf() and strtol() / strtod() in parsing numbers

Posted on 2024-08-04 23:11:55

Note: I completely reworked the question to more properly reflect what I am setting the bounty for. Please excuse any inconsistencies with already-given answers this might have created. I did not want to create a new question, as previous answers to this one might be helpful.


I am working on implementing a C standard library, and am confused about one specific corner of the standard.

The standard defines the number formats accepted by the scanf function family (%d, %i, %u, %o, %x) in terms of the definitions for strtol, strtoul, and strtod.

The standard also says that fscanf() will only put back a maximum of one character into the input stream, and that therefore some sequences accepted by strtol, strtoul and strtod are unacceptable to fscanf (ISO/IEC 9899:1999, footnote 251).

I tried to find some values that would exhibit such differences. It turns out that the hexadecimal prefix "0x", followed by a character that is not a hexadecimal digit, is one such case where the two function families differ.

Funny enough, it became apparent that no two available C libraries seem to agree on the output. (See test program and example output at the end of this question.)

What I would like to hear is what would be considered standard-compliant behaviour when parsing "0xz", ideally citing the relevant parts of the standard to make the point.

#include <stdio.h>
#include <stdlib.h>

int main( void )
{
    int i, count, rc;
    unsigned u;
    char * endptr = NULL;
    char culprit[] = "0xz";

    /* File I/O to assert fscanf == sscanf */
    FILE * fh = fopen( "testfile", "w+" );
    if ( fh == NULL )
    {
        perror( "fopen" );
        return EXIT_FAILURE;
    }
    fprintf( fh, "%s", culprit );
    rewind( fh );

    /* fscanf base 16 */
    u = -1; count = -1;
    rc = fscanf( fh, "%x%n", &u, &count );
    /* casts keep the arguments in line with the %d conversions */
    printf( "fscanf:  Returned %d, result %2d, consumed %d\n", rc, (int)u, count );
    rewind( fh );

    /* strtoul base 16 */
    u = strtoul( culprit, &endptr, 16 );
    printf( "strtoul:             result %2d, consumed %d\n", (int)u, (int)( endptr - culprit ) );

    puts( "" );

    /* fscanf base 0 */
    i = -1; count = -1;
    rc = fscanf( fh, "%i%n", &i, &count );
    printf( "fscanf:  Returned %d, result %2d, consumed %d\n", rc, i, count );
    rewind( fh );

    /* strtol base 0 */
    i = strtol( culprit, &endptr, 0 );
    printf( "strtoul:             result %2d, consumed %d\n", i, (int)( endptr - culprit ) );

    fclose( fh );
    return 0;
}

/* newlib 1.14

fscanf:  Returned 1, result  0, consumed 1
strtoul:             result  0, consumed 0

fscanf:  Returned 1, result  0, consumed 1
strtoul:             result  0, consumed 0
*/

/* glibc-2.8

fscanf:  Returned 1, result  0, consumed 2
strtoul:             result  0, consumed 1

fscanf:  Returned 1, result  0, consumed 2
strtoul:             result  0, consumed 1
*/

/* Microsoft MSVC

fscanf:  Returned 0, result -1, consumed -1
strtoul:             result  0, consumed 0

fscanf:  Returned 0, result  0, consumed -1
strtoul:             result  0, consumed 0
*/

/* IBM AIX

fscanf:  Returned 0, result -1, consumed -1
strtoul:             result  0, consumed 1

fscanf:  Returned 0, result  0, consumed -1
strtoul:             result  0, consumed 1
*/

Comments (8)

情话难免假 2024-08-11 23:11:55

Communication with Fred J. Tydeman, Vice-Chair of PL22.11 (ANSI "C"), on comp.std.c shed some light on this:

fscanf

An input item is defined as the
longest sequence of input characters
[...] which is, or is a prefix of, a
matching input sequence. (7.19.6.2 P9)

This makes "0x" the longest sequence that is a prefix of a matching input sequence. (Even with %i conversion, as the hex "0x" is a longer sequence than the decimal "0".)

The first character, if any, after the
input item remains unread. (7.19.6.2 P9)

This makes fscanf read the "z" and put it back as non-matching (honoring the one-character pushback limit of footnote 251).

If the input item is not a matching
sequence, the execution of the
directive fails: this condition is a
matching failure. (7.19.6.2 P10)

This makes "0x" fail to match, i.e. fscanf should assign no value, return zero (if the %x or %i was the first conversion specifier), and leave "z" as the first unread character in the input stream.
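The input-item rule quoted above can be modelled on an in-memory string. What follows is a minimal sketch, not code taken from any C library: hex_input_item is a hypothetical helper that applies the reading of 7.19.6.2 P9/P10 given in this answer to a %x conversion. It assumes that leading white space has already been skipped, and it ignores field widths.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: models the fscanf() input-item rule for %x.
 * Stores the length of the input item in *item_len and returns 1 if
 * that item is a matching sequence, 0 if it is a matching failure. */
int hex_input_item( const char * s, size_t * item_len )
{
    size_t n = 0;
    size_t digits = 0;

    /* an optional sign is a prefix of a matching sequence */
    if ( s[n] == '+' || s[n] == '-' )
    {
        ++n;
    }

    /* "0x" is also such a prefix, so it becomes part of the input item
     * even when no hexadecimal digit follows */
    if ( s[n] == '0' && ( s[n + 1] == 'x' || s[n + 1] == 'X' ) )
    {
        n += 2;
    }

    while ( isxdigit( (unsigned char)s[n] ) )
    {
        ++n;
        ++digits;
    }

    *item_len = n;
    return digits > 0;
}
```

Under these assumptions, "0xz" yields an input item of length two that is not a matching sequence, which corresponds to the matching failure described above.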

strtol

The definition of strtol (and strtoul) differs in one crucial point:

The subject sequence is defined as the
longest initial subsequence of the
input string, starting with the first
non-white-space character, that is of
the expected form. (7.20.1.4 P4, emphasis mine)

This means that strtol should look for the longest valid sequence, in this case the "0". It should point endptr to the "x", and return zero as result.
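For comparison, the subject-sequence rule can be sketched the same way. hex_subject_end is a hypothetical helper, not a library function; it assumes base 16 and only computes where endptr would point under the reading given in this answer, without converting a value.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: returns the offset just past the subject
 * sequence for base 16, or 0 if the subject sequence is empty. */
size_t hex_subject_end( const char * s )
{
    size_t n = 0;
    size_t start;

    while ( isspace( (unsigned char)s[n] ) )
    {
        ++n;
    }

    if ( s[n] == '+' || s[n] == '-' )
    {
        ++n;
    }

    /* the "0x" prefix only belongs to a sequence of the expected form
     * if at least one hexadecimal digit follows it */
    if ( s[n] == '0'
         && ( s[n + 1] == 'x' || s[n + 1] == 'X' )
         && isxdigit( (unsigned char)s[n + 2] ) )
    {
        n += 2;
    }

    start = n;
    while ( isxdigit( (unsigned char)s[n] ) )
    {
        ++n;
    }

    return ( n > start ) ? n : 0;
}
```

With the whole string available, looking two characters past the "0" costs nothing, which is why this rule can settle on "0" for "0xz" while a stream-based scanner cannot.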

錯遇了你 2024-08-11 23:11:55

I don't believe the parsing is allowed to produce different results. The Plauger reference is just pointing out that the strtol() implementation might be a different, more efficient version, as it has complete access to the entire string.

ペ泪落弦音 2024-08-11 23:11:55

According to the C99 spec, the scanf() family of functions parses integers the same way as the strto*() family of functions. For example, for the conversion specifier x this reads:

Matches an optionally signed
hexadecimal integer, whose format is
the same as expected for the subject
sequence of the strtoul function with
the value 16 for the base argument.

So if sscanf() and strtoul() give different results, the libc implementation doesn't conform.

What the expected results of your sample code should be is a bit unclear, though:

strtoul() accepts an optional prefix of 0x or 0X if base is 16, and the spec reads

The subject sequence is defined as the
longest initial subsequence of the
input string, starting with the first
non-white-space character, that is of
the expected form.

For the string "0xz", in my opinion the longest initial subsequence of expected form is "0", so the value should be 0 and the endptr argument should be set to point to the x.

mingw-gcc 4.4.0 disagrees and fails to parse the string with both strtoul() and sscanf(). The reasoning could be that the longest initial subsequence of expected form is "0x" - which is not a valid integer literal, so no parsing is done.

I think this interpretation of the standard is wrong: A subsequence of expected form should always yield a valid integer value (if out of range, the MIN/MAX values are returned and errno is set to ERANGE).

cygwin-gcc 3.4.4 (which uses newlib as far as I know) will also not parse the literal if strtoul() is used, but parses the string according to my interpretation of the standard with sscanf().

Beware that my interpretation of the standard is prone to your initial problem, i.e. that the standard only guarantees to be able to ungetc() once. To decide if the 0x is part of the literal, you have to read ahead two characters: the x and the following character. If that character is not a hex digit, both have to be pushed back. If there are more tokens to parse, you can buffer them and work around this problem, but if it's the last token, you have to ungetc() both characters.

I'm not really sure what fscanf() should do if ungetc() fails. Maybe just set the stream's error indicator?
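The pushback problem can be made concrete with a stream-based sketch. scan_hex_stream and hex_value are hypothetical helpers, not part of any real library. The sketch handles only unsigned hexadecimal input, with no sign, white-space skipping, field width, or overflow handling. It follows the alternative reading under which "0x" without digits is a matching failure, so a single ungetc() is always enough.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: value of a character known to be a hex digit. */
static int hex_value( int c )
{
    static const char digits[] = "0123456789abcdef";
    return (int)( strchr( digits, tolower( c ) ) - digits );
}

/* Hypothetical helper: reads a hexadecimal number from fp using at
 * most one character of pushback. Returns 1 and stores the value in
 * *out on a match; returns 0 and leaves *out untouched on failure. */
int scan_hex_stream( FILE * fp, unsigned long * out )
{
    unsigned long value = 0;
    int digits = 0;
    int c = getc( fp );

    if ( c == '0' )
    {
        digits = 1;          /* a lone "0" is already a match */
        c = getc( fp );
        if ( c == 'x' || c == 'X' )
        {
            digits = 0;      /* committed to the prefix: a digit must follow */
            c = getc( fp );
        }
    }

    while ( c != EOF && isxdigit( c ) )
    {
        value = value * 16 + (unsigned long)hex_value( c );
        ++digits;
        c = getc( fp );
    }

    if ( c != EOF )
    {
        ungetc( c, fp );     /* the single guaranteed pushback */
    }

    if ( digits == 0 )
    {
        return 0;            /* matching failure */
    }

    *out = value;
    return 1;
}
```

For the input "0xz" this sketch consumes "0x", pushes back only the "z", and reports a failure. Accepting the "0" instead would require un-reading the "x" as well, which is the second pushback the standard does not guarantee.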

沧桑㈠ 2024-08-11 23:11:55

To summarize what should happen according to the standard when parsing numbers:

  • if fscanf() succeeds, the result must be identical to the one obtained via strto*()
  • in contrast to strto*(), fscanf() fails if

    the longest sequence of input characters [...] which is, or is a prefix of, a matching input sequence

    according to the definition of fscanf() is not

    the longest initial subsequence [...] that is of the expected form

    according to the definition of strto*()

This is somewhat ugly, but a necessary consequence of the requirement that fscanf() should be greedy, but can't push back more than one character.

Some library implementers opted for differing behaviour. In my opinion:

  • letting strto*() fail to make results consistent is stupid (bad mingw)
  • pushing back more than one character so fscanf() accepts all values accepted by strto*() violates the standard, but is justified (hurray for newlib if they didn't botch strto*() :()
  • not pushing back the non-matching characters but still only parsing the ones of 'expected form' seems dubious as characters vanish into thin air (bad glibc)
月亮坠入山谷 2024-08-11 23:11:55

I am not sure I understand the question, but for one thing scanf() is supposed to handle EOF. scanf() and strtol() are different kinds of beasts. Maybe you should compare strtol() and sscanf() instead?

川水往事 2024-08-11 23:11:55

I am not sure how implementing scanf() may be related to ungetc(). scanf() can use up all bytes in the stream buffer. ungetc() simply pushes a byte to the end of buffer and the offset is also changed.

scanf("%d", &x);
ungetc('9', stdin);
scanf("%d", &y);
printf("%d, %d\n", x, y);

If the input is "100", the output is "100, 9". I do not see how scanf() and ungetc() may interfere with each other. Sorry if I added a naive comment.

萌无敌 2024-08-11 23:11:55

For the input to the scanf() functions, and also for the strtol() functions, Sec. 7.20.1.4 P7 states: if the subject sequence is empty or does not have the expected form, no conversion is performed; the value of nptr is stored in the object pointed to by endptr, provided that endptr is not a null pointer. You must also consider that the rules for parsing these tokens are the ones defined in Sec. 6.4.4 Constants, which Sec. 7.20.1.4 P5 points to.

The rest of the behavior, such as the errno value, should be implementation-specific. For example, on my FreeBSD box I got both EINVAL and ERANGE values, and the same happens under Linux, whereas the standard refers only to the ERANGE errno value.
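A caller that wants to stay independent of such implementation-specific errno values can rely only on what the standard specifies: the endptr comparison for "no conversion" and ERANGE for overflow. The following is a minimal sketch under that assumption; parse_long and the parse_status enumeration are hypothetical names introduced here for illustration.

```c
#include <errno.h>
#include <stdlib.h>

enum parse_status { PARSE_OK, PARSE_NO_CONVERSION, PARSE_OUT_OF_RANGE };

/* Hypothetical helper: wraps strtol() and classifies the outcome using
 * only the endptr comparison and ERANGE. If rest is not NULL, it
 * receives the pointer to the first unconverted character. */
enum parse_status parse_long( const char * s, int base, long * out, char ** rest )
{
    char * end = NULL;
    long value;

    errno = 0;
    value = strtol( s, &end, base );

    if ( rest != NULL )
    {
        *rest = end;
    }

    if ( end == s )
    {
        return PARSE_NO_CONVERSION;   /* endptr equals nptr: nothing was converted */
    }

    if ( errno == ERANGE )
    {
        return PARSE_OUT_OF_RANGE;    /* value was clamped to LONG_MIN or LONG_MAX */
    }

    *out = value;
    return PARSE_OK;
}
```

Checking end == s before looking at errno means an implementation that sets EINVAL when no conversion happens is classified the same way as one that leaves errno alone.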

那小子欠揍 2024-08-11 23:11:55

Answer obsolete after rewrite of question. Some interesting links in the comments though.


If in doubt, write a test. -- proverb

After testing all combinations of conversion specifiers and input variations I could think of, I can say that it is correct that the two function families do not give identical results. (At least in glibc, which is what I have available for testing.)

The difference appears when three circumstances meet:

  1. You use "%i" or "%x" (allowing hexadecimal input).
  2. Input contains the (optional) "0x" hexadecimal prefix.
  3. There is no valid hexadecimal digit following the hexadecimal prefix.

Example code:

#include <stdio.h>
#include <stdlib.h>

int main( void )
{
    const char * string = "0xz";
    unsigned u = 0;
    int count = -1;
    char c = '?';
    char * endptr;

    /* %x expects an unsigned int *, so the result goes into u */
    sscanf( string, "%x%n%c", &u, &count, &c );
    printf( "Value: %u - Consumed: %d - Next char: %c - (sscanf())\n", u, count, c );
    u = (unsigned)strtoul( string, &endptr, 16 );
    printf( "Value: %u - Consumed: %td - Next char: %c - (strtoul())\n", u, ( endptr - string ), *endptr );
    return 0;
}

Output:

Value: 0 - Consumed: 1 - Next char: x - (sscanf())
Value: 0 - Consumed: 0 - Next char: 0 - (strtoul())

This confuses me. Obviously sscanf() does not bail out at the 'x', or it wouldn't be able to parse any "0x" prefixed hexadecimals. So it has read the 'z' and found it non-matching. But it decides to use only the leading "0" as value. That would mean pushing the 'z' and the 'x' back. (Yes I know that sscanf(), which I used here for easy testing, does not operate on a stream, but I strongly assume they made all ...scanf() functions behave identically for consistency.)

So... one-char ungetc() doesn't really seem to be the reason here... ?:-/

Yes, results differ. I still cannot explain it properly, though... :-(
