Difference between scanf() and strtol() / strtod() in parsing numbers
Note: I completely reworked the question to more properly reflect what I am setting the bounty for. Please excuse any inconsistencies with already-given answers this might have created. I did not want to create a new question, as previous answers to this one might be helpful.
I am working on implementing a C standard library, and am confused about one specific corner of the standard.
The standard defines the number formats accepted by the scanf function family (%d, %i, %u, %o, %x) in terms of the definitions for strtol, strtoul, and strtod.
The standard also says that fscanf() will only put back a maximum of one character into the input stream, and that therefore some sequences accepted by strtol, strtoul and strtod are unacceptable to fscanf (ISO/IEC 9899:1999, footnote 251).
I tried to find some values that would exhibit such differences. It turns out that the hexadecimal prefix "0x", followed by a character that is not a hexadecimal digit, is one such case where the two function families differ.
Funnily enough, it turned out that no two of the C libraries I tested agree on the output. (See test program and example output at the end of this question.)
What I would like to hear is what would be considered standard-compliant behaviour when parsing "0xz", ideally citing the relevant parts of the standard to make the point.
    #include <stdio.h>
    #include <stdlib.h>

    int main( void )
    {
        int i, count, rc;
        unsigned u;
        char * endptr = NULL;
        char culprit[] = "0xz";

        /* File I/O to assert fscanf == sscanf */
        FILE * fh = fopen( "testfile", "w+" );
        if ( fh == NULL )
        {
            perror( "testfile" );
            return EXIT_FAILURE;
        }
        fprintf( fh, "%s", culprit );
        rewind( fh );

        /* fscanf base 16 */
        u = -1; count = -1;
        rc = fscanf( fh, "%x%n", &u, &count );
        /* u is cast to int so that an unassigned -1 prints as -1 */
        printf( "fscanf: Returned %d, result %2d, consumed %d\n", rc, (int)u, count );
        rewind( fh );

        /* strtoul base 16 */
        u = (unsigned)strtoul( culprit, &endptr, 16 );
        printf( "strtoul: result %2d, consumed %d\n", (int)u, (int)( endptr - culprit ) );
        puts( "" );

        /* fscanf base 0 */
        i = -1; count = -1;
        rc = fscanf( fh, "%i%n", &i, &count );
        printf( "fscanf: Returned %d, result %2d, consumed %d\n", rc, i, count );
        rewind( fh );

        /* strtol base 0 (label left as "strtoul" to match the recorded output) */
        i = (int)strtol( culprit, &endptr, 0 );
        printf( "strtoul: result %2d, consumed %d\n", i, (int)( endptr - culprit ) );
        fclose( fh );
        return 0;
    }
/* newlib 1.14
fscanf: Returned 1, result 0, consumed 1
strtoul: result 0, consumed 0
fscanf: Returned 1, result 0, consumed 1
strtoul: result 0, consumed 0
*/
/* glibc-2.8
fscanf: Returned 1, result 0, consumed 2
strtoul: result 0, consumed 1
fscanf: Returned 1, result 0, consumed 2
strtoul: result 0, consumed 1
*/
/* Microsoft MSVC
fscanf: Returned 0, result -1, consumed -1
strtoul: result 0, consumed 0
fscanf: Returned 0, result 0, consumed -1
strtoul: result 0, consumed 0
*/
/* IBM AIX
fscanf: Returned 0, result -1, consumed -1
strtoul: result 0, consumed 1
fscanf: Returned 0, result 0, consumed -1
strtoul: result 0, consumed 1
*/
Communication with Fred J. Tydeman, Vice-chair of PL22.11 (ANSI "C"), on comp.std.c shed some light on this:
fscanf
The standard's definition of an input item makes "0x" the longest sequence that is a prefix of a matching input sequence. (Even with %i conversion, as the hex "0x" is a longer sequence than the decimal "0".)
This makes fscanf read the "z", and put it back as not-matching (honoring the one-character pushback limit of footnote 251).
This makes "0x" fail to match, i.e. fscanf should assign no value, return zero (if the %x or %i was the first conversion specifier), and leave "z" as the first unread character in the input stream.
strtol
The definition of strtol (and strtoul) differs in one crucial point: the subject sequence must itself be of the expected form, not merely a prefix of such a sequence.
Which means that strtol should look for the longest valid sequence, in this case the "0". It should point endptr to the "x", and return zero as result.

I don't believe the parsing is allowed to produce different results. The Plauger reference is just pointing out that the strtol() implementation might be a different, more efficient version, as it has complete access to the entire string.

According to the C99 spec, the scanf() family of functions parses integers the same way as the strto*() family of functions. For example, the conversion specifier x is specified in terms of the subject sequence expected by strtoul() with base 16.
So if sscanf() and strtoul() give different results, the libc implementation doesn't conform.
What the expected results of your sample code should be is a bit unclear, though:
strtoul() accepts an optional prefix of 0x or 0X if base is 16, and the spec defines the subject sequence as the longest initial subsequence of the input string that is of the expected form.
For the string "0xz", in my opinion the longest initial subsequence of expected form is "0", so the value should be 0 and the endptr argument should be set to point at the x.
mingw-gcc 4.4.0 disagrees and fails to parse the string with both strtoul() and sscanf(). The reasoning could be that the longest initial subsequence of expected form is "0x", which is not a valid integer literal, so no parsing is done.
I think this interpretation of the standard is wrong: a subsequence of expected form should always yield a valid integer value (if out of range, the MIN/MAX values are returned and errno is set to ERANGE).
cygwin-gcc 3.4.4 (which uses newlib as far as I know) will also not parse the literal if strtoul() is used, but parses the string according to my interpretation of the standard with sscanf().
Beware that my interpretation of the standard is prone to your initial problem, i.e. that the standard only guarantees to be able to ungetc() once. To decide if the 0x is part of the literal, you have to read ahead two characters: the x and the following character. If the latter is not a hex character, both have to be pushed back. If there are more tokens to parse, you can buffer them and work around this problem, but if it's the last token, you have to ungetc() both characters.
I'm not really sure what fscanf() should do if ungetc() fails. Maybe just set the stream's error indicator?

To summarize what should happen according to the standard when parsing numbers:
- If fscanf() succeeds, the result must be identical to the one obtained via strto*().
- In contrast to strto*(), fscanf() fails if "the longest sequence of input characters [...] which is, or is a prefix of, a matching input sequence" (the definition used for fscanf()) is not "the longest initial subsequence [...] that is of the expected form" (the definition used for strto*()).
This is somewhat ugly, but a necessary consequence of the requirement that fscanf() should be greedy, but can't push back more than one character.
Some library implementers opted for differing behaviour. In my opinion:
- Letting strto*() fail in order to make results consistent is stupid (bad mingw).
- Letting fscanf() accept all values accepted by strto*() violates the standard, but is justified (hurray for newlib if they didn't botch strto*() :( ).
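The "greedy, but only one character of pushback" rule discussed in the answers above can be made concrete with a small sketch. Both functions below are hypothetical helpers written for illustration, not part of any C library: scan_hex_one_pushback reads a hexadecimal input item from a stream using at most one ungetc() call, and strtoul_hex_consumed reports how far strtoul() got on a string. Sign handling, leading white space and overflow checks are left out on purpose.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads an unsigned hex input item from fp using
 * at most ONE ungetc() call. Returns 1 and stores the value on success,
 * 0 on a matching failure. */
int scan_hex_one_pushback(FILE *fp, unsigned long *out)
{
    unsigned long value = 0;
    int digits = 0;
    int c = getc(fp);

    if (c == '0') {
        digits = 1;                 /* "0" alone is a valid item so far */
        c = getc(fp);
        if (c == 'x' || c == 'X') {
            digits = 0;             /* "0x" is only a PREFIX of a match */
            c = getc(fp);
        }
    }
    while (c != EOF && isxdigit(c)) {
        /* assumes 'a'..'f' are contiguous, as in ASCII */
        int d = isdigit(c) ? c - '0' : tolower(c) - 'a' + 10;
        value = value * 16 + (unsigned long)d;
        digits++;
        c = getc(fp);
    }
    if (c != EOF)
        ungetc(c, fp);              /* the single guaranteed pushback */
    if (digits == 0)
        return 0;                   /* e.g. "0xz": the "0x" stays consumed */
    *out = value;
    return 1;
}

/* Hypothetical helper: value and number of characters strtoul() consumes
 * when it can see the whole string at once (base 16). */
size_t strtoul_hex_consumed(const char *s, unsigned long *out)
{
    char *end;
    *out = strtoul(s, &end, 16);
    return (size_t)(end - s);
}
```

On a stream holding "0xz", the first function reports a matching failure and leaves "z" as the next unread character, because the already consumed "0x" cannot be returned with a single pushback. The second function has no such limit; what it reports for "0xz" depends on the library, as the outputs in the question show.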
I am not sure I understand the question, but for one thing scanf() is supposed to handle EOF. scanf() and strtol() are different kinds of beasts. Maybe you should compare strtol() and sscanf() instead?
I am not sure how implementing scanf() may be related to ungetc(). scanf() can use up all bytes in the stream buffer. ungetc() simply pushes a byte to the end of buffer and the offset is also changed.
If the input is "100", the output is "100, 9". I do not see how scanf() and ungetc() may interfere with each other. Sorry if I added a naive comment.
For the input to the scanf() functions and also for the strtol() functions, Sec. 7.20.1.4 P7 says: "If the subject sequence is empty or does not have the expected form, no conversion is performed; the value of nptr is stored in the object pointed to by endptr, provided that endptr is not a null pointer." You must also consider that these tokens are parsed according to the rules of Sec. 6.4.4 (Constants), which Sec. 7.20.1.4 P5 points to.
The rest of the behavior, such as the errno value, should be implementation specific. For example, on my FreeBSD box I got both EINVAL and ERANGE values, and the same happens under Linux, whereas the standard refers only to the ERANGE errno value.
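The rule quoted above (no conversion leaves endptr equal to nptr) is what lets a caller tell "parsed a zero" apart from "parsed nothing" without relying on EINVAL, which this answer observed on FreeBSD and Linux but which C99 does not require. A minimal sketch, assuming a hypothetical wrapper named parse_long and made-up status codes that are not part of the standard library:

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical status codes for this sketch. */
enum parse_status { PARSE_OK, PARSE_NO_DIGITS, PARSE_OUT_OF_RANGE };

/* Wraps strtol(): reports whether any conversion happened and whether
 * the value was out of range. *rest receives a pointer to the first
 * character that was not consumed. */
enum parse_status parse_long(const char *s, int base, long *out,
                             const char **rest)
{
    char *end;
    long value;

    errno = 0;                       /* strtol never resets errno itself */
    value = strtol(s, &end, base);

    *rest = end;
    if (end == s)
        return PARSE_NO_DIGITS;      /* endptr == nptr: nothing converted */
    if (errno == ERANGE)
        return PARSE_OUT_OF_RANGE;   /* strtol returned LONG_MIN or LONG_MAX */
    *out = value;
    return PARSE_OK;
}
```

The check on end comes first so that the result does not depend on whether a particular library sets errno when no conversion is performed.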
Answer obsolete after rewrite of question. Some interesting links in the comments though.
After testing all combinations of conversion specifiers and input variations I could think of, I can say that it is correct that the two function families do not give identical results. (At least in glibc, which is what I have available for testing.)
The difference appears when three circumstances meet:
1. The conversion specifier is "%i" or "%x" (allowing hexadecimal input).
2. The input contains the "0x" hexadecimal prefix.
3. The prefix is not followed by a valid hexadecimal digit.
Example code:
Output:
This confuses me. Obviously sscanf() does not bail out at the 'x', or it wouldn't be able to parse any "0x" prefixed hexadecimals. So it has read the 'z' and found it non-matching. But it decides to use only the leading "0" as value. That would mean pushing the 'z' and the 'x' back. (Yes, I know that sscanf(), which I used here for easy testing, does not operate on a stream, but I strongly assume they made all ...scanf() functions behave identically for consistency.)
So... one-char ungetc() doesn't really seem to be the reason, here...? :-/
Yes, results differ. I still cannot explain it properly, though... :-(
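The example code and output of this answer did not survive in the text above. The following is not that code, only a minimal sketch of the kind of side-by-side comparison the answer describes; the struct hex_probe and the functions probe_hex and print_probe are names made up for illustration.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical record of what each function family did with one input. */
struct hex_probe {
    int scanf_rc;              /* return value of sscanf() */
    int scanf_consumed;        /* %n result, -1 if never stored */
    unsigned scanf_value;      /* value stored by %x, if any */
    unsigned long strto_value; /* return value of strtoul() */
    int strto_consumed;        /* endptr - input */
};

/* Runs sscanf("%x%n") and strtoul(..., 16) on the same string. */
struct hex_probe probe_hex(const char *input)
{
    struct hex_probe p;
    char *end;

    p.scanf_value = 0;
    p.scanf_consumed = -1;
    p.scanf_rc = sscanf(input, "%x%n", &p.scanf_value, &p.scanf_consumed);

    p.strto_value = strtoul(input, &end, 16);
    p.strto_consumed = (int)(end - input);
    return p;
}

/* Prints one comparison line per input string. */
void print_probe(const char *input)
{
    struct hex_probe p = probe_hex(input);
    printf("\"%s\": sscanf rc=%d value=%u consumed=%d | "
           "strtoul value=%lu consumed=%d\n",
           input, p.scanf_rc, p.scanf_value, p.scanf_consumed,
           p.strto_value, p.strto_consumed);
}
```

Calling print_probe("0x1f") and print_probe("0xz") one after the other gives a quick check for a given library: for "0x1f" both families should report the value 31 with four characters consumed, while the "0xz" line is the one on which libraries differ.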