Confusing language in the specification of strtol, et al.

Posted 2024-11-23 22:41:33

The specification for strtol conceptually divides the input string into "initial whitespace", a "subject sequence", and a "final string", and defines the "subject sequence" as:

the longest initial subsequence of the input string, starting with the first non-white-space character that is of the expected form. The subject sequence shall contain no characters if the input string is empty or consists entirely of white-space characters, or if the first non-white-space character is other than a sign or a permissible letter or digit.

At one time I thought the "longest initial subsequence" business was akin to the way scanf works, where "0x@" would scan as "0x", a failed match, followed by "@" as the next unread character. However, after some discussion, I'm mostly convinced that strtol processes the longest initial subsequence that is of the expected form, not the longest initial string which is the initial subsequence of some possible string of the expected form.

What's still confusing me is this language in the specification:

If the subject sequence is empty or does not have the expected form, no conversion is performed; the value of str is stored in the object pointed to by endptr, provided that endptr is not a null pointer.

If we accept what seems to be the correct definition of "subject sequence", there is no such thing as a non-empty subject sequence that does not have the expected form, and instead (to avoid redundancy and confusion) the text should just read:

If the subject sequence is empty, no conversion is performed; the value of str is stored in the object pointed to by endptr, provided that endptr is not a null pointer.

Can anyone clarify these issues for me? Perhaps a link to past discussions or any relevant defect reports would be useful.

Comments (4)

绝對不後悔。 2024-11-30 22:41:33

I think the C99 language is quite clear:

The subject sequence is defined as the longest initial subsequence of
the input string, starting with the first non-white-space character,
that is of the expected form.

Given "0x@", "0x@" is not of the expected form; "0x" is not of the expected form; therefore "0" is the longest initial subsequence that is of the expected form.

I agree that this implies that you cannot have a non-empty subject sequence that isn't of the expected form - unless you interpret the following:

In other than the "C" locale, additional locale-specific subject
sequence forms may be accepted.

...as allowing a locale to define other possible forms that the subject sequence might have, that are nonetheless not of "the expected form".

The wording in the final paragraph seems to be just "belt-and-braces".

放肆 2024-11-30 22:41:33

It might be easier to understand if you start at §7.20.1.4 (The strtol, strtoll, strtoul, and strtoull functions) ¶2 of the C99 standard, instead of ¶4:

¶2 The strtol, strtoll, strtoul, and strtoull functions convert the initial
portion of the string pointed to by nptr to long int, long long int, unsigned
long int, and unsigned long long int representation, respectively. First,
they decompose the input string into three parts: an initial, possibly empty, sequence of
white-space characters (as specified by the isspace function), a subject sequence
resembling an integer represented in some radix determined by the value of base, and a
final string of one or more unrecognized characters, including the terminating null
character of the input string. Then, they attempt to convert the subject sequence to an
integer, and return the result.

¶3 If the value of base is zero, the expected form of the subject sequence is that of an
integer constant as described in 6.4.4.1, optionally preceded by a plus or minus sign, but
not including an integer suffix. If the value of base is between 2 and 36 (inclusive), the
expected form of the subject sequence is a sequence of letters and digits representing an
integer with the radix specified by base, optionally preceded by a plus or minus sign,
but not including an integer suffix. The letters from a (or A) through z (or Z) are
ascribed the values 10 through 35; only letters and digits whose ascribed values are less
than that of base are permitted. If the value of base is 16, the characters 0x or 0X may
optionally precede the sequence of letters and digits, following the sign if present.

¶4 The subject sequence is defined as the longest initial subsequence of the input string, ...

In particular, ¶3 clarifies what a subject sequence is.

小兔几 2024-11-30 22:41:33

The POSIX spec for strtol seems to be clearer:

These functions shall convert the initial portion of the string pointed to by str to a type long and long long representation, respectively. First, they decompose the input string into three parts:

  1. An initial, possibly empty, sequence of white-space characters (as specified by isspace())

  2. A subject sequence interpreted as an integer represented in some radix determined by the value of base

  3. A final string of one or more unrecognized characters, including the terminating NUL character of the input string.

Then they shall attempt to convert the subject sequence to an integer, and return the result.

But of course, it is not normative and "defers to the ISO C standard".

若相惜即相离 2024-11-30 22:41:33

I completely agree with your assessment: By definition, all non-empty subject sequences are of expected form, so the wording of the standard is dubious.

In case of the floating point conversion functions, there's another blunder (C99:TC3 section 7.20.1.3, §3):

[...] The subject sequence is defined as the longest initial
subsequence of the input string, starting with the first
non-white-space character, that is of the expected form. The subject
sequence contains no characters if the input string is not of the
expected form.

This implies that the whole input string must be of expected form, defeating the purpose of the endptr parameter. One could argue that the expected form for the input string is different from the expected form for the subject sequence, but it's still pretty confusing.

You are also correct that the semantics of the strto*() and *scanf() families of functions are different: If both match, they will always agree on the value and consume the same number of characters (and any libc implementation where they do not is broken, including newlib and glibc last time I checked), but *scanf() additionally fails to match cases where it would need to backtrack more than one character, as in your examples "0x@" and "1.0e+".
