Lex 的前瞻运算符算法不正确

发布于 2024-12-29 13:34:42 字数 371 浏览 5 评论 0原文

在 Andrew Appel 的“Java 中的现代编译器实现”中，他在练习中声称：

Lex 有一个先行运算符 /，因此正则表达式 abc/def 仅当后跟 def 时才匹配 abc（但 def 不是匹配字符串的一部分，而是下一个标记的一部分）。阿霍等人。 [1986] 描述了，并且 Lex [Lesk 1975] 使用了一种错误的算法来实现前瞻（它在输入 aba 的 (a|ab)/ba 上失败，在应该匹配 a 的地方匹配 ab）。 Flex [Paxson 1995] 使用了一种更好的机制，对于 (a|ab)/ba 可以正确工作，但会失败（在 zx*/xy* 上显示警告消息。设计一个更好的前瞻机制。

有谁知道他所描述的解决方案？

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

燕归巢 2025-01-05 13:34:42

“不按照我认为应该的方式工作”和“不正确”并不总是同一件事。给定输入

aba

和模式，

(ab|a)/ab

(ab|a) 贪婪匹配，然后单独应用 /ab 约束是有一定意义的。您认为它应该像这个正则表达式一样工作：

(ab|a)(ab)

约束条件是 (ab) 匹配的部分不被消耗。这可能更好，因为它消除了一些限制，但由于在编写 lex 时没有任何外部要求，因此您不能称行为正确或不正确。

天真的方法的优点是添加尾随上下文不会改变标记的含义，而只是添加一个关于其后面可能包含的内容的完全独立的约束。但这确实会导致限制/惊喜：

 {IDENT}  /* original code */

 {IDENT}/ab   /* ident, only when followed by ab */

哎呀，它不起作用，因为“ab”被吞入 IDENT 中，正是因为它的含义没有被尾随上下文改变。这变成了一个限制，但也许这是作者愿意忍受的一个限制，以换取简单性。（无论如何，使它更具上下文的用例是什么？）

另一种方式怎么样？这也可能会带来惊喜：

 {IDENT}/ab  /* input is bracadabra:123 */

假设用户希望不匹配，因为 bracadabra 不是后跟（或以 ab 结尾）的标识符。但 {IDENT}/ab 将匹配 bracad，然后将 abra:123 保留在输入中。

无论您如何确定语义，用户的期望都可能会落空。

lex 现在已由 Single Unix 规范标准化，该规范表示：

r/x
仅当正则表达式 r 后面出现正则表达式 x（x 是尾随上下文的实例，下面进一步定义）时，才应匹配正则表达式 r。 yytext 中返回的标记只能与 r 匹配。如果 r 的尾部部分与 x 的开头匹配，则结果未指定。 r 表达式不能包含进一步的尾随上下文或“$”（匹配行尾）运算符； x 不能包含“^”（匹配行开头）运算符，也不能包含尾随上下文，也不能包含“$”运算符。也就是说，lex 正则表达式中只允许出现一次尾随上下文，并且“^”运算符只能用在此类表达式的开头。

所以你可以看到这里有解释的空间。 r 和 x 可以被视为单独的正则表达式，以正常方式计算 r 的匹配，就好像它是单独的一样，然后将 x 作为特殊约束应用。

该规范也对这个问题进行了讨论（你很幸运）：

以下示例阐明了 lex 正则表达式与 IEEE Std 1003.1-2001 本卷中其他地方出现的正则表达式之间的差异。对于“r/x”形式的正则表达式，总是返回匹配 r 的字符串；当 x 的开头与 r 的尾部部分匹配时，可能会出现混淆。例如，给定正则表达式“a*b/cc”和输入“aaabcc”，yytext 将在此匹配项中包含字符串“aaab”。但是给定正则表达式“x*/xy”和输入“xxxy”，某些实现会返回标记 xxx，而不是 xx，因为 xxx 与“x*”匹配。
在规则“ab*/bc”中，r 末尾的“b*”将 r 的匹配扩展到尾随上下文的开头，因此结果未指定。但是，如果该规则是“ab/bc”，则当该规则后跟文本“bc”时，该规则将匹配文本“ab”。在后一种情况下，r 的匹配不能扩展到 x 的开头，因此结果是指定的。
如您所见，此功能存在一些限制。

未指定的行为意味着对于行为应该是什么有一些选择，其中没有一个比其他更正确（如果您希望 lex 程序可移植，就不要编写这样的模式）。 “如您所见，此功能存在一些限制”。

"Does not work how I think it should" and "incorrect" are, not always the same thing. Given the input

aba

and the pattern

(ab|a)/ab

it makes a certain amount of sense for the (ab|a) to match greedily, and then for the /ab constraint to be applied separately. You're thinking that it should work like this regular expression:

(ab|a)(ab)

with the constraint that the part matched by (ab) is not consumed. That's probably better because it removes some limitations, but since there weren't any external requirements for what lex should do at the time it was written, you cannot call either behavior correct or incorrect.

The naive way has the merit that adding a trailing context doesn't change the meaning of a token, but simply adds a totally separate constraint about what may follow it. But that does lead to limitations/surprises:

 {IDENT}  /* original code */

 {IDENT}/ab   /* ident, only when followed by ab */

Oops, it won't work because "ab" is swallowed into IDENT precisely because its meaning was not changed by the trailing context. That turns into a limitation, but maybe it's a limitation that the author was willing to live with in exchange for simplicity. (What is the use case for making it more contextual, anyway?)

How about the other way? That could have surprises also:

 {IDENT}/ab  /* input is bracadabra:123 */

Say the user wants this not to match because bracadabra is not an identifier followed by (or ending in) ab. But {IDENT}/ab will match bracad and then, leaving abra:123 in the input.

A user could have expectations which are foiled no matter how you pin down the semantics.

lex is now standardized by The Single Unix specification, which says this:

r/x
The regular expression r shall be matched only if it is followed by an occurrence of regular expression x ( x is the instance of trailing context, further defined below). The token returned in yytext shall only match r. If the trailing portion of r matches the beginning of x, the result is unspecified. The r expression cannot include further trailing context or the '$' (match-end-of-line) operator; x cannot include the '^' (match-beginning-of-line) operator, nor trailing context, nor the '$' operator. That is, only one occurrence of trailing context is allowed in a lex regular expression, and the '^' operator only can be used at the beginning of such an expression.

So you can see that there is room for interpretation here. The r and x can be treated as separate regexes, with a match for r computed in the normal way as if it were alone, and then x applied as a special constraint.

The spec also has discussion about this very issue (you are in luck):

The following examples clarify the differences between lex regular expressions and regular expressions appearing elsewhere in this volume of IEEE Std 1003.1-2001. For regular expressions of the form "r/x", the string matching r is always returned; confusion may arise when the beginning of x matches the trailing portion of r. For example, given the regular expression "a*b/cc" and the input "aaabcc", yytext would contain the string "aaab" on this match. But given the regular expression "x*/xy" and the input "xxxy", the token xxx, not xx, is returned by some implementations because xxx matches "x*".
In the rule "ab*/bc", the "b*" at the end of r extends r's match into the beginning of the trailing context, so the result is unspecified. If this rule were "ab/bc", however, the rule matches the text "ab" when it is followed by the text "bc". In this latter case, the matching of r cannot extend into the beginning of x, so the result is specified.
As you can see there are some limitations in this feature.

Unspecified behavior means that there are some choices about what the behavior should be, none of which are more correct than the others (and don't write patterns like that if you want your lex program to be portable). "As you can see, there are some limitations in this feature".

回复收藏 0 原文

~没有更多了~