Lex 的前瞻运算符算法不正确
在 Andrew Appel 的“Java 中的现代编译器实现”中,他在练习中声称:
Lex 有一个先行运算符 /,因此正则表达式 abc/def 仅当后跟 def 时才匹配 abc(但 def 不是匹配字符串的一部分,而是下一个标记的一部分)。阿霍等人。 [1986] 描述了,并且 Lex [Lesk 1975] 使用了一种错误的算法来实现前瞻(它在输入 aba 的 (a|ab)/ba 上失败,在应该匹配 a 的地方匹配 ab)。 Flex [Paxson 1995] 使用了一种更好的机制,对于 (a|ab)/ba 可以正确工作,但会失败(在 zx*/xy* 上显示警告消息。设计一个更好的前瞻机制。
有谁知道他所描述的解决方案?
In "modern compiler implementation in Java" by Andrew Appel he claims in an exercise that:
Lex has a lookahead operator / so that the regular expression abc/def matches abc only when followed by def (but def is not part of the matched string, and will be part of the next token(s)). Aho et al. [1986] describe, and Lex [Lesk 1975] uses, an incorrect algorithm for implementing lookahead (it fails on (a|ab)/ba with input aba, matching ab where it should match a). Flex [Paxson 1995] uses a better mechanism that works correctly for (a|ab)/ba but fails (with a warning message on zx*/xy*. Design a better lookahead mechanism.
Does anyone know the solution to what he is describing?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
“不按照我认为应该的方式工作”和“不正确”并不总是同一件事。给定输入
和模式,
(ab|a) 贪婪匹配,然后单独应用
/ab
约束是有一定意义的。您认为它应该像这个正则表达式一样工作:约束条件是
(ab)
匹配的部分不被消耗。这可能更好,因为它消除了一些限制,但由于在编写 lex 时没有任何外部要求,因此您不能称行为正确或不正确。天真的方法的优点是添加尾随上下文不会改变标记的含义,而只是添加一个关于其后面可能包含的内容的完全独立的约束。但这确实会导致限制/惊喜:
哎呀,它不起作用,因为“ab”被吞入 IDENT 中,正是因为它的含义没有被尾随上下文改变。这变成了一个限制,但也许这是作者愿意忍受的一个限制,以换取简单性。 (无论如何,使它更具上下文的用例是什么?)
另一种方式怎么样?这也可能会带来惊喜:
假设用户希望不匹配,因为
bracadabra
不是后跟(或以ab
结尾)的标识符。但 {IDENT}/ab 将匹配bracad
,然后将abra:123
保留在输入中。无论您如何确定语义,用户的期望都可能会落空。
lex
现在已由 Single Unix 规范标准化,该规范表示:所以你可以看到这里有解释的空间。 r 和 x 可以被视为单独的正则表达式,以正常方式计算 r 的匹配,就好像它是单独的一样,然后将 x 作为特殊约束应用。
该规范也对这个问题进行了讨论(你很幸运):
未指定的行为意味着对于行为应该是什么有一些选择,其中没有一个比其他更正确(如果您希望 lex 程序可移植,就不要编写这样的模式)。 “如您所见,此功能存在一些限制”。
"Does not work how I think it should" and "incorrect" are, not always the same thing. Given the input
and the pattern
it makes a certain amount of sense for the (ab|a) to match greedily, and then for the
/ab
constraint to be applied separately. You're thinking that it should work like this regular expression:with the constraint that the part matched by
(ab)
is not consumed. That's probably better because it removes some limitations, but since there weren't any external requirements for whatlex
should do at the time it was written, you cannot call either behavior correct or incorrect.The naive way has the merit that adding a trailing context doesn't change the meaning of a token, but simply adds a totally separate constraint about what may follow it. But that does lead to limitations/surprises:
Oops, it won't work because "ab" is swallowed into IDENT precisely because its meaning was not changed by the trailing context. That turns into a limitation, but maybe it's a limitation that the author was willing to live with in exchange for simplicity. (What is the use case for making it more contextual, anyway?)
How about the other way? That could have surprises also:
Say the user wants this not to match because
bracadabra
is not an identifier followed by (or ending in)ab
. But {IDENT}/ab will matchbracad
and then, leavingabra:123
in the input.A user could have expectations which are foiled no matter how you pin down the semantics.
lex
is now standardized by The Single Unix specification, which says this:So you can see that there is room for interpretation here. The r and x can be treated as separate regexes, with a match for r computed in the normal way as if it were alone, and then x applied as a special constraint.
The spec also has discussion about this very issue (you are in luck):
Unspecified behavior means that there are some choices about what the behavior should be, none of which are more correct than the others (and don't write patterns like that if you want your lex program to be portable). "As you can see, there are some limitations in this feature".