如何解析上下文相关的 C 代码?
我遇到的一个问题是 C 必须是上下文相关的,并且不能用一个先行标记来解析。例如
int main1;
int main() {}
,这是我能想到的最简单的示例,其中函数定义和变量声明都以相同的标记类型开头。您必须一直查看左括号或分号才能确定要解析的内容。
我的问题是,这是如何实现的?词法分析器是否有一些技巧来进行前瞻并发出一个看不见的标记来区分两者?现代解析有很多向前看的标记吗?
One issue I ran into was that C must be context-sensitive and cannot be parsed with one token of lookahead. For example
int main1;
int main() {}
That's the simplest example I can think of in which both a function definition and a variable declaration start with the same token type. You'd have to look all the way ahead to the left paren or semicolon to determine what to parse.
My question is, how is this accomplished? Does the lexical analyzer have some tricks up its sleeve to do the lookahead and emit an invisible token distinguishing the two? Do modern parses have many tokens of lookahead?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
您应该阅读 LR 或移位归约解析器。他们自下而上地组装解析树。对于
main
函数来说,它是这样的:int
作为 TYPE 终端标记移入堆栈main
作为 IDENTIFIER 移入堆栈终端令牌(
移入堆栈)
移入堆栈(
和)
并替换为 ARGLIST非终端标记{
移入堆栈 将}
移入堆栈当然,每次进行替换时,它都会构建一个新的解析树片段并将其附加到新标记上。
(我编了这些令牌名称。)
它在有限状态机的控制下工作,该状态机识别堆栈上令牌的模式,并与输入的下一个(单个)令牌一起决定是否移入下一个令牌,或者应用其中一个语法规则将堆栈上的一组标记减少为单个标记。 FSM 由解析器生成器根据语法规则列表构建。
之所以称为 LR,是因为它从左侧读取输入标记,但从右侧应用语法规则。它与 LL(即递归下降)不同,后者从左侧应用语法规则。 Pascal 是一种 LL(1) 语言。 C 不是 LL(1),因此需要 LR(1) 解析器。
例如,它允许 C 将几乎所有内容嵌入嵌套括号中,而不会混淆解析器。
我希望这可以帮助您了解发生了什么。
You should read up on LR or shift-reduce parsers. They assemble the parse tree bottom-up. In the case of the
main
function it goes like:int
into the stack as a TYPE terminal tokenmain
into the stack as an IDENTIFIER terminal token(
into the stack)
into the stack(
and)
and replace with an ARGLIST non-terminal token{
into the stack}
into the stackOf course, every time it does a replacement, it builds a new fragment of parse tree and attaches it to the new token.
(I made up those token names.)
It works under control of a finite state machine that recognizes the pattern of tokens on the stack and, together with the next (single) token of input, decides whether to shift the next token in, or apply one of the grammar rules to reduce a group of tokens on the stack into a single one. The FSM is built by the parser generator from a list of the grammar rules.
It is called LR because it reads input tokens from the left, but applies grammar rules from the right. It is distinct from LL, or recursive-descent, which applies grammar rules from the left. Pascal is an LL(1) language. C is not LL(1), so requires an LR(1) parser.
It allows C, for example, to embed almost anything in nested parentheses without confusing the parser.
I hope that helps you see what is going on.