如何在编译器中实现前向引用?
我正在使用 Lex 和 YACC(实际上是 Flex 和 Bison)创建一个编译器。 该语言允许对任何符号进行无限制的前向引用(如 C#)。 问题在于,如果不知道标识符是什么,就不可能解析该语言。
我知道的唯一解决方案是对整个源进行 lex 分析,然后进行“广度优先”解析,因此类声明和函数声明等更高级别的内容在使用它们的函数之前得到解析。 然而,这对于大文件来说会占用大量内存,并且很难用 YACC 处理(我必须为每种类型的声明/主体创建单独的语法)。 我还必须手写词法分析器(这不是什么大问题)。
我不太关心效率(尽管它仍然很重要),因为一旦完成,我将重写编译器本身,但我希望该版本能够很快(所以如果有任何快速的通用Lex/YACC 中无法完成但可以手动完成的技术,也请提出建议)。 所以现在,开发的便利性是最重要的因素。
对于这个问题有什么好的解决办法吗? 在 C# 或 Java 等语言的编译器中,这通常是如何完成的?
I'm creating a compiler with Lex and YACC (actually Flex and Bison). The language allows unlimited forward references to any symbol (like C#). The problem is that it's impossible to parse the language without knowing what an identifier is.
The only solution I know of is to lex the entire source, and then do a "breadth-first" parse, so higher level things like class declarations and function declarations get parsed before the functions that use them. However, this would take a large amount of memory for big files, and it would be hard to handle with YACC (I would have to create separate grammars for each type of declaration/body). I would also have to hand-write the lexer (which is not that much of a problem).
I don't care a whole lot about efficiency (although it still is important), because I'm going to rewrite the compiler in itself once I finish it, but I want that version to be fast (so if there are any fast general techniques that can't be done in Lex/YACC but can be done by hand, please suggest them also). So right now, ease of development is the most important factor.
Are there any good solutions to this problem? How is this usually done in compilers for languages like C# or Java?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
完全可以解析它。 尽管标识符和关键字之间存在歧义,但 lex 会很乐意通过给予关键字优先级来解决这个问题。
我看不出还有什么其他问题。 您不需要在解析阶段确定标识符是否有效。 在解析时,您正在构建解析树或抽象语法树(区别很微妙,但与本讨论的目的无关)。 之后,您可以通过对解析期间生成的 AST 执行传递来构建嵌套符号表结构。 然后,您再次传递 AST 以检查所使用的标识符是否有效。 接下来对 AST 进行一个或多个附加解析以生成输出代码或其他一些中间数据结构,然后就完成了!
编辑:如果您想了解它是如何完成的,请检查 Mono C# 编译器的源代码。 这实际上是用 C# 而不是 C 或 C++ 编写的,但它确实使用了 Jay 的 .NET 端口,这与 yacc 非常相似。
It's entirely possible to parse it. Although there is an ambiguity between identifiers and keywords, lex will happily cope with that by giving the keywords priority.
I don't see what other problems there are. You don't need to determine if identifiers are valid during the parsing stage. You are constructing either a parse tree or an abstract syntax tree (the difference is subtle, but irrelevant for the purposes of this discussion) as you parse. After that you build your nested symbol table structures by performing a pass over the AST you generated during the parse. Then you do another pass over the AST to check that identifiers used are valid. Follow this with one or more additional parses over the AST to generate the output code, or some other intermediate datastructure and you're done!
EDIT: If you want to see how it's done, check the source code for the Mono C# compiler. This is actually written in C# rather than C or C++, but it does use .NET port of Jay which is very similar to yacc.
一种选择是通过扫描和缓存令牌来处理前向引用,直到您遇到您知道如何处理的东西(有点像“恐慌模式”错误恢复)。 一旦你运行了完整的文件,请返回并尝试重新解析之前未解析的位。
至于必须手写词法分析器; 不,使用 lex 生成一个普通的解析器,然后通过手写的垫片读取它,这样您就可以返回并从缓存中提供解析器以及 lex 所做的内容。
至于制作几种语法,在 yacc 文件上使用预处理器有点有趣,您应该能够使用相同的原始源来制作它们
One option is to deal with forward references by just scanning and caching tokens till you hit something you know how to real with (sort of like "panic-mode" error recovery). Once you have run thought the full file, go back and try to re parse the bits that didn't parse before.
As to having to hand write the lexer; don't, use lex to generate a normal parser and just read from it via a hand written shim that lets you go back and feed the parser from a cache as well as what lex makes.
As to making several grammars, a little fun with a preprocessor on the yacc file and you should be able to make them all out of the same original source