Hands down, the best way to do any parsing is ANTLR. Its author has written two great books on the subject that are must-haves: The Definitive ANTLR Reference: Building Domain-Specific Languages, and Language Implementation Patterns. Both are invaluable resources. ANTLR can generate processing code in lots of different languages.
Since you are going to use already-written grammars and regular expressions, your choice of tool matters very little.
You can go with flex/bison, for which you will find many grammars already written. Otherwise you can go with ANTLR, which should work with C, C++ and Java without problems, and the same applies to it: ready-made grammars are available.
You didn't say which language you are going to use for this work, so suggesting a better approach is not so easy.
Keep in mind that every language has its own features. For example, symbol tables are built differently for Ruby than for C++, because declarations can be stricter or looser, and so on. So think carefully about what you are going to need (you can explain it in your question too, so I can give better help).
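To make the "stricter or looser declarations" point concrete, here is a minimal sketch in Python (my choice, since the question doesn't name an implementation language). The `SymbolTable` class and its `strict` flag are made up for illustration. Strict mode roughly mimics a C++-like rule where a name must be declared before use, and loose mode mimics a Ruby-like rule where the first assignment creates the variable.

```python
class SymbolTable:
    """One scope of a symbol table; `parent` links to the enclosing scope."""

    def __init__(self, strict, parent=None):
        self.strict = strict      # True: names must be declared before use
        self.parent = parent
        self.symbols = {}

    def declare(self, name, info):
        # Explicit declaration, e.g. `int x;` in a C-like language.
        self.symbols[name] = info

    def is_declared(self, name):
        # Search this scope first, then every enclosing scope.
        scope = self
        while scope is not None:
            if name in scope.symbols:
                return True
            scope = scope.parent
        return False

    def assign(self, name):
        if self.is_declared(name):
            return
        if self.strict:
            raise NameError(f"{name!r} assigned before declaration")
        # Loose mode: the first assignment implicitly declares the name.
        self.symbols[name] = "implicit"
```

The same front end can then be reused with `SymbolTable(strict=True)` for one language and `SymbolTable(strict=False)` for another, but real languages differ in many more ways than this single flag captures.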
About your two phases, I can say this:
Tokenizing is quite simple: it doesn't require different structures for every language and can easily be extended to support a plethora of programming languages.
Parsing can be more difficult. You have to build an abstract syntax tree (AST) of the program and then do whatever you want with it. If you like to do it OOP-style, you'll need a class for every node type, but node types differ between languages because the languages are structurally different, so building something general and easily extendable to other languages is quite tricky.
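As a rough illustration of why the tokenizing phase extends easily, here is a small table-driven tokenizer sketched in Python. The names `Token`, `C_SUBSET`, `C_KEYWORDS` and `tokenize` are invented for this example, and the token table covers only a tiny C-like subset. The idea is that supporting another language means supplying another table, while the scanning loop stays the same.

```python
import re
from typing import NamedTuple


class Token(NamedTuple):
    kind: str
    text: str


# Hypothetical token table for a tiny C-like subset. Order matters:
# earlier entries win when several patterns could match.
C_SUBSET = [
    ("NUMBER", r"\d+"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("OP", r"[-+*/=;(){}]"),
    ("SKIP", r"\s+"),
]
C_KEYWORDS = {"int", "return", "if", "else", "while"}


def tokenize(source, spec, keywords):
    """Yield tokens from `source` using a language-specific table `spec`."""
    master = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in spec))
    pos = 0
    while pos < len(source):
        match = master.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        pos = match.end()
        kind = match.lastgroup
        if kind == "SKIP":
            continue  # whitespace produces no token
        text = match.group()
        if kind == "IDENT" and text in keywords:
            kind = "KEYWORD"
        yield Token(kind, text)
```

For example, `list(tokenize("int x = 42;", C_SUBSET, C_KEYWORDS))` gives a `KEYWORD`, an `IDENT`, an `OP`, a `NUMBER` and a final `OP` token. A real C lexer also needs comments, string literals, multi-character operators and preprocessor handling, which this sketch leaves out.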
On this point ANTLR wins over flex and bison, because it can generate the AST automatically (if I remember correctly).
The main difference between these two compiler-compilers is that ANTLR generates an LL(k) parser (top-down) while Bison generates an LALR(1) parser (bottom-up), but if you use already-written grammars that shouldn't cause much difficulty.
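If "top-down" sounds abstract, a hand-written recursive descent parser shows the idea: one function per grammar rule, each deciding what to do by looking at the next token. The sketch below is in Python and is not ANTLR output. The `Num`, `BinOp` and `Parser` names are made up, and the grammar is just arithmetic expressions. A bottom-up LALR(1) parser like the ones Bison generates works the other way around, shifting tokens onto a stack and reducing them to rules using generated tables, which is hard to write by hand.

```python
import re
from dataclasses import dataclass


@dataclass
class Num:
    value: int


@dataclass
class BinOp:
    op: str
    left: object
    right: object


class Parser:
    """Top-down parser with one token of lookahead for the grammar:
        expr   : term (('+' | '-') term)*
        term   : factor (('*' | '/') factor)*
        factor : NUMBER | '(' expr ')'
    """

    def __init__(self, text):
        # Crude tokenizer: characters outside this pattern are silently skipped.
        self.tokens = re.findall(r"\d+|[-+*/()]", text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        token = self.peek()
        self.pos += 1
        return token

    def expr(self):
        node = self.term()
        while self.peek() in ("+", "-"):
            op = self.advance()
            node = BinOp(op, node, self.term())
        return node

    def term(self):
        node = self.factor()
        while self.peek() in ("*", "/"):
            op = self.advance()
            node = BinOp(op, node, self.factor())
        return node

    def factor(self):
        token = self.advance()
        if token is None:
            raise SyntaxError("unexpected end of input")
        if token == "(":
            node = self.expr()
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
            return node
        if token.isdigit():
            return Num(int(token))
        raise SyntaxError(f"unexpected token {token!r}")


def parse(text):
    parser = Parser(text)
    tree = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return tree
```

Calling `parse("1 + 2 * 3")` builds a `BinOp` for `+` whose right child is the `BinOp` for `*`, so operator precedence falls out of the way the rule functions call each other.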
A piece of personal advice: I have written many interpreters and compilers, but I never started from a fully featured language. C syntax is really big, so maybe you should start from a subset, see what you can do with the tokens and the AST, and later extend it to support the full syntax.
What language do you write your program in?
I'd go with ANTLR (and I actually do, for parsing Java). It supports a lot of languages and also has a lot of example grammars that you get for free: http://www.antlr.org/grammar/list. Unfortunately they are not necessarily perfect (the Java grammar has no AST rules), but they give you a good start, and I suppose the community is quite big for a parser generator.
The great thing about ANTLR, apart from the many language targets, is that LL(*) combined with the predicates ANTLR supports is very powerful and easy to understand, and so are the generated parsers.
By "extendable to multiple languages" I suppose you mean multiple source languages. This isn't easy, but I suppose you might have some success by translating them to ASTs that share as many common symbols as possible and writing a general tree walker that can handle the differences between those languages. But this could be quite difficult.
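Here is roughly what I mean by a shared AST plus a general tree walker, sketched in Python. Everything in it is hypothetical: the `Node` class, the `kind` vocabulary ("unit", "class", "function") and the two hand-built sample trees stand in for whatever your real parsers would produce.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                  # shared vocabulary, e.g. "unit", "class", "function"
    name: str = ""
    children: list = field(default_factory=list)
    language: str = ""         # kept so handlers can special-case a language


class Walker:
    """Dispatches on node.kind; kinds without a handler are just traversed."""

    def walk(self, node):
        handler = getattr(self, "visit_" + node.kind, self.generic_visit)
        handler(node)

    def generic_visit(self, node):
        for child in node.children:
            self.walk(child)


class FunctionCollector(Walker):
    """Example analysis: list every function, whatever the source language."""

    def __init__(self):
        self.functions = []

    def visit_function(self, node):
        self.functions.append((node.language, node.name))
        self.generic_visit(node)


# Hand-built sample trees: a C function at file level, a Java method in a class.
c_tree = Node("unit", language="c", children=[
    Node("function", "main", language="c"),
])
java_tree = Node("unit", language="java", children=[
    Node("class", "App", language="java", children=[
        Node("function", "run", language="java"),
    ]),
])
```

Walking both trees with one `FunctionCollector` collects `main` and `run`, even though only the Java tree has a "class" level. The hard part in practice is the mapping step that decides which constructs of each language count as the same `kind`.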
Be warned, though, that the online documentation is only useful once you've read the official ANTLR book and understand LL(*) and semantic and syntactic predicates.
You didn't specify a language, so I'll just recommend this little gem I found the other day:
http://irony.codeplex.com/
It's super simple to use, and even has pre-built grammars for several languages (even C#). There's also pyparsing (http://pyparsing.wikispaces.com/) if you want to write your tool in Python.
One door to go through is Eclipse. It has parsing, including error-tolerant parsing, for a variety of languages. Eclipse has an internal modularity that lets you use this functionality without touching the IDE.