Best practices for writing a programming language parser
Are there any best practices that I should follow while writing a parser?
The received wisdom is to use parser generators + grammars and it seems like good advice, because you are using a rigorous tool and presumably reducing effort and potential for bugs in doing so.
To use a parser generator, the grammar has to be context-free. If you are designing the language to be parsed, you can control this. If you are not sure, it could cost you a lot of effort if you start down the grammar route. Even if the language is context-free in practice, unless the grammar is enormous, it can be simpler to hand-code a recursive descent parser.
Being context-free not only makes a parser generator possible, it also makes hand-coded parsers a lot simpler. What you end up with is one (or two) functions per phrase. If you organise and name the code cleanly, this is not much harder to read than a grammar (if your IDE can show you call hierarchies, you can pretty much see what the grammar is).
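To make the "one function per phrase" idea concrete, here is a minimal sketch of a hand-coded recursive descent parser. The arithmetic grammar, the `Parser` class and the `parse` helper are illustrative assumptions, not something taken from the answer above; the sketch evaluates the expression directly instead of building a tree.

```python
class Parser:
    """Hand-coded recursive descent parser: one method per grammar rule.

    Illustrative grammar (an assumption for this sketch):
        expr   := term (('+' | '-') term)*
        term   := factor (('*' | '/') factor)*
        factor := NUMBER | '(' expr ')'
    """

    def __init__(self, text):
        # Simplification: spaces are dropped up front, so error positions
        # refer to the whitespace-stripped text.
        self.text = text.replace(" ", "")
        self.pos = 0

    def peek(self):
        # Current character, or "" at end of input.
        return self.text[self.pos] if self.pos < len(self.text) else ""

    def expect(self, ch):
        if self.peek() != ch:
            raise SyntaxError(f"expected {ch!r} at position {self.pos}")
        self.pos += 1

    def expr(self):
        value = self.term()
        while self.peek() in ("+", "-"):
            op = self.peek()
            self.pos += 1
            rhs = self.term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def term(self):
        value = self.factor()
        while self.peek() in ("*", "/"):
            op = self.peek()
            self.pos += 1
            rhs = self.factor()
            value = value * rhs if op == "*" else value / rhs
        return value

    def factor(self):
        if self.peek() == "(":
            self.pos += 1
            value = self.expr()
            self.expect(")")
            return value
        start = self.pos
        while self.peek().isdigit():
            self.pos += 1
        if start == self.pos:
            raise SyntaxError(f"expected a number at position {self.pos}")
        return int(self.text[start:self.pos])


def parse(text):
    """Parse and evaluate a whole expression, rejecting trailing input."""
    parser = Parser(text)
    value = parser.expr()
    if parser.pos != len(parser.text):
        raise SyntaxError(f"unexpected input at position {parser.pos}")
    return value


print(parse("2 + 3 * (4 - 1)"))  # prints 11
```

Each method mirrors one rule of the grammar in the docstring, which is why the call hierarchy ends up reading much like the grammar itself.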
The advantages:-
I am not saying grammars are always unsuitable, but the benefits are often minimal and are often outweighed by the costs and risks.
(I believe the arguments for them are speciously appealing and that there is a general bias for them as it is a way of signaling that one is more computer-science literate.)
A few pieces of advice:
Don't overuse regular expressions - while they have their place, they simply don't have the power to handle any kind of real parsing. You can push them, but you're eventually going to hit a wall or end up with an unmaintainable mess. You're better off finding a parser generator that can handle a larger language set. If you really don't want to get into tools, you can look at recursive descent parsers - it's a really simple pattern for hand-writing a small parser. They aren't as flexible or as powerful as the big parser generators, but they have a much shorter learning curve.
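As a rough illustration of where regular expressions run out of power: a classic regular expression cannot keep track of arbitrarily deep nesting, while a small recursive function handles it naturally. The bracketed-list input format and the `parse_list` name below are hypothetical, chosen only for this sketch.

```python
def parse_list(text, pos=0):
    """Parse a bracketed list of integers and nested lists starting at text[pos].

    Returns (value, next_pos). The input format is a made-up example.
    """
    if pos >= len(text) or text[pos] != "[":
        raise ValueError(f"expected '[' at position {pos}")
    pos += 1
    items = []
    while True:
        if pos >= len(text):
            raise ValueError("unexpected end of input")
        ch = text[pos]
        if ch == "]":
            return items, pos + 1
        if ch in (",", " "):
            pos += 1  # separators are skipped leniently in this sketch
        elif ch == "[":
            item, pos = parse_list(text, pos)  # recursion handles the nesting
            items.append(item)
        elif ch.isdigit():
            start = pos
            while pos < len(text) and text[pos].isdigit():
                pos += 1
            items.append(int(text[start:pos]))
        else:
            raise ValueError(f"unexpected character {ch!r} at position {pos}")


value, end = parse_list("[1, [2, [3, 4]], []]")
print(value)  # prints [1, [2, [3, 4]], []]
```

The nesting depth is tracked by the call stack, which is exactly the kind of state a plain regular expression has no place to store.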
Unless you have very tight performance requirements, try and keep your layers separate - the lexer reads in individual tokens, the parser arranges those into a tree, and then semantic analysis checks over everything and links up references, and then a final phase to output whatever is being produced. Keeping the different parts of logic separate will make things easier to maintain later.
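The layering described above might be sketched as follows. This is only an illustration under stated assumptions: the tiny language (numbers, names and `+`), the tuple-based tree and the function names `lex`, `parse`, `check`, `evaluate` and `run` are all invented for the example.

```python
import re

# Layer 1 uses a regular expression, which is a reasonable fit for tokens.
TOKEN_RE = re.compile(
    r"(?P<NUMBER>\d+)|(?P<NAME>[A-Za-z_]\w*)|(?P<PLUS>\+)|(?P<BAD>\S)",
    re.ASCII,
)


def lex(text):
    """Layer 1: turn characters into (kind, value) tokens; whitespace is skipped."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        kind = match.lastgroup
        if kind == "BAD":
            raise SyntaxError(f"unexpected character {match.group()!r}")
        value = int(match.group()) if kind == "NUMBER" else match.group()
        tokens.append((kind, value))
    return tokens


def parse(tokens):
    """Layer 2: arrange tokens into a tree. Grammar: expr := atom ('+' atom)*"""
    pos = 0

    def atom():
        nonlocal pos
        if pos < len(tokens) and tokens[pos][0] in ("NUMBER", "NAME"):
            token = tokens[pos]
            pos += 1
            return token
        raise SyntaxError(f"expected a number or name at token {pos}")

    tree = atom()
    while pos < len(tokens) and tokens[pos][0] == "PLUS":
        pos += 1
        tree = ("ADD", tree, atom())
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token at index {pos}")
    return tree


def check(tree, env):
    """Layer 3: semantic analysis; every name must be defined in env."""
    kind = tree[0]
    if kind == "NAME" and tree[1] not in env:
        raise NameError(f"undefined name {tree[1]!r}")
    if kind == "ADD":
        check(tree[1], env)
        check(tree[2], env)


def evaluate(tree, env):
    """Layer 4: produce the output, here simply the computed value."""
    kind = tree[0]
    if kind == "NUMBER":
        return tree[1]
    if kind == "NAME":
        return env[tree[1]]
    return evaluate(tree[1], env) + evaluate(tree[2], env)


def run(text, env):
    tree = parse(lex(text))
    check(tree, env)
    return evaluate(tree, env)


print(run("x + 2 + y", {"x": 1, "y": 10}))  # prints 13
```

Because each layer only consumes the previous layer's output, any one of them can be replaced or tested on its own, for example swapping `evaluate` for a code generator without touching the lexer or parser.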
Read most of the Dragon book first.
Parsers are not complicated if you know how to build them, but they are NOT the type of thing you will eventually get right just by putting in enough time. It's way better to build on the existing knowledge base. (Otherwise, expect to write it and throw it away a few dozen times.)
Yep. Try to generate it rather than write it by hand. Consider using yacc, ANTLR, Flex/Bison, Coco/R, GOLD Parser generator, etc. Resort to manually writing a parser only if none of the existing parser generators fits your needs.
First, don't try to apply the same techniques to parsing everything. There are numerous possible use cases, from something like IP addresses (a bit of ad hoc code) to C++ programs (which need an industrial-strength parser with feedback from the symbol table), and from user input (which needs to be processed very fast) to compilers (which normally can afford to spend a little time parsing). You might want to specify what you're doing if you want useful answers.
Second, have a grammar in mind to parse with. The more complicated it is, the more formal the specification needs to be. Try to err on the side of being too formal.
Third, well, that depends on what you're doing.
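For the simple end of the range mentioned above (something like IP addresses), "a bit of ad hoc code" could look like the sketch below. The `parse_ipv4` function and its validation rules are assumptions made for illustration; for real work, Python's standard `ipaddress` module is usually the better choice.

```python
def parse_ipv4(text):
    """Parse dotted-quad IPv4 text into a tuple of four ints, or raise ValueError."""
    parts = text.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected 4 dot-separated parts, got {len(parts)}")
    octets = []
    for part in parts:
        # Only plain ASCII digits are accepted; this sketch allows leading zeros.
        if not (part.isascii() and part.isdigit()):
            raise ValueError(f"not a decimal number: {part!r}")
        value = int(part)
        if value > 255:
            raise ValueError(f"octet out of range: {value}")
        octets.append(value)
    return tuple(octets)


print(parse_ipv4("192.168.0.1"))  # prints (192, 168, 0, 1)
```

No lexer, grammar or tree is needed here, which is the point: the technique should match the size of the problem.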