编写一个可在 8 位嵌入式系统上使用的解析器,例如 Flex/Bison
我正在为简单的类似 BASIC 的语言编写一个小型解释器,作为使用 avr-gcc 工具链在 C 语言的 AVR 微控制器上的练习。
如果我编写这个代码是为了在我的 Linux 机器上运行,我可以使用 flex/bison。现在我将自己限制在 8 位平台上,我将如何编写解析器代码?
I'm writing a small interpreter for a simple BASIC like language as an exercise on an AVR microcontroller in C using the avr-gcc toolchain.
If I were writing this to run on my Linux box, I could use flex/bison. Now that I restricted myself to an 8-bit platform, how would I code the parser?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
如果您想要一种简单的方法来编写解析器,或者空间有限,则应该手动编写递归下降解析器;这些本质上是 LL(1) 解析器。这对于像 Basic 这样“简单”的语言尤其有效。 (我在 70 年代就做过其中几个!)。好消息是它们不包含任何库代码;就是你写的。
如果您已经掌握了语法,那么它们很容易编写代码。
首先,您必须摆脱左递归规则(例如, X = XY )。
这通常很容易做到,所以我把它作为练习。
(对于列表形成规则,您不必这样做;
参见下面的讨论)。
然后,如果您有以下形式的 BNF 规则:
为规则 (X, A, B, C) 中的每个项目创建一个返回布尔值的子例程
说“我看到了相应的语法结构”。对于 X,代码:
对于 A、B、C 类似。
如果令牌是终端,则编写代码来检查
组成终端的字符串的输入流。
例如,对于一个数字,检查输入流是否包含数字并推进
输入流光标超过数字。如果您这样做,这会特别容易
正在从缓冲区中解析(对于 BASIC,您往往一次会得到一行)
通过简单地推进或不推进缓冲区扫描指针。
这段代码本质上是解析器的词法分析器部分。
如果您的 BNF 规则是递归的……不用担心。只需编写递归调用代码即可。
它处理如下语法规则:
这可以编码为:
如果您有一个带有替代选项的 BNF 规则:
则使用替代选项编码 P:
有时您会遇到列表形成规则。
这些往往是递归的,这种情况很容易处理。基本思想是使用迭代而不是递归,这可以避免以“明显”方式执行此操作时出现的无限递归。
示例:
您可以使用迭代进行编码:
您可以用这种方式在一两天内编写数百个语法规则。
还有更多细节需要填写,但这里的基础知识应该足够了。
如果您的空间非常,您可以构建一个实现这些想法的虚拟机。这就是我在 70 年代所做的事情,当时你可以获得 8K 16 位字。
如果您不想手动编码,可以使用元编译器(Meta II< /a>) 产生本质上相同的东西。这些都是令人兴奋的技术乐趣,并且确实不需要做所有的工作,即使对于大型语法也是如此。
2014 年 8 月:
我收到很多关于“如何使用解析器构建 AST”的请求。有关详细信息,它本质上详细阐述了这个答案,请参阅我的其他答案 https://stackoverflow.com/a/25106688/120163< /a>
2015 年 7 月:
有很多人想要编写一个简单的表达式求值器。您可以通过执行上面“AST builder”链接建议的相同操作来完成此操作;只是做算术而不是构建树节点。
这是以这种方式完成的表达式计算器。
2021 年 10 月:
值得注意的是,当您的语言不存在递归下降无法很好处理的复杂性时,这种解析器就可以工作。我提供两种复杂情况:a)真正不明确的解析(例如,解析一个短语的不止一种方法)和b)任意长的前瞻(例如,不受常量限制)。在这些情况下,递归下降变成了地狱递归下降,是时候获得一个可以处理它们的解析器生成器了。请参阅我的简介,了解一个使用 GLR 解析器生成器来处理 50 多种不同语言的系统,其中包括所有这些甚至到了荒谬的复杂程度。
If you want an easy way to code parsers, or you are tight on space, you should hand-code a recursive descent parser; these are essentially LL(1) parsers. This is especially effective for languages which are as "simple" as Basic. (I did several of these back in the 70s!). The good news is these don't contain any library code; just what you write.
They are pretty easy to code, if you already have a grammar.
First, you have to get rid of left recursive rules (e.g., X = X Y ).
This is generally pretty easy to do, so I leave it as an exercise.
(You don't have to do this for list-forming rules;
see discussion below).
Then if you have BNF rule of the form:
create a subroutine for each item in the rule (X, A, B, C) that returns a boolean
saying "I saw the corresponding syntax construct". For X, code:
Similarly for A, B, C.
If a token is a terminal, write code that checks
the input stream for the string of characters that makes up the terminal.
E.g, for a Number, check that input stream contains digits and advance the
input stream cursor past the digits. This is especially easy if you
are parsing out of a buffer (for BASIC, you tend to get one line at time)
by simply advancing or not advancing a buffer scan pointer.
This code is essentially the lexer part of the parser.
If your BNF rule is recursive... don't worry. Just code the recursive call.
This handles grammar rules like:
This can be coded as:
If you have a BNF rule with an alternative:
then code P with alternative choices:
Sometimes you'll encounter list forming rules.
These tend to be left recursive, and this case is easily handled. The basic idea is to use iteration rather than recursion, and that avoids the infinite recursion you would get doing this the "obvious" way.
Example:
You can code this using iteration as:
You can code several hundred grammar rules in a day or two this way.
There's more details to fill in, but the basics here should be more than enough.
If you are really tight on space, you can build a virtual machine that implements these ideas. That's what I did back in 70s, when 8K 16 bit words was what you could get.
If you don't want to code this by hand, you can automate it with a metacompiler (Meta II) that produces essentially the same thing. These are mind-blowing technical fun and really takes all the work out of doing this, even for big grammars.
August 2014:
I get a lot of requests for "how to build an AST with a parser". For details on this, which essentially elaborates this answer, see my other SO answer https://stackoverflow.com/a/25106688/120163
July 2015:
There are lots of folks what want to write a simple expression evaluator. You can do this by doing the same kinds of things that the "AST builder" link above suggests; just do arithmetic instead of building tree nodes.
Here's an expression evaluator done this way.
October 2021:
Its worth noting that this kind of parser works when your language doesn't have complications that recursive descent doesn't handle well. I offer two kinds of complications: a) genuinely ambiguous parses (e.g., more than one way to parse a phrase) and b) arbitrarily long lookahead (e.g., not bounded by a constant). In these cases recursive descent turns into recursive descent into hell, and its time to get a parser generator that can handle them. See my bio for a system that uses GLR parser generators to handle over 50 different languages, including all these complications even to the point of ridiculousness.
我已经实现了一个针对 ATmega328p 的简单命令语言的解析器。该芯片有32k ROM,只有2k RAM。 RAM 绝对是更重要的限制——如果您还没有绑定到特定的芯片,请选择具有尽可能多 RAM 的芯片。这将使您的生活变得更加轻松。
起初我考虑使用flex/bison。我决定不使用此选项有两个主要原因:
拒绝 Flex & 后Bison,我去寻找其他生成器工具。以下是我考虑过的一些:
您可能还想参加查看维基百科的比较。
最终,我最终手动编写了词法分析器和解析器。
为了进行解析,我使用了递归下降解析器。我认为Ira Baxter 已经就这个主题做了足够的工作,并且在线有大量教程。
对于我的词法分析器,我为所有终端编写了正则表达式,绘制了等效的状态机图,并使用
goto
在状态之间跳转,将其实现为一个巨大的函数。这很乏味,但结果很好。顺便说一句,goto
是实现状态机的一个很好的工具——所有状态都可以在相关代码旁边有清晰的标签,没有函数调用或状态变量开销,而且它大约是尽可能快。 C 确实没有更好的结构来构建静态状态机。需要考虑的是:词法分析器实际上只是解析器的专业化。最大的区别在于,常规语法通常足以进行词法分析,而大多数编程语言(大部分)都具有上下文无关语法。因此,实际上没有什么可以阻止您将词法分析器实现为递归下降解析器或使用解析器生成器来编写词法分析器。它通常不如使用更专业的工具那么方便。
I've implemented a parser for a simple command language targeted for the ATmega328p. This chip has 32k ROM and only 2k RAM. The RAM is definitely the more important limitation -- if you aren't tied to a particular chip yet, pick one with as much RAM as possible. This will make your life much easier.
At first I considered using flex/bison. I decided against this option for two major reasons:
After rejecting Flex & Bison, I went looking for other generator tools. Here are a few that I considered:
You might also want to take a look at Wikipedia's comparison.
Ultimately, I ended up hand coding both the lexer and parser.
For parsing I used a recursive descent parser. I think Ira Baxter has already done an adequate job of covering this topic, and there are plenty of tutorials online.
For my lexer, I wrote up regular expressions for all of my terminals, diagrammed the equivalent state machine, and implemented it as one giant function using
goto
's for jumping between states. This was tedious, but the results worked great. As an aside,goto
is a great tool for implementing state machines -- all of your states can have clear labels right next to the relevant code, there is no function call or state variable overhead, and it's about as fast as you can get. C really doesn't have a better construct for building static state machines.Something to think about: lexers are really just a specialization of parsers. The biggest difference is that regular grammars are usually sufficient for lexical analysis, whereas most programming languages have (mostly) context-free grammars. So there's really nothing stopping you from implementing a lexer as a recursive descent parser or using a parser generator to write a lexer. It's just not usually as convenient as using a more specialized tool.
您可以在 Linux 上使用 flex/bison 及其本机 gcc 来生成代码,然后将其与 AVR gcc 交叉编译以用于嵌入式目标。
You can use flex/bison on Linux with its native gcc to generate the code that you will then cross-compile with your AVR gcc for the embedded target.
GCC 可以交叉编译到多种平台,但您可以在运行编译器的平台上运行 flex 和 bison。他们只是输出 C 代码,然后由编译器构建。对其进行测试,看看生成的可执行文件到底有多大。请注意,它们具有运行时库(
libfl.a
等),您还必须交叉编译到您的目标。GCC can cross-compile to a variety of platforms, but you run flex and bison on the platform you're running the compiler on. They just spit out C code that the compiler then builds. Test it to see how big the resulting executable really is. Note that they have run time libraries (
libfl.a
etc.) that you will also have to cross compile to your target.