当前位置：文江博客话题详情

编写一个可在 8 位嵌入式系统上使用的解析器，例如 Flex/Bison

发布于 2024-08-21 13:27:40 字数 155 浏览 3 评论 0原文

我正在为简单的类似 BASIC 的语言编写一个小型解释器，作为使用 avr-gcc 工具链在 C 语言的 AVR 微控制器上的练习。

如果我编写这个代码是为了在我的 Linux 机器上运行，我可以使用 flex/bison。现在我将自己限制在 8 位平台上，我将如何编写解析器代码？

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

歌枕肩 2024-08-28 13:27:40

如果您想要一种简单的方法来编写解析器，或者空间有限，则应该手动编写递归下降解析器；这些本质上是 LL(1) 解析器。这对于像 Basic 这样“简单”的语言尤其有效。（我在 70 年代就做过其中几个！）。好消息是它们不包含任何库代码；就是你写的。

如果您已经掌握了语法，那么它们很容易编写代码。
首先，您必须摆脱左递归规则（例如， X = XY ）。
这通常很容易做到，所以我把它作为练习。
（对于列表形成规则，您不必这样做；
参见下面的讨论）。

然后，如果您有以下形式的 BNF 规则：

 X = A B C ;

为规则 (X, A, B, C) 中的每个项目创建一个返回布尔值的子例程
说“我看到了相应的语法结构”。对于 X，代码：

subroutine X()
     if ~(A()) return false;
     if ~(B()) { error(); return false; }
     if ~(C()) { error(); return false; }
     // insert semantic action here: generate code, do the work, ....
     return true;
end X;

对于 A、B、C 类似。

如果令牌是终端，则编写代码来检查
组成终端的字符串的输入流。
例如，对于一个数字，检查输入流是否包含数字并推进
输入流光标超过数字。如果您这样做，这会特别容易
正在从缓冲区中解析（对于 BASIC，您往往一次会得到一行）
通过简单地推进或不推进缓冲区扫描指针。
这段代码本质上是解析器的词法分析器部分。

如果您的 BNF 规则是递归的……不用担心。只需编写递归调用代码即可。
它处理如下语法规则：

T  =  '('  T  ')' ;

这可以编码为：

subroutine T()
     if ~(left_paren()) return false;
     if ~(T()) { error(); return false; }
     if ~(right_paren()) { error(); return false; }
     // insert semantic action here: generate code, do the work, ....
     return true;
end T;

如果您有一个带有替代选项的 BNF 规则：

 P = Q | R ;

则使用替代选项编码 P：

subroutine P()
    if ~(Q())
        {if ~(R()) return false;
         return true;
        }
    return true;
end P;

有时您会遇到列表形成规则。
这些往往是递归的，这种情况很容易处理。基本思想是使用迭代而不是递归，这可以避免以“明显”方式执行此操作时出现的无限递归。
示例：

L  =  A |  L A ;

您可以使用迭代进行编码：

subroutine L()
    if ~(A()) then return false;
    while (A()) do { /* loop */ }
    return true;
end L;

您可以用这种方式在一两天内编写数百个语法规则。
还有更多细节需要填写，但这里的基础知识应该足够了。

如果您的空间非常，您可以构建一个实现这些想法的虚拟机。这就是我在 70 年代所做的事情，当时你可以获得 8K 16 位字。

如果您不想手动编码，可以使用元编译器（Meta II< /a>) 产生本质上相同的东西。这些都是令人兴奋的技术乐趣，并且确实不需要做所有的工作，即使对于大型语法也是如此。

2014 年 8 月：

我收到很多关于“如何使用解析器构建 AST”的请求。有关详细信息，它本质上详细阐述了这个答案，请参阅我的其他答案 https://stackoverflow.com/a/25106688/120163< /a>

2015 年 7 月：

有很多人想要编写一个简单的表达式求值器。您可以通过执行上面“AST builder”链接建议的相同操作来完成此操作；只是做算术而不是构建树节点。
这是以这种方式完成的表达式计算器。

2021 年 10 月：

值得注意的是，当您的语言不存在递归下降无法很好处理的复杂性时，这种解析器就可以工作。我提供两种复杂情况：a）真正不明确的解析（例如，解析一个短语的不止一种方法）和b）任意长的前瞻（例如，不受常量限制）。在这些情况下，递归下降变成了地狱递归下降，是时候获得一个可以处理它们的解析器生成器了。请参阅我的简介，了解一个使用 GLR 解析器生成器来处理 50 多种不同语言的系统，其中包括所有这些甚至到了荒谬的复杂程度。

If you want an easy way to code parsers, or you are tight on space, you should hand-code a recursive descent parser; these are essentially LL(1) parsers. This is especially effective for languages which are as "simple" as Basic. (I did several of these back in the 70s!). The good news is these don't contain any library code; just what you write.

They are pretty easy to code, if you already have a grammar.
First, you have to get rid of left recursive rules (e.g., X = X Y ).
This is generally pretty easy to do, so I leave it as an exercise.
(You don't have to do this for list-forming rules;
see discussion below).

Then if you have BNF rule of the form:

 X = A B C ;

create a subroutine for each item in the rule (X, A, B, C) that returns a boolean
saying "I saw the corresponding syntax construct". For X, code:

subroutine X()
     if ~(A()) return false;
     if ~(B()) { error(); return false; }
     if ~(C()) { error(); return false; }
     // insert semantic action here: generate code, do the work, ....
     return true;
end X;

Similarly for A, B, C.

If a token is a terminal, write code that checks
the input stream for the string of characters that makes up the terminal.
E.g, for a Number, check that input stream contains digits and advance the
input stream cursor past the digits. This is especially easy if you
are parsing out of a buffer (for BASIC, you tend to get one line at time)
by simply advancing or not advancing a buffer scan pointer.
This code is essentially the lexer part of the parser.

If your BNF rule is recursive... don't worry. Just code the recursive call.
This handles grammar rules like:

T  =  '('  T  ')' ;

This can be coded as:

subroutine T()
     if ~(left_paren()) return false;
     if ~(T()) { error(); return false; }
     if ~(right_paren()) { error(); return false; }
     // insert semantic action here: generate code, do the work, ....
     return true;
end T;

If you have a BNF rule with an alternative:

 P = Q | R ;

then code P with alternative choices:

subroutine P()
    if ~(Q())
        {if ~(R()) return false;
         return true;
        }
    return true;
end P;

Sometimes you'll encounter list forming rules.
These tend to be left recursive, and this case is easily handled. The basic idea is to use iteration rather than recursion, and that avoids the infinite recursion you would get doing this the "obvious" way.
Example:

L  =  A |  L A ;

You can code this using iteration as:

subroutine L()
    if ~(A()) then return false;
    while (A()) do { /* loop */ }
    return true;
end L;

You can code several hundred grammar rules in a day or two this way.
There's more details to fill in, but the basics here should be more than enough.

If you are really tight on space, you can build a virtual machine that implements these ideas. That's what I did back in 70s, when 8K 16 bit words was what you could get.

If you don't want to code this by hand, you can automate it with a metacompiler (Meta II) that produces essentially the same thing. These are mind-blowing technical fun and really takes all the work out of doing this, even for big grammars.

August 2014:

I get a lot of requests for "how to build an AST with a parser". For details on this, which essentially elaborates this answer, see my other SO answer https://stackoverflow.com/a/25106688/120163

July 2015:

There are lots of folks what want to write a simple expression evaluator. You can do this by doing the same kinds of things that the "AST builder" link above suggests; just do arithmetic instead of building tree nodes.
Here's an expression evaluator done this way.

October 2021:

Its worth noting that this kind of parser works when your language doesn't have complications that recursive descent doesn't handle well. I offer two kinds of complications: a) genuinely ambiguous parses (e.g., more than one way to parse a phrase) and b) arbitrarily long lookahead (e.g., not bounded by a constant). In these cases recursive descent turns into recursive descent into hell, and its time to get a parser generator that can handle them. See my bio for a system that uses GLR parser generators to handle over 50 different languages, including all these complications even to the point of ridiculousness.

回复收藏 0 原文

等往事风中吹 2024-08-28 13:27:40

我已经实现了一个针对 ATmega328p 的简单命令语言的解析器。该芯片有32k ROM，只有2k RAM。 RAM 绝对是更重要的限制——如果您还没有绑定到特定的芯片，请选择具有尽可能多 RAM 的芯片。这将使您的生活变得更加轻松。

起初我考虑使用flex/bison。我决定不使用此选项有两个主要原因：

默认情况下，Flex 和 Flex Bison 依赖于一些标准库函数（尤其是 I/O），这些函数在 avr-libc 中不可用或无法正常工作。我很确定有受支持的解决方法，但这是您需要考虑的一些额外工作。
AVR 具有哈佛架构。 C 的设计并未考虑到这一点，因此即使是常量变量也会默认加载到 RAM 中。您必须使用特殊的宏/函数来存储和访问 flash 和 EEPROM。柔性和Bison 创建一些相对较大的查找表，这些表会很快耗尽您的 RAM。除非我弄错了（这很可能），否则您将必须编辑输出源才能利用特殊的 Flash 和 Flash 功能。 EEPROM 接口。

拒绝 Flex & 后Bison，我去寻找其他生成器工具。以下是我考虑过的一些：

您可能还想参加查看维基百科的比较。

最终，我最终手动编写了词法分析器和解析器。

为了进行解析，我使用了递归下降解析器。我认为Ira Baxter 已经就这个主题做了足够的工作，并且在线有大量教程。

对于我的词法分析器，我为所有终端编写了正则表达式，绘制了等效的状态机图，并使用 goto 在状态之间跳转，将其实现为一个巨大的函数。这很乏味，但结果很好。顺便说一句，goto 是实现状态机的一个很好的工具——所有状态都可以在相关代码旁边有清晰的标签，没有函数调用或状态变量开销，而且它大约是尽可能快。 C 确实没有更好的结构来构建静态状态机。

需要考虑的是：词法分析器实际上只是解析器的专业化。最大的区别在于，常规语法通常足以进行词法分析，而大多数编程语言（大部分）都具有上下文无关语法。因此，实际上没有什么可以阻止您将词法分析器实现为递归下降解析器或使用解析器生成器来编写词法分析器。它通常不如使用更专业的工具那么方便。

I've implemented a parser for a simple command language targeted for the ATmega328p. This chip has 32k ROM and only 2k RAM. The RAM is definitely the more important limitation -- if you aren't tied to a particular chip yet, pick one with as much RAM as possible. This will make your life much easier.

At first I considered using flex/bison. I decided against this option for two major reasons:

By default, Flex & Bison depend on some standard library functions (especially for I/O) that aren't available or don't work the same in avr-libc. I'm pretty sure there are supported workarounds, but this is some extra effort that you will need to take into account.
AVR has a Harvard Architecture. C isn't designed to account for this, so even constant variables are loaded into RAM by default. You have to use special macros/functions to store and access data in flash and EEPROM. Flex & Bison create some relatively large lookup tables, and these will eat up your RAM pretty quickly. Unless I'm mistaken (which is quite possible) you will have to edit the output source in order to take advantage of the special Flash & EEPROM interfaces.

After rejecting Flex & Bison, I went looking for other generator tools. Here are a few that I considered:

You might also want to take a look at Wikipedia's comparison.

Ultimately, I ended up hand coding both the lexer and parser.

For parsing I used a recursive descent parser. I think Ira Baxter has already done an adequate job of covering this topic, and there are plenty of tutorials online.

For my lexer, I wrote up regular expressions for all of my terminals, diagrammed the equivalent state machine, and implemented it as one giant function using goto's for jumping between states. This was tedious, but the results worked great. As an aside, goto is a great tool for implementing state machines -- all of your states can have clear labels right next to the relevant code, there is no function call or state variable overhead, and it's about as fast as you can get. C really doesn't have a better construct for building static state machines.

Something to think about: lexers are really just a specialization of parsers. The biggest difference is that regular grammars are usually sufficient for lexical analysis, whereas most programming languages have (mostly) context-free grammars. So there's really nothing stopping you from implementing a lexer as a recursive descent parser or using a parser generator to write a lexer. It's just not usually as convenient as using a more specialized tool.

回复收藏 0 原文