在 C 中解析 DSL 比 lex/yacc 更好的解决方案?
我的一个程序在运行时接受命令(例如 kill foo
)。将其视为一种特定于领域的语言。这里有几个例子:
kill
kill client
exit
而且,链式命令是允许的,并且命令前后的空格并不重要,因此以下示例也是有效的:
kill ; say "that was fun"
kill ; kill ; kill;
我目前已经使用 lex/yacc (具体来说是 flex/bison)实现了这一点,并且引起了很多头痛。词法分析器在很大程度上取决于上下文(例如,通常不会返回空格标记,除非在 kill
关键字之后)并且具有许多不同的状态。语法过去常常存在冲突,而且我不太喜欢指定它的格式(尤其是 $1、$2、$3、... 使用非终结符的参数)。此外,野牛(在解析时)提供的错误消息有时是准确的,但通常不是(带有可选参数的 kill
命令会导致错误消息,例如 Unexpected $undefined, Expected $end或 ;
用于kill clont
而不是kill client
)。最后,yacc 的 C API 很残酷(到处都是外部定义)。
我并不是要求您解决所有上述问题(如果没有办法绕过 lex/yacc,我将打开单独的线程,其中包含更具体的描述和代码)。相反,我对 lex/yacc 的替代品感兴趣。
我的标准如下:
- 输入是字符串(const char *),没有输出,但应该为每个不同的关键字调用一些代码。
- 我想将其与 C (C99) 一起使用。
- 该软件应该已经包含在主要的 Linux 发行版中,或者至少易于捆绑/打包。
- 它应该有详细记录。
- 描述我的语言的语法应该很简单。
- 它应该在解析错误时输出有意义的错误消息。
- 性能并不是那么重要(当然它应该很快,但典型的用例是交互式使用,而不是处理大量的命令)。
One of my programs accepts a commands (like kill foo
) at runtime. Think of it as a little domain-specific language. Here are a few examples:
kill
kill client
exit
But also, chained commands are allowed and whitespace is not significant before and after commands, so the following examples are also valid:
kill ; say "that was fun"
kill ; kill ; kill;
I have currently implemented this with lex/yacc (flex/bison to be specific) and that caused a lot of headache. The lexer very much depends on the context (for example whitespace tokens are generally not returned, unless after a kill
keyword for example) and has many different states. The grammar used to have conflicts and I don’t really like the format in which it has to be specified (especially the $1, $2, $3, … to use arguments for non-terminals). Also, the error messages which bison provides (at parse-time) are sometimes accurate, but often not (the kill
command with optional arguments leads to error messages like Unexpected $undefined, expected $end or ;
for kill clont
instead of kill client
). Lastly, the C API for yacc is cruel (external defines all over the place).
I am not asking you to solve all the aforementioned questions (I will open separate threads with more specific descriptions and code if there is no way around lex/yacc). Instead, I am interested in alternatives to lex/yacc.
My criteria are the following:
- Input is a string (const char *), there is no output but instead some code should be called for each different keyword.
- I want to use this with C (C99).
- The software should be already included in the major linux distros or at least easy to bundle / package.
- It should be well-documented.
- The syntax for describing my language should be easy.
- It should output meaningful error messages upon parsing errors.
- Performance is not that important (of course it should be fast, but the typical use case is interactive usage, not processing tons of MB of commands).
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
data:image/s3,"s3://crabby-images/d5906/d59060df4059a6cc364216c4d63ceec29ef7fe66" alt="扫码二维码加入Web技术交流群"
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(6)
至于非常简单和小的语法,我会考虑手动编写词法分析器/解析器 - 通常不需要那么多工作。
几乎所有 Linux 发行版都附带了 lex/yacc 的变体。除此之外,另外两个广泛使用的解析器生成器是 lemon 和 antlr。
As for a very simple and small grammar, I'd consider writing the lexer/parser by hand - it's often not that much work.
Virtually all linux distros ship a variant of lex/yacc. Other than that, two other widely used parser generators are lemon and antlr.
由于您的语言看起来非常简单,因此我建议实现一个有限状态机来标记和解析输入。
只需一次读取一个字符,在空白处进行标记(而不是在带引号的字符串中)。每个“命令”使机器处于不同的状态,在该状态下解析命令参数。 “;”或“\n”将机器重置为其启动状态。
Since your language looks pretty simple, I suggest implementing a finite state machine that tokenizes and parses the input.
Simply read the input one character at a time, tokenizing at whitespace (while not in a quoted string). Each "command" takes the machine in a different state where it parses the command arguments. ";" or "\n" reset the machine to its start state.
我非常喜欢 ANTLR,在生产系统中使用过几次。
奇怪的是,在版本 2 中它支持生成 C++ 代码但不支持 C,而在版本 3 中它支持生成 C 代码但不支持 C++。我喜欢 C++,所以仍然使用 ANTLR v2,但您可能会喜欢 v3。这对你来说更好。
许多发行版都有 ANTLR v2 软件包,有些还有 v3。它有相当详细的文档(请注意,我使用 v2;我希望 v3 在这方面不会更差)。
ANTLR 不会“开箱即用”生成超级棒的解析错误消息。这似乎是大多数通用解析器系统的共同点,并且从根本上来说不是一个容易解决的问题。然而,通过一些工作,我看到了来自基于 ANTLR 的系统的一些不错的诊断输出(应用程序有一些逻辑来帮助弄清楚要对用户说些什么——ANTLR 这里没有太多魔力)。
I like ANTLR a lot, having used it a couple times in production systems.
Something quirky about it is that in version 2 it supported generating C++ code but not C, whereas in version 3 it supports generating C code but not C++. I like C++ and so still use ANTLR v2, but you will probably enjoy v3. So much the better for you.
Many distros have ANTLR v2 packages, and some have v3 as well. It's fairly well documented (note that I use v2; I hope v3 is not worse in this regard).
ANTLR does not generate super awesome parsing error messages "out of the box." This seems to be something common to most general parser systems, and is fundamentally not an easy problem to solve. With some work, however, I've seen some decent diagnostic output coming from an ANTLR-based system (the application had some logic to help figure out what to say to the user--ANTLR doesn't have much magic here).
Lex & 的一个有趣的替代品Yacc 是 Lemon 解析器。它有很多优点,但我还没有认真使用它,所以我不完全确定它的实际效果如何。它由 SQLite 使用。
An interesting alternative to Lex & Yacc is the Lemon parser. It has quite a lot going for it, but I have not used it in earnest, so I'm not completely sure how well it really works. It is used by SQLite.
您可能需要考虑Ragel。我最近开始使用它,并且发现一旦你熟悉了它的速度,使用它是一种乐趣。在您的示例中,您可能会执行类似的操作(注意:未测试!):
使用
ragel
通过 Ragel 运行它,它会输出;
。You may want to consider Ragel. I recently started using it and have found it a pleasure to work with once you get up to speed. In your example, you might do something like (note: not tested!):
Run it through Ragel with
ragel <filename.rl>
and it will spit out<filename.c>
.您需要一个无词法分析器(例如,PEG 的实现)。由于您正在使用 C 并且已经熟悉 yacc,因此 this 之类的东西可能值得尝试。
如果您的语法足够简单,您可以实现一个临时递归下降解析器。
You'd need a lexerless parser (e.g., an implementation of PEG). As you're using C and already familiar with yacc, something like this might be worth trying.
And if your grammar is simple enough, you can implement an ad hoc recursive descent parser instead.