Design guidelines for a parser and lexer?

Posted 2024-09-08 10:31:56

I'm writing a lexer (with re2c) and a parser (with Lemon) for a slightly convoluted data format: CSV-like, but with specific string types at specific places (alphanumeric chars only, alphanumeric chars and minus signs, any char except quotes and comma but with balanced braces, etc.), strings inside braces and strings that look like function calls with opening and closing braces that can contain parameters.

My first shot at it was a lexer with many states, each state catering to a specific string format. But after many unhelpful "unexpected input" messages from the lexer (which got very big), I realized that maybe it was trying to do the work of the parser. I scrapped my first try and went with a lexer with only one state, many character tokens, and a parser that combines the tokens into the different string types. This works better; I get more helpful syntax errors from the parser when something is off, but it still doesn't feel quite right. I am thinking of adding one or two states to the lexer, but initiating those states from the parser, which has a much better "overview" of which string type is required in a given instance. Overall I feel a bit stupid :(
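
For illustration, a minimal sketch of what such a single-state lexer might look like in re2c, assuming NUL-terminated input; the token names and the scan() interface are invented for this example and are not from the original post:

    #include <stddef.h>

    /* One token kind per character class; the parser assembles
     * these into the different string types. */
    enum token { T_EOF = 0, T_ALNUM, T_MINUS, T_LBRACE, T_RBRACE,
                 T_COMMA, T_QUOTE, T_OTHER };

    enum token scan(const char **input)
    {
        const char *YYCURSOR = *input;
        const char *YYMARKER = NULL;
        enum token tok;
        (void)YYMARKER;

        /*!re2c
            re2c:define:YYCTYPE = char;
            re2c:yyfill:enable  = 0;

            [a-zA-Z0-9]+ { tok = T_ALNUM;  goto done; }
            "-"          { tok = T_MINUS;  goto done; }
            "{"          { tok = T_LBRACE; goto done; }
            "}"          { tok = T_RBRACE; goto done; }
            ","          { tok = T_COMMA;  goto done; }
            "\""         { tok = T_QUOTE;  goto done; }
            "\x00"       { tok = T_EOF;    goto done; }
            [^]          { tok = T_OTHER;  goto done; }
        */
    done:
        *input = YYCURSOR;
        return tok;
    }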

I have no formal CS background and shy away a bit from the math-heavy theory. But maybe there is a tutorial or book somewhere that explains what a lexer should (and should not) do and which part of the work the parser should do: how to construct good token patterns, when to use lexer states, when and how to use recursive rules (with an LALR parser), and how to avoid ambiguous rules. In short, a pragmatic cookbook that teaches the basics. The "Lex and YACC primer/HOWTO" was nice, but not enough. Since I just want to parse a data format, books on compiler construction (like the red dragon book) look a bit oversized to me.

Or maybe someone can give me some plain rules here.

Comments (2)

自在安然 2024-09-15 10:31:56

What you should really do is write a grammar for your language. Once you have that, the boundary is easy:

  • The lexer is responsible for taking your input and telling you which terminal you have.
  • The parser is responsible for matching a series of terminals and nonterminals to a production rule, repeatedly, until you either have an Abstract Syntax Tree (AST) or a parse failure.

The lexer is not responsible for input validation, except insofar as it rejects impossible characters and other very basic bits. The parser does all of that.

Take a look at https://www.cs.rochester.edu/u/nelson/courses/csc_173/grammars/parsing.html . It's an intro CS course page on parsing.
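
To make that concrete, here is a hypothetical Lemon grammar fragment for a format like the one in the question. All terminal and nonterminal names are invented, and a real grammar would likely need refinement, but it shows where the boundary sits: the uppercase terminals come from the lexer, and everything else is production rules.

    /* Terminals (uppercase) come from the lexer; nonterminals
     * (lowercase) are built by the production rules. */

    record     ::= field_list.
    field_list ::= field_list COMMA field.
    field_list ::= field.

    field      ::= word.                       /* plain alphanumeric string */
    field      ::= braced.                     /* string inside braces      */
    field      ::= call.                       /* function-call-like string */

    word       ::= ALNUM.
    word       ::= word ALNUM.
    word       ::= word MINUS.

    braced     ::= LBRACE field_list RBRACE.   /* nesting is just recursion */

    call       ::= ALNUM LBRACE field_list RBRACE.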

删除会话 2024-09-15 10:31:56

A good litmus test for deciding if something should be done by a parser or lexer is to ask yourself a question:

Does the syntax have any recursive, nested, self-similar elements?
(e.g. nested parentheses, braces, tags, subexpressions, subsentences etc.).

If not, plain regular expressions are enough, and it can be done by the lexer.
If yes, it should be analysed by a parser, because it is at the very least a context-free grammar.

The lexer is generally for finding the "words" of your language and classifying them (is it a noun? a verb? an adjective? etc.).
The parser is for finding proper "sentences": structuring them and determining whether they are proper sentences in the given language.
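
To illustrate that litmus test with a hand-written sketch (independent of re2c and Lemon; all names are invented): a flat "word" can be matched by a single loop, i.e. a regular expression, while balanced braces force the matching rule to call itself.

    #include <stddef.h>

    /* "word" is regular: one loop, no recursion -- lexer territory.
     * (Limited to lowercase letters and digits for brevity.) */
    static const char *match_word(const char *p)
    {
        while ((*p >= 'a' && *p <= 'z') || (*p >= '0' && *p <= '9'))
            p++;
        return p;
    }

    /* "braced" is self-similar: braced = '{' (word | braced)* '}'.
     * The rule refers to itself, so it is context-free -- parser
     * territory. Returns the position just past the closing '}',
     * or NULL on malformed input. */
    static const char *match_braced(const char *p)
    {
        if (*p != '{')
            return NULL;
        p++;
        while (*p && *p != '}') {
            if (*p == '{') {
                p = match_braced(p);   /* recursion handles nesting */
                if (p == NULL)
                    return NULL;
            } else {
                const char *q = match_word(p);
                if (q == p)            /* unexpected character */
                    return NULL;
                p = q;
            }
        }
        return (*p == '}') ? p + 1 : NULL;
    }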
