Communication between a lexer and a parser

Published 2025-01-08 21:59:28


Most of the resources on lexical analyzers and parsers illustrate use of streams to communicate between them (or so I understand).

It is explained that the parser asks for the next token, say by calling a function getNextToken(), and the lexer responds to it by returning the next token. Are we supposed to think of them as two objects interacting within the same program or two different programs interacting through streams?

Also, I haven't been able to understand why a serial approach isn't chosen, i.e. the lexical analyzer runs till the end of the provided source, and only then does the parser use the output of the lexical analyzer for parsing. To be precise, if the lexical analyzer reads the next lexeme only when the parser asks for the next token, how is an error handled? Especially if the error occurs towards the end of the file, all the computation done by the parser might be wasted due to the error (assuming a very basic parser without any error handling capabilities). Is recent output cached?


Answers (2)

网白 2025-01-15 21:59:28


My answer below is in reference to the Flex-Bison (or Lex-Yacc) model of compilation only. I have little knowledge of other models.

I think of the lexer / parser combination as two co-operating modules in the same program.
When you use Flex with Bison you see that the latter calls a function yylex() provided by the former (yylex() is equivalent to the getNextToken() function in your question). So it makes more sense to think of them as co-operating units in a single program rather than two different programs. Moreover, if the lexer and parser were 2 different programs, you'd have to deal with inter-process communication, shared memory, and related issues, further complicating the task at hand.

To answer your second question:
I can think of one important issue that could arise from the parser coming into action only after the lexer has finished reading all input: memory usage would be enormous for even moderate-sized programs, as you would have to store a data structure for every token in memory (think of tokens like ',' and '=' each occupying multiple bytes, and you'll quickly see why that doesn't scale).

As for error handling: if the lexer cannot match the input to any regular expression, then yylex() should return a -1 to the parser, using a flex rule as follows:

.               { return -1; }

(Note the near-invisible period in the first column, which matches any input symbol except \n)

(NOTE: This rule should be the last rule to appear in your flex file, because rule order determines priority: Flex always takes the longest possible match, and among rules that match the same longest text, the one listed first in the file wins. A one-character catch-all placed earlier would therefore shadow every other one-character rule.)
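As a sketch, a rules section honoring that ordering might look like the following (the token names are hypothetical, not from the original answer; only the final catch-all rule is the one shown above):

```lex
%%
"if"            { return TOK_IF;     }  /* specific patterns first          */
[0-9]+          { return TOK_NUMBER; }  /* longest match beats the '.' rule */
[ \t\n]+        { /* skip blanks */  }  /* '.' never matches \n anyway      */
.               { return -1;         }  /* catch-all LAST: lexical error    */
%%
```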

A return value of -1 from the lexer indicates a tokenization error, and the Bison parser handles it automatically by calling yyerror(char *) (ideally defined by you); likewise, if the parser encounters a syntax error in the token stream, it again calls yyerror(char *).

Also, if you want to display the erroneous piece of code when an error is encountered, you need a way to get from the offending token back to the related source text. That means the approach of reading the input entirely and only then parsing would not work at all, unless you stored the associated source text with every token while tokenizing, essentially turning the compiler into a memory behemoth.

梦太阳 2025-01-15 21:59:28


Most of the resources on lexical analyzers and parsers illustrate use
of streams to communicate between them (or so I understand).

None of the ones I've seen do that. They rely on the lexical analyser being a single method that is called by the parser.
