将不应该存在的节点放入解析树中

发布于 2024-11-28 13:53:49 字数 706 浏览 9 评论 0原文

我正在为一种语言编写一个解析器，并且扫描器被设计为

要么返回不需要的终端（例如空格），要么
不这样做

根据布尔标志

。现在，在解析器中，我不想用所有这些终端来混淆语法，它们应该以某种方式被我正在构建的解析树“自动”吞没。

为了实现这个“魔法”，我想我应该链接终端（简单链接的循环列表），这样我就可以迭代它们并在减少发生时“填充空白”（我正在使用 LALR(1) 解析器生成器）。

这听起来是一个明智的想法，但存在一个问题。还记得我说过“要么回来……要么不回来”吗？在场景（2）中，我将释放终端，因为谁知道接下来会发生什么？我不希望有任何内存泄漏。

但在场景（1）中，我无法释放终端，因为根据它们，我将决定进一步减少“填空”过程应该停止的位置。

我也不能有条件地释放它，因为同样的原因：我不知道接下来会发生什么。如果不会触发任何“填空”过程怎么办？如果根本不再减少怎么办？

你有遇到过类似的问题吗？你是怎么解决的？

注意：这都是我的想法，我可能解释得不够清楚，请询问，我会编辑我的问题。这个场景实际上有点复杂，我不是从头开始写这个，我可以发挥我的想象力，我将它融入到其他东西中，所以我很可能会回答“我做不到”因为环境的限制”。

附录

我想到的唯一真正好的想法是分叉和改进解析器生成器，我已经在一些小地方完成了这项工作，以克服我上面提到的一些限制。

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

携君以终年 2024-12-05 13:53:50

你的词汇有点奇怪。大多数解析器旨在识别语言的语法。通常，语言定义定义一些终端概念，并明确排除“空白”，“空白”由终端文本之间无趣的文本序列组成，通常包括空白、制表符和各种独立注释。因此，解析中使用的“终端”一词通常意味着“那些不是空白的语言原子”。您已将其隐式定义为包含空格，我认为这引起了您的悲伤。

从这个角度来看，避免解析器使用的语法定义与空格混淆的最简单方法是简单地让词法分析器不将空格传递给解析器。然后你的语法不需要表明它们是如何处理的（是的，这样做的语法真的很混乱），解析器不必担心它们，它们也不会出现在树中。

如果您正在构建编译器或解释器，那么忽略空格是最简单的。

如果您正在构建重新设计解析器（请参阅我们的 DMS 软件重新设计工具包，那么（至少）在 AST 中捕获评论很重要，因为最终
如果想要从构建的 AST 重新生成文本，如果重新生成的文本也包含注释，将会很有帮助。 [您可以通过其他方式做到这一点，但它们并不那么容易]。

DMS 词法分析器在内部生成“微”标记，它们是语言标记、空格和注释的概念。它丢弃了空白微标记，因为它们根本不添加任何内容（参见上面的讨论）。正如您所期望的，它将传统标记传递给解析器。它将注释标记粘合到前面或后面的语言标记，具体取决于标记类型和遇到的位置；对于 C，在附加标记之前会看到 /* ... */，并且 // ... 注释会附加到前面的标记（这里未讨论一些更微妙的细节）。然后解析仍然只看到语言标记，因此语法并没有不必要的复杂，并且如果附加到标记的所有信息都放置在树中，则注释也会随之而来。

现在，人们经常想要“抽象”语法树；他们想省略“(”和“)”之类的东西。我上面描述的方案甚至对像这样的具体标记附加了注释。现在有一个复杂的情况：如果您将 (..) 标记留在树之外，附加的注释就会消失。哎呀。
因此，DMS 解析器做了一件复杂的事情：附加到在树中具有逻辑位置但实际上不存在的标记的注释（“消除的终端”）被提升到父树节点，并带有一个注释，说明它们属于丢失的子标记。是的，实施这确实是一个 PITA。
好消息是我们只需在 DMS 的通用解析机制中执行一次，并且它适用于很多很多语言。但这意味着您必须愿意构建一个不寻常的（“重新设计”）解析器，并且我们有这样做的商业动机。

编辑：尚不清楚为什么OP想要这个，但他坚持捕获树中的空白。由于他没有告诉我们原因，我猜测：他想要令牌/树节点的精确列信息。这并不难做到：教词法分析器跟踪位置（行/列），并用开始/结束位置标记每个标记（也包括注释等微标记），然后让解析器将该信息存储在树。这种方式也避免了在树中保留空白。（DMS 也这样做，因为在报告问题时，精确的信息很有用，并且在重新生成代码时，将代码放回其原始位置（至少同一列）通常是可取的）。

EDIT2：如果OP坚持捕获空白，他可能会考虑探索无扫描器GLR解析。这会保留输入流中的每个字符，包括空格。

Your vocabulary is a little odd. Most parsers are designed to recognize the syntax of the langauge. usually the language defintion defines some notion of terminals and explicitly excludes "whitespace", consisting of uninteresting sequnces of text between the text of terminals, often including blanks, tabs, and various kinds of free standing comments. So the word "terminal" used in parsing usually means "those language atoms that are not whitespace". You've defined it implicitly to include whitespace, and I think that is causing your grief.

From this point of view, the easiest way to avoid clutting the grammar definition used by your parser with whitespace, is to simply have the lexer not pass whitespace to the parser. Then your grammar doesn't need to indicate how they are handled (and yes, grammars that do so are really messy), the parser doesn't have to worry about them and they don't show up in the tree.

If you are building a compiler or an interpreter, then ignoring whitespace is easiest.

If you are building a re-engineering parser (see our DMS Software Reengineering Toolkit, then capturing comments (at least) in the AST is important, as eventually
one want to regenerate text from consturcted ASTs, and it is helpful if the regenerated text contains the comments, too. [You can do this other ways but they aren't as easy].

The DMS lexer produces "micro"tokens which are your concept of langauge tokens, whitespace, and comments, internally. It throws away whitespace micro-tokens because they simply don't add anything (see above discussion). It passes conventional tokens to the parser, as you'd expect. It glues comment tokens to the preceding or following language token, depending on the token type and where encountered; for C, a /* ... */ seen before a token is attached to it, and a // ... comment is attached to the preceding token (with a few more subtle details not discussed here). Then the parse still only sees language tokens, so the grammar isn't needlessly complicated, and if all the information attached to the token is placed in the tree, the comments go along for the ride.

Now, people often want "Abstract" syntax trees; they want to leave out things like "(" and ")". The scheme I describe above attached comments to even concrete tokens like these. Now there's a complication: if you leave the ( .. ) tokens out of the tree, attached comments vanish. Oops.
So DMS parsers do a complicated thing: comments attached to tokens that have logical place in the tree but aren't actually there ("eliminated terminals") are lifted to the parent tree node with an annotation saying they belong on the missing child token. Yes, implementing this is indeed a PITA.
The good news is we only had to do it once in DMS's general parsing machinery, and it works for many, many languages. But this means you have to be willing to build an unusual ("reengineering") parser, and we had commercial motivation to do so.

EDIT: It isn't clear why OP wants this, but he insists on capturing whitespace in the tree. Since he hasn't told us why, I'm going to guess: he wants precise column information for the tokens/tree nodes. That isn't hard to do: teach the lexer to keep track of position (line/column), and stamp each token (micro-tokens such as comments, too) with start/end position, and let the parse store that information in the tree. This way avoids keeping whitespace in the tree, too. (DMS does this too, because when reporting problems, precise information is useful, and when regenerating code, placing code back at its original position (at least the same column) is often desirable).

EDIT2: If OP insists on capturing whitespace, he might consider exploring scannerless GLR parsing. This keeps every character in the input stream, including the whitespace.

回复收藏 0 原文

~没有更多了~