Which parser generators with code separation and language extensibility would you recommend?

Posted 2024-11-05 05:48:35


I'm looking for a context-free grammar parser generator with grammar/code separation and the possibility of adding support for new target languages. For instance, if I want a parser in Pascal, I can write my own Pascal code generator without reimplementing the whole thing.

I understand that most open-source parser generators can in theory be extended; still, I'd prefer something that has extensibility planned and documented.

Feature-wise I need the parser to at least support Python-style indentation, maybe with some additional work. No requirement on the type of parser generated, but I'd prefer something fast.

Which are the most well-known/maintained options?

Popular parser generators seem mostly to use a mixed grammar/code approach, which I really don't like. The comparison list on Wikipedia lists a few, but I'm a novice at this and can't tell which to try.

Why I don't like mixing grammar/code: because this approach seems like a mess. Grammar is grammar, implementation details are implementation details. They're different things written in different languages, it's intuitive to keep them in separate places.

What if I want to reuse parts of grammar in another project, with different implementation details? What if I want to compile a parser in a different language? All of this requires grammar to be kept separate.
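
To make "grammar/code separation" concrete, here is roughly what I have in mind (a hypothetical sketch in Python, not any particular tool's output): the grammar lives in its own file, and each project supplies its own walker over a generic parse tree.

    # expr.ebnf (separate file, target-language-agnostic, reusable):
    #   expr   : expr ('+'|'-') term | term ;
    #   term   : term ('*'|'/') factor | factor ;
    #   factor : NUMBER | '(' expr ')' ;

    # One project's implementation details, kept apart from the grammar.
    # Assumes the generated parser emits generic (kind, payload) tuples.
    from operator import add, sub, mul, truediv

    OPS = {"+": add, "-": sub, "*": mul, "/": truediv}

    def evaluate(node):
        kind, payload = node
        if kind == "num":
            return float(payload)            # leaf: payload is the lexeme
        if kind == "binop":
            op, left, right = payload
            return OPS[op](evaluate(left), evaluate(right))
        raise ValueError(f"unknown node kind: {kind}")

    # The tree a generator might emit for "2 * (3 + 4)":
    tree = ("binop", ("*", ("num", "2"),
                           ("binop", ("+", ("num", "3"), ("num", "4")))))
    assert evaluate(tree) == 14.0

A Pascal back end (or another project) could then reuse expr.ebnf untouched and supply its own walker.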


Comments (2)

愛放△進行李 2024-11-12 05:48:35


Most parser generators won't handle context-free grammars. They handle some subset (LL(1), LL(k), LL(*), LALR(1), LR(k), ...). If you choose one of these, you will almost certainly have to hack your grammar to match the limitations of the parser generator (no left recursion, limited lookahead, ...). If you want a real context-free parser generator, you want an Earley parser generator (inefficient), a GLR parser generator (the most practical of the lot), or a PEG parser generator (and the last isn't context-free; it requires rules to be ordered to determine which ones take precedence).
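
To make that grammar hacking concrete: transcribing a left-recursive rule directly into LL-style recursive descent recurses before consuming any input, so the rule has to be rewritten. A minimal Python sketch of the rewritten form:

    # expr : expr '-' NUM   transcribed directly would call parse_expr()
    # before consuming a token and never terminate; the standard hack is
    # expr : NUM ('-' NUM)*  with a loop that rebuilds left associativity.

    def parse_expr(toks):
        """toks: token list such as ['1', '-', '2', '-', '3'], consumed in place."""
        left = ("num", toks.pop(0))
        while toks and toks[0] == "-":
            toks.pop(0)
            left = ("-", left, ("num", toks.pop(0)))
        return left

    assert parse_expr(["1", "-", "2", "-", "3"]) == \
        ("-", ("-", ("num", "1"), ("num", "2")), ("num", "3"))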

You seem to be worried about mixing syntax and parser-actions used to build the trees.

If the tree you build isn't a direct function of the syntax, there has to be some way to tie the tree-building machinery to the grammar productions. Placing it "near" the grammar production is one way, but leads to your "mixed" notation objection.
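
For a concrete taste of that "near the production" style, here is what it looks like in PLY, a Python yacc clone (assuming PLY is installed); the semantic action sits right inside each grammar rule:

    import ply.lex as lex
    import ply.yacc as yacc

    tokens = ("NUMBER", "PLUS")
    t_PLUS = r"\+"
    t_ignore = " "

    def t_NUMBER(t):
        r"\d+"
        t.value = int(t.value)
        return t

    def t_error(t):
        t.lexer.skip(1)

    def p_expr_plus(p):
        "expr : expr PLUS NUMBER"
        p[0] = p[1] + p[3]          # semantic action embedded in the rule

    def p_expr_number(p):
        "expr : NUMBER"
        p[0] = p[1]

    def p_error(p):
        raise SyntaxError(p)

    lex.lex()
    parser = yacc.yacc()
    assert parser.parse("1 + 2 + 3") == 6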

Another way is to give each rule a name (or some unique identifier), and set the tree-building machinery off to the side, indexed by the names. This way your grammar isn't contaminated with the "other stuff", which seems to be your objection. None of the parser-generator systems I know of do this. An awkward issue is that you now have to invent lots of rule names; anytime you have a few hundred names, that's inconvenient by itself, and it is hard to make them mnemonic.
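
Concretely, that off-to-the-side scheme might look like this sketch (hypothetical rule names and tree shapes; again, no existing generator I know of works this way):

    # Hypothetical "actions off to the side": the grammar file carries only
    # named rules; the tree builders live in a table indexed by those names.
    def build_add(children):
        left, _plus, right = children
        return ("add", left, right)

    def build_num(children):
        return ("num", children[0])

    ACTIONS = {
        "expr_add": build_add,      # expr_add : expr '+' term
        "expr_num": build_num,      # expr_num : NUMBER
    }

    # A generated parser would invoke ACTIONS[rule_name](children) at each
    # reduction; multiplying such names into the hundreds is the drawback.
    assert ACTIONS["expr_add"]([("num", "1"), "+", ("num", "2")]) == \
        ("add", ("num", "1"), ("num", "2"))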

A third way is to make the tree a direct function of the syntax, and auto-generate the tree-building steps. This requires no extra stuff off to the side at all to produce the ASTs. The only tool I know that does it (there may be others, but I've been looking for 20-odd years and haven't seen one) is my company's product, the DMS Software Reengineering Toolkit. [DMS isn't just a parser generator; it is a complete ecosystem for building program analysis and transformation tools for arbitrary languages, using a GLR parsing engine; yes, it handles Python-style indents.]

One objection is that such trees are concrete, bloated, and confusing; if done right, that's not true.
My SO answer to the question
"What is the difference between an Abstract Syntax Tree and a Concrete Syntax Tree?" discusses how we get the benefits of ASTs from automatically generated compressed CSTs.
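
The flavor of that CST compression can be sketched in a few lines of Python (hypothetical node shapes, not DMS's actual machinery): the generator makes one node per production, then drops punctuation tokens and elides unit productions.

    # One node per production, generated automatically, then compressed.
    PUNCTUATION = {"(", ")", "+", "-", "*", "/", ";"}

    def make_node(rule, children):
        kept = [c for c in children if c not in PUNCTUATION]
        if len(kept) == 1:          # unit production such as expr -> term:
            return kept[0]          # elide the wrapper node entirely
        return (rule, kept)

    # expr -> '(' expr ')' collapses to the inner expr node:
    assert make_node("expr", ["(", ("num", "7"), ")"]) == ("num", "7")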

The good news about DMS's scheme is that the basic grammar isn't bloated with parsing support. The not so good news is that you will find lots of other things you want to associate with grammar rules (prettyprinting rules, attribute computations, tree synthesis,...) and you come right back around to the same choices. DMS has all of these "other things" and solves the association problem a number of ways:

  • By placing other related descriptive formalisms next to the grammar rule (producing the mixing you complained about). We tolerate this for pretty-printing rules because in fact it is nice to have the grammar (parse) rule adjacent to the pretty-print (anti-parse) rule. We also allow attribute computations to be placed near the grammar rules to provide an association.

  • While DMS allows rules to have names, this is only for convenient access by procedural code, not for associating other mechanisms with the rule.

  • DMS provides a third way to associate these mechanisms (esp. attribute-grammar computations): using the rule itself as a kind of giant name. So, you write the grammar and prettyprint rules in one place, and somewhere else you can write the grammar rule again with an associated attribute computation. In principle, this is just like giving each rule a name (well, a signature) and associating the computation with the name, as sketched after this list. But it also allows us to define many, many different attribute computations (for different purposes) and associate them with their rules, without cluttering up the base grammar. Our tools check that a (rule, associated-computation) pair has a valid rule in the base grammar, so it is relatively easy to track down what needs fixing when the base grammar changes.
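
A toy rendering of that rule-as-name association (hypothetical data layout; DMS's real notation is its own):

    # The base grammar in one place...
    BASE_GRAMMAR = {
        "expr = expr '+' term",
        "expr = term",
    }

    # ...and, somewhere else entirely, one of many attribute computations,
    # keyed by the full rule text acting as the rule's "giant name".
    TYPE_OF = {
        "expr = expr '+' term":
            lambda lt, rt: "int" if lt == rt == "int" else "type-error",
    }

    # The consistency check that makes grammar changes easy to chase down:
    stale = set(TYPE_OF) - BASE_GRAMMAR
    assert not stale, f"computations whose rules left the grammar: {stale}"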

This being my tool (I'm the architect), you shouldn't take this as a recommendation, just a bias. That bias is supported by DMS's ability to parse (without whimpering) C, C++, Java, C#, IBM Enterprise COBOL, Python, F77/F90/F95 (with column-6 continuations/F90 continuations, and embedded C preprocessor directives to boot, under most circumstances), Mumps, PHP4/5, and many other languages.

冷弦 2024-11-12 05:48:35


First off, any decent parser generator is going to be robust enough to support Python's indenting. That isn't really all that weird as languages go. You should try parsing column-sensitive languages like Fortran77 some time...
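
The standard trick, for what it's worth, is a lexer pass that converts changes in leading whitespace into INDENT/DEDENT tokens, so the grammar proper stays context-free. A minimal sketch:

    def with_indent_tokens(lines):
        stack = [0]                         # stack of currently open depths
        for line in lines:
            if not line.strip():
                continue                    # blank lines don't change depth
            depth = len(line) - len(line.lstrip(" "))
            if depth > stack[-1]:
                stack.append(depth)
                yield "INDENT"
            while depth < stack[-1]:
                stack.pop()
                yield "DEDENT"
            yield line.strip()
        while len(stack) > 1:               # close anything still open at EOF
            stack.pop()
            yield "DEDENT"

    src = ["if x:", "    y = 1", "z = 2"]
    assert list(with_indent_tokens(src)) == \
        ["if x:", "INDENT", "y = 1", "DEDENT", "z = 2"]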

Secondly, I don't think you really need the parser itself to be "extensible" do you? You just want to be able to use it to lex and parse the language or two you have in mind, right? Again, any decent parser-generator can do that.

Thirdly, you don't really say what it is about the mix between grammar and code that you don't like. Would you rather it all be implemented in a meta-language (kinda tough), or all in code?

Assuming it is the latter, there are a couple of in-language parser generator toolkits I know of. The first is Boost's Spirit, which is implemented in C++. I've used it, and it works. However, back when I used it you pretty much needed a graduate degree in "boostology" to be able to understand its error messages well enough to get anything working in a reasonable amount of time.

The other I know about is OpenToken, which is a parser-generation toolkit implemented in Ada. Ada doesn't have the error-novel problem that C++ has with its templates, so OpenToken is far easier to use. However, you have to use it in Ada...

Typical functional languages allow you to implement any sublanguage you like (mostly) within the language itself, thanks to their inherently good support for things like lambdas and metaprogramming. However, their parsers tend to be slower. That's really no problem at all if you are just parsing a configuration file or two. It's a tremendous problem if you are parsing hundreds of files at a go.
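
The combinator style those languages make natural can be sketched even in Python (a toy, not any particular library): each parser maps a token position to a result and a new position, and higher-order functions glue parsers together.

    def lit(s):
        """Parser matching one literal token; returns (value, next_pos) or None."""
        return lambda t, i: (s, i + 1) if i < len(t) and t[i] == s else None

    def seq(p, q):
        """Run p then q, pairing their results."""
        def run(t, i):
            r1 = p(t, i)
            if r1 is None:
                return None
            v1, i = r1
            r2 = q(t, i)
            if r2 is None:
                return None
            v2, i = r2
            return ((v1, v2), i)
        return run

    def alt(p, q):
        """Try p; on failure, try q at the same position."""
        return lambda t, i: p(t, i) or q(t, i)

    greeting = seq(alt(lit("hi"), lit("hello")), lit("world"))
    assert greeting(["hello", "world"], 0) == (("hello", "world"), 2)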
