Which parser generators with code separation and language extensibility would you recommend?

Posted 2024-11-05 05:48:35


I'm looking for a context-free grammar parser generator with grammar/code separation and the possibility of adding support for new target languages. For instance, if I want a parser in Pascal, I can write my own Pascal code generator without reimplementing the whole thing.

I understand that most open-source parser generators can in theory be extended; still, I'd prefer something that has extensibility planned and documented.

Feature-wise I need the parser to at least support Python-style indentation, maybe with some additional work. No requirement on the type of parser generated, but I'd prefer something fast.

Which are the most well-known/maintained options?

Popular parser generators seem mostly to use a mixed grammar/code approach, which I really don't like. The comparison list on Wikipedia lists a few, but I'm a novice at this and can't tell which to try.

Why I don't like mixing grammar/code: because this approach seems like a mess. Grammar is grammar, implementation details are implementation details. They're different things written in different languages, it's intuitive to keep them in separate places.

What if I want to reuse parts of grammar in another project, with different implementation details? What if I want to compile a parser in a different language? All of this requires grammar to be kept separate.
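
To make "grammar/code separation" concrete, here is roughly what I have in mind (a hypothetical sketch in Python, not any particular tool's output): the grammar lives in its own file, and each project supplies its own walker over a generic parse tree.

    # expr.ebnf (separate file, target-language-agnostic, reusable):
    #   expr   : expr ('+'|'-') term | term ;
    #   term   : term ('*'|'/') factor | factor ;
    #   factor : NUMBER | '(' expr ')' ;

    # One project's implementation details, kept apart from the grammar.
    # Assumes the generated parser emits generic (kind, payload) tuples.
    from operator import add, sub, mul, truediv

    OPS = {"+": add, "-": sub, "*": mul, "/": truediv}

    def evaluate(node):
        kind, payload = node
        if kind == "num":
            return float(payload)            # leaf: payload is the lexeme
        if kind == "binop":
            op, left, right = payload
            return OPS[op](evaluate(left), evaluate(right))
        raise ValueError(f"unknown node kind: {kind}")

    # The tree a generator might emit for "2 * (3 + 4)":
    tree = ("binop", ("*", ("num", "2"),
                           ("binop", ("+", ("num", "3"), ("num", "4")))))
    assert evaluate(tree) == 14.0

A Pascal back end (or another project) could then reuse expr.ebnf untouched and supply its own walker.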


Comments (2)

愛放△進行李 2024-11-12 05:48:35


Most parser generators won't handle context-free grammars. They handle some subset (LL(1), LL(k), LL(*), LALR(1), LR(k), ...). If you choose one of these, you will almost certainly have to hack your grammar to match the limitations of the parser generator (no left recursion, limited lookahead, ...). If you want a real context-free parser generator, you want an Earley parser generator (inefficient), a GLR parser generator (the most practical of the lot), or a PEG parser generator (and the last isn't context-free; it requires rules to be ordered to determine which ones take precedence).
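
To make that grammar hacking concrete: transcribing a left-recursive rule directly into LL-style recursive descent recurses before consuming any input, so the rule has to be rewritten. A minimal Python sketch of the rewritten form:

    # expr : expr '-' NUM   transcribed directly would call parse_expr()
    # before consuming a token and never terminate; the standard hack is
    # expr : NUM ('-' NUM)*  with a loop that rebuilds left associativity.

    def parse_expr(toks):
        """toks: token list such as ['1', '-', '2', '-', '3'], consumed in place."""
        left = ("num", toks.pop(0))
        while toks and toks[0] == "-":
            toks.pop(0)
            left = ("-", left, ("num", toks.pop(0)))
        return left

    assert parse_expr(["1", "-", "2", "-", "3"]) == \
        ("-", ("-", ("num", "1"), ("num", "2")), ("num", "3"))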

You seem to be worried about mixing syntax and parser-actions used to build the trees.

If the tree you build isn't a direct function of the syntax, there has to be some way to tie the tree-building machinery to the grammar productions. Placing it "near" the grammar production is one way, but leads to your "mixed" notation objection.
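
For a concrete taste of that "near the production" style, here is what it looks like in PLY, a Python yacc clone (assuming PLY is installed); the semantic action sits right inside each grammar rule:

    import ply.lex as lex
    import ply.yacc as yacc

    tokens = ("NUMBER", "PLUS")
    t_PLUS = r"\+"
    t_ignore = " "

    def t_NUMBER(t):
        r"\d+"
        t.value = int(t.value)
        return t

    def t_error(t):
        t.lexer.skip(1)

    def p_expr_plus(p):
        "expr : expr PLUS NUMBER"
        p[0] = p[1] + p[3]          # semantic action embedded in the rule

    def p_expr_number(p):
        "expr : NUMBER"
        p[0] = p[1]

    def p_error(p):
        raise SyntaxError(p)

    lex.lex()
    parser = yacc.yacc()
    assert parser.parse("1 + 2 + 3") == 6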

Another way is to give each rule a name (or some unique identifier), and set the tree-building machinery off to the side, indexed by the names. This way your grammar isn't contaminated with the "other stuff", which seems to be your objection. None of the parser-generator systems I know of do this. An awkward issue is that you now have to invent lots of rule names; anytime you have a few hundred names, that's inconvenient by itself, and it is hard to make them mnemonic.
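
Concretely, that off-to-the-side scheme might look like this sketch (hypothetical rule names and tree shapes; again, no existing generator I know of works this way):

    # Hypothetical "actions off to the side": the grammar file carries only
    # named rules; the tree builders live in a table indexed by those names.
    def build_add(children):
        left, _plus, right = children
        return ("add", left, right)

    def build_num(children):
        return ("num", children[0])

    ACTIONS = {
        "expr_add": build_add,      # expr_add : expr '+' term
        "expr_num": build_num,      # expr_num : NUMBER
    }

    # A generated parser would invoke ACTIONS[rule_name](children) at each
    # reduction; multiplying such names into the hundreds is the drawback.
    assert ACTIONS["expr_add"]([("num", "1"), "+", ("num", "2")]) == \
        ("add", ("num", "1"), ("num", "2"))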

A third way is to make the tree a direct function of the syntax, and auto-generate the tree-building steps. This requires no extra stuff off to the side at all to produce the ASTs. The only tool I know that does it (there may be others, but I've been looking for 20-odd years and haven't seen one) is my company's product, the DMS Software Reengineering Toolkit. [DMS isn't just a parser generator; it is a complete ecosystem for building program analysis and transformation tools for arbitrary languages, using a GLR parsing engine; yes, it handles Python-style indents.]

One objection is that such trees are concrete, bloated, and confusing; if done right, that's not true.
My SO answer to the question
"What is the difference between an Abstract Syntax Tree and a Concrete Syntax Tree?" discusses how we get the benefits of ASTs from automatically generated compressed CSTs.
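
The flavor of that CST compression can be sketched in a few lines of Python (hypothetical node shapes, not DMS's actual machinery): the generator makes one node per production, then drops punctuation tokens and elides unit productions.

    # One node per production, generated automatically, then compressed.
    PUNCTUATION = {"(", ")", "+", "-", "*", "/", ";"}

    def make_node(rule, children):
        kept = [c for c in children if c not in PUNCTUATION]
        if len(kept) == 1:          # unit production such as expr -> term:
            return kept[0]          # elide the wrapper node entirely
        return (rule, kept)

    # expr -> '(' expr ')' collapses to the inner expr node:
    assert make_node("expr", ["(", ("num", "7"), ")"]) == ("num", "7")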

The good news about DMS's scheme is that the basic grammar isn't bloated with parsing support. The not so good news is that you will find lots of other things you want to associate with grammar rules (prettyprinting rules, attribute computations, tree synthesis,...) and you come right back around to the same choices. DMS has all of these "other things" and solves the association problem a number of ways:

  • By placing other related descriptive formalisms next to the grammar rule (producing the mixing you complained about). We tolerate this for pretty-printing rules because in fact it is nice to have the grammar (parse) rule adjacent to the pretty-print (anti-parse) rule. We also allow attribute computations to be placed near the grammar rules to provide an association.

  • While DMS allows rules to have names, this is only for convenient access by procedural code, not for associating other mechanisms with the rule.

  • DMS provides a third way to associate these mechanisms (esp. attribute-grammar computations): using the rule itself as a kind of giant name. So, you write the grammar and prettyprint rules in one place, and somewhere else you can write the grammar rule again with an associated attribute computation. In principle, this is just like giving each rule a name (well, a signature) and associating the computation with the name, as sketched after this list. But it also allows us to define many, many different attribute computations (for different purposes) and associate them with their rules, without cluttering up the base grammar. Our tools check that a (rule, associated-computation) pair has a valid rule in the base grammar, so it is relatively easy to track down what needs fixing when the base grammar changes.
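
A toy rendering of that rule-as-name association (hypothetical data layout; DMS's real notation is its own):

    # The base grammar in one place...
    BASE_GRAMMAR = {
        "expr = expr '+' term",
        "expr = term",
    }

    # ...and, somewhere else entirely, one of many attribute computations,
    # keyed by the full rule text acting as the rule's "giant name".
    TYPE_OF = {
        "expr = expr '+' term":
            lambda lt, rt: "int" if lt == rt == "int" else "type-error",
    }

    # The consistency check that makes grammar changes easy to chase down:
    stale = set(TYPE_OF) - BASE_GRAMMAR
    assert not stale, f"computations whose rules left the grammar: {stale}"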

This being my tool (I'm the architect), you shouldn't take this as a recommendation, just a bias. That bias is supported by DMS's ability to parse (without whimpering) C, C++, Java, C#, IBM Enterprise COBOL, Python, F77/F90/F95 (with column-6 continuations/F90 continuations, and embedded C preprocessor directives to boot, under most circumstances), Mumps, PHP4/5, and many other languages.

冷弦 2024-11-12 05:48:35


First off, any decent parser generator is going to be robust enough to support Python's indenting. That isn't really all that weird as languages go. You should try parsing column-sensitive languages like Fortran77 some time...
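
The standard trick, for what it's worth, is a lexer pass that converts changes in leading whitespace into INDENT/DEDENT tokens, so the grammar proper stays context-free. A minimal sketch:

    def with_indent_tokens(lines):
        stack = [0]                         # stack of currently open depths
        for line in lines:
            if not line.strip():
                continue                    # blank lines don't change depth
            depth = len(line) - len(line.lstrip(" "))
            if depth > stack[-1]:
                stack.append(depth)
                yield "INDENT"
            while depth < stack[-1]:
                stack.pop()
                yield "DEDENT"
            yield line.strip()
        while len(stack) > 1:               # close anything still open at EOF
            stack.pop()
            yield "DEDENT"

    src = ["if x:", "    y = 1", "z = 2"]
    assert list(with_indent_tokens(src)) == \
        ["if x:", "INDENT", "y = 1", "DEDENT", "z = 2"]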

Secondly, I don't think you really need the parser itself to be "extensible" do you? You just want to be able to use it to lex and parse the language or two you have in mind, right? Again, any decent parser-generator can do that.

Thirdly, you don't really say what it is about the mix between grammar and code that you don't like. Would you rather it all be implemented in a meta-language (kinda tough), or all in code?

Assuming it is the latter, there are a couple of in-language parser generator toolkits I know of. The first is Boost's Spirit, which is implemented in C++. I've used it, and it works. However, back when I used it you pretty much needed a graduate degree in "boostology" to be able to understand its error messages well enough to get anything working in a reasonable amount of time.

The other I know about is OpenToken, which is a parser-generation toolkit implemented in Ada. Ada doesn't have the error-novel problem that C++ has with its templates, so OpenToken is far easier to use. However, you have to use it in Ada...

Typical functional languages allow you to implement any sublanguage you like (mostly) within the language itself, thanks to their inherently good support for things like lambdas and metaprogramming. However, their parsers tend to be slower. That's really no problem at all if you are just parsing a configuration file or two. It's a tremendous problem if you are parsing hundreds of files at a go.
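
The combinator style those languages make natural can be sketched even in Python (a toy, not any particular library): each parser maps a token position to a result and a new position, and higher-order functions glue parsers together.

    def lit(s):
        """Parser matching one literal token; returns (value, next_pos) or None."""
        return lambda t, i: (s, i + 1) if i < len(t) and t[i] == s else None

    def seq(p, q):
        """Run p then q, pairing their results."""
        def run(t, i):
            r1 = p(t, i)
            if r1 is None:
                return None
            v1, i = r1
            r2 = q(t, i)
            if r2 is None:
                return None
            v2, i = r2
            return ((v1, v2), i)
        return run

    def alt(p, q):
        """Try p; on failure, try q at the same position."""
        return lambda t, i: p(t, i) or q(t, i)

    greeting = seq(alt(lit("hi"), lit("hello")), lit("world"))
    assert greeting(["hello", "world"], 0) == (("hello", "world"), 2)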
