我最近一直在尝试自学解析器(用于语言/上下文无关语法)如何工作,除了一件事之外,大多数内容似乎都是有意义的。 我特别关注LL(k)语法,其中两个主要算法似乎是LL 解析器(使用堆栈/解析表)和 递归下降解析器(简单地使用递归)。
据我所知,递归下降算法适用于所有 LL(k) 语法,甚至可能更多,而 LL 解析器适用于所有 LL(k) 语法。 然而,递归下降解析器的实现显然比 LL 解析器简单得多(就像 LL 解析器比 LR 解析器简单一样)。
所以我的问题是,使用这两种算法时可能会遇到哪些优点/问题? 为什么人们会选择 LL 而不是递归下降,因为它适用于同一组语法并且实现起来更棘手?
I've recently being trying to teach myself how parsers (for languages/context-free grammars) work, and most of it seems to be making sense, except for one thing. I'm focusing my attention in particular on LL(k) grammars, for which the two main algorithms seem to be the LL parser (using stack/parse table) and the Recursive Descent parser (simply using recursion).
As far as I can see, the recursive descent algorithm works on all LL(k) grammars and possibly more, whereas an LL parser works on all LL(k) grammars. A recursive descent parser is clearly much simpler than an LL parser to implement, however (just as an LL one is simpler than an LR one).
So my question is, what are the advantages/problems one might encounter when using either of the algorithms? Why might one ever pick LL over recursive descent, given that it works on the same set of grammars and is trickier to implement?
发布评论
评论(1)
LL 通常是比递归下降更有效的解析技术。 事实上,在最坏的情况下,一个简单的递归下降解析器实际上将是 O(k^n) (其中 n 是输入大小)。 某些技术,例如记忆化(生成 Packrat 解析器)可以改进这一点并扩展解析器接受的语法类,但总是存在空间权衡。 LL 解析器(据我所知)始终是线性时间。
另一方面,您的直觉是正确的,即递归下降解析器可以处理比 LL 更大的语法类。 递归下降可以处理任何 LL(*) 语法(即无限前瞻)以及一小组模糊语法。 这是因为递归下降实际上是 PEG 的直接编码实现,或解析器表达式语法。 具体来说,析取运算符 (
a | b
) 不可交换,这意味着a | b
b 不等于b | 一个。 递归下降解析器将按顺序尝试每个替代方案。 因此,如果
a
与输入匹配,即使b
与输入匹配,它也会成功。 这允许简单地通过正确排序析取来处理经典的“最长匹配”歧义,例如悬空else
问题。综上所述,使用递归下降实现 LL(k) 解析器是有可能的,这样它就能以线性时间运行。 这是通过本质上内联预测集来完成的,以便每个解析例程在恒定时间内确定给定输入的适当产生式。 不幸的是,这种技术消除了对整个语法类的处理。 一旦我们进入预测解析,像悬挂
else
这样的问题就不再那么容易解决了。至于为什么选择LL而不是递归下降,主要是效率和可维护性的问题。 递归下降解析器明显更容易实现,但它们通常更难维护,因为它们表示的语法不以任何声明形式存在。 大多数重要的解析器用例都会使用解析器生成器,例如 ANTLR 或 Bison。 有了这样的工具,算法是直接编码的递归下降还是表驱动的 LL(k) 并不重要。
出于兴趣,还值得研究 recursive-ascent,这是一种解析算法按照递归下降的方式直接编码,但能够处理任何 LALR 语法。 我还会深入研究解析器组合器,这是将递归下降解析器组合在一起的一种功能方式。
LL is usually a more efficient parsing technique than recursive-descent. In fact, a naive recursive-descent parser will actually be O(k^n) (where n is the input size) in the worst case. Some techniques such as memoization (which yields a Packrat parser) can improve this as well as extend the class of grammars accepted by the parser, but there is always a space tradeoff. LL parsers are (to my knowledge) always linear time.
On the flip side, you are correct in your intuition that recursive-descent parsers can handle a greater class of grammars than LL. Recursive-descent can handle any grammar which is LL(*) (that is, unlimited lookahead) as well as a small set of ambiguous grammars. This is because recursive-descent is actually a directly-encoded implementation of PEGs, or Parser Expression Grammar(s). Specifically, the disjunctive operator (
a | b
) is not commutative, meaning thata | b
does not equalb | a
. A recursive-descent parser will try each alternative in order. So ifa
matches the input, it will succeed even ifb
would have matched the input. This allows classic "longest match" ambiguities like the danglingelse
problem to be handled simply by ordering disjunctions correctly.With all of that said, it is possible to implement an LL(k) parser using recursive-descent so that it runs in linear time. This is done by essentially inlining the predict sets so that each parse routine determines the appropriate production for a given input in constant time. Unfortunately, such a technique eliminates an entire class of grammars from being handled. Once we get into predictive parsing, problems like dangling
else
are no longer solvable with such ease.As for why LL would be chosen over recursive-descent, it's mainly a question of efficiency and maintainability. Recursive-descent parsers are markedly easier to implement, but they're usually harder to maintain since the grammar they represent does not exist in any declarative form. Most non-trivial parser use-cases employ a parser generator such as ANTLR or Bison. With such tools, it really doesn't matter if the algorithm is directly-encoded recursive-descent or table-driven LL(k).
As a matter of interest, it is also worth looking into recursive-ascent, which is a parsing algorithm directly encoded after the fashion of recursive-descent, but capable of handling any LALR grammar. I would also dig into parser combinators, which are a functional way of composing recursive-descent parsers together.