学习野牛:什么是上下文无关语法和 LALR(1)?
我正在阅读这篇 bison 介绍。
我有两个问题,如果有人能帮助我理解,那就太好了:
术语
上下文无关语法
是什么意思?来自上面的链接:并非所有上下文无关语言都可以由 Bison 处理,只能处理那些 LALR(1) 的语言。简而言之,这意味着必须能够仅使用单个前瞻标记来告诉如何解析输入字符串的任何部分。 “仅用一个前瞻标记就可以告诉如何解析输入字符串的任何部分”是什么意思
I am reading this bison introduction.
I have two questions and it will be great if someone can help me understand:
What does term
context free grammar
mean?From the link above: Not all context-free languages can be handled by Bison, only those that are LALR(1). In brief, this means that it must be possible to tell how to parse any portion of an input string with just a single token of look-ahead. What does it mean by "possible to tell how to parse any portion of an input string with just a single token of look-ahead`"
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
上下文无关语法是对一组字符串的描述,它比正则表达式更强大,但仍然很容易被机器处理。更正式地说,上下文无关语法由四部分组成:
一组终结符号,它们是正在生成的字符串的元素。对于野牛解析器,这通常是扫描器生成的标记集。对于像英语这样的自然语言的语法,这可能是所有英语单词的集合。
一组非终结符号。直观上,非终结符号代表某种词性,例如“名词”或“动词”。
一组产生式。每个产生式说明如何用其他一组终结符和非终结符替换非终结符。例如,产生式
Variable ->类型名称
表示,如果我们看到非终结符变量
,我们可以将其替换为字符串类型名称
。一个开始符号,它是派生开始的非终结符。
作为示例,考虑 C 风格函数声明的这个简单的上下文无关语法:
这里,起始符号是
Function
。给定这个语法,我们可以通过重复选择一个非终结符并将其替换为匹配产生式的右侧之一来生成 C 风格的函数声明。在每一步中,我们到目前为止构建的字符串称为句子形式。例如,以下是可以从上述语法导出的一些不同的函数声明:大多数编程语言都具有可以通过上下文无关语法描述的结构。解析器的工作是获取程序和语法,并确定如何通过语法生成该程序。
至于 LALR(1),不幸的是 LALR(1) 的正式定义并不简单。我刚刚教授完编译器构建课程,我们在第一次花了两堂课讨论相关解析技术后才能够谈论 LALR(1)。如果您想对该材料进行正式介绍,可以查看我关于自下而上解析的幻灯片在课程网站。
LALR(1) 是一种称为自下而上分析算法的分析算法,这意味着它尝试以相反的顺序应用语法的产生式来减少程序回到开始符号。例如,让我们考虑一下由上述语法生成的字符串:
在自下而上的解析中,我们将通过一次查看程序一个标记来解析该字符串。每当我们发现某些东西可以反转回某个非终结符时,我们就会这样做。 (更准确地说,LALR(1) 仅在满足其他条件时才进行此类缩减,以便算法具有更多上下文,但对于本示例,我们不需要担心这一点)。解析中的每一步都被称为shift或reduce。 转变意味着我们会再查看输入的一个标记,以获取有关要应用哪些缩减的更多信息。 归约意味着我们采用一定数量的标记和非终结符并反转产生式以返回到某些非终结符。
以下是字符串自下而上解析的轨迹:
了解移位和归约非常重要,因为当使用野牛。这些错误意味着解析器生成器确定解析器可能会进入一种状态,无法判断是移位还是归约,或者应该执行两个归约中的哪一个。有关如何处理此问题的更多详细信息,请参阅野牛手册。
如果您想了解有关上下文无关语法和解析算法的更多信息,可以阅读一本名为 解析技术:实用指南,第二版,由 Grune 和 Jacobs 编写,迄今为止,它对我见过的材料。它涵盖了各种解析算法,包括许多比 LALR(1) 更强大且开始得到更广泛使用的技术(例如 GLR 和 Earley)。我强烈推荐这本书——这是我如此了解解析的主要原因!
希望这有帮助!
A context-free grammar is a description of a set of strings that is strictly more powerful than the regular expressions, but still easily handled by a machine. More formally, a context-free grammar consists of four things:
A set of terminal symbols, which are the elements of the strings being produced. For a bison parser, this is typically the set of the tokens produced by the scanner. For a grammar for a natural language like English, this might be the set of all English words.
A set of nonterminal symbols. Intuitively, a nonterminal symbol represents something like a part of speech, such as "noun" or "verb."
A set of productions. Each production says how to replace a nonterminal symbol with some other set of terminals and nonterminals. For example, the production
Variable -> Type Name
says that if we see the nonterminalVariable
, we may replace it with the stringType Name
.A start symbol, which is the nonterminal from which the derivation begins.
As an example, consider this simple context-free grammar for C-style function declarations:
Here, the start symbol is
Function
. Given this grammar, we can produce C-style function declarations by repeatedly picking a nonterminal symbol and replacing it with one of the right-hand sides of a matching production. At each step, the string we've built so far is called a sentential form. For example, here are some different function declarations that can be derived from the above grammar:Most programming languages have a structure that can be described by a context-free grammar. The job of the parser is to take a program and a grammar and to determine how that program can be generated by the grammar.
As for LALR(1), unfortunately the formal definition of LALR(1) is not trivial. I just finished teaching a compiler construction course and we were only able to talk about LALR(1) after first spending two lectures discussing related parsing techniques. If you'd like a formal introduction to the material, my slides on bottom-up parsing are available at the course website.
LALR(1) is a type of parsing algorithm called a bottom-up parsing algorithm, which means that it tries to apply the productions of the grammar in reverse order to reduce the program back to the start symbol. For example, let's consider this string, which is generated by the above grammar:
In a bottom-up parse, we would parse this string by looking at the program one token at a time. Whenever we found something that could be reversed back to some nonterminal, we do so. (To be more precise, LALR(1) only does these sorts of reductions when other criteria are met as well so that the algorithm has more context, but for this example we don't need to worry about this). Each step in the parse is said either to be a shift or a reduce. A shift means that we look at one more token of the input to gain more information about what reductions to apply. A reduce means that we take some number of the tokens and nonterminals and reverse a production to get back to some nonterminal.
Here's a trace of the bottom-up parse of the string:
It's important to know about shifts and reductions because you will invariable encounter shift/reduce conflicts and reduce/reduce conflicts when using bison. These errors mean that the parser generator determined that the parser might get into a state where it can't tell whether to shift or reduce, or which of two reductions it's supposed to perform. Consult the bison manual for more details about how to deal with this.
If you'd like to learn more about context-free grammars and parsing algorithms in general, there is an excellent book entitled Parsing Techniques: A Practical Guide, Second Edition by Grune and Jacobs that has, by far, the best treatment of the material I've ever seen. It covers all sorts of parsing algorithms, including many techniques that are substantially more powerful than LALR(1) that are starting to get wider usage (GLR and Earley, for example). I highly recommend this book - it's the main reason I understand parsing so well!
Hope this helps!
1)简单来说——这意味着代码的格式和上下文对于各个部分来说并不重要,并且您不需要看到它是如何格式化的来理解含义。例如,您可以在一行中编写 C 程序,或者将每个单词写在不同的行上,并且该程序的含义相同。 维基百科有更正式的定义。
2) LALR(1) 的含义更复杂,但简单来说我会描述一下通过一次阅读一个单词来逐步理解含义——只看到下一个单词/符号——一些口语以不属于 LALR(1) 而闻名——当你在这种情况下,你无法理解该句子是一个问题还是一个陈述你到了最后的句子。某些编程语言也可以以这种方式构建,但我不知道有哪些,因为出于实际原因,它们都符合 LALR(1) 语法。不过,我确实认为存在例外情况,其中 C/C++ 需要 2 个符号前瞻才能正确解析语句(因此是 LALR(2),但由于我无法记住它们是什么,我希望有人能指出这一点 在任何情况下,
这本书都是当谈到理解编译器和解析器时,圣经。
1) In simple terms -- It means that the formating and context of the code is not important for the individual parts, and you don't have see how it is formatted to understand the meaning. As an example, you could write your C-program in one line, or with every word on a different line and the program would mean the same. Wikipedia have a more formal definition.
2) What LALR(1) means is more complicated, but in simple terms I would describe that as understanding the meaning incrementally by reading one word at the time -- only seeing the next word/symbol -- some spoken languages are famous for not being LALR(1) -- where you unly understand the sentence being a question or a statement when you get to the end of the sentence. Some programming languages could be constructed that way as well, but I'm not aware of any as they for practical reasons all conform to a LALR(1) syntax. I do however think there are exceptions where C/C++ needs a 2 symbol look ahead to correctly parse the statements (hence being LALR(2), but since I cannot remember from top of head what they are, I hope somebody will point that out in a comment.
In any cases this book here is the bible when it comes to understand compilers and parsers.