GCC 和 Clang 解析器真的是手写的吗?
看来 GCC 和 LLVM-Clang 使用的是手写递归下降解析器,而不是机器生成的、基于 Bison-Flex 的自下而上解析。
这里有人可以确认一下情况是这样吗? 如果是这样,为什么主流编译器框架使用手写解析器?
更新:
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(6)
有一个民间定理说 C 很难解析,而 C++ 基本上不可能。
这不是真的。
事实是,如果不破坏解析机制并纠缠符号表数据,使用 LALR(1) 解析器很难解析 C 和 C++。事实上,GCC 曾经使用 YACC 和类似的其他黑客技术来解析它们,是的,它很丑陋。 现在 GCC 使用手写解析器,但仍然使用符号表 hackery。 Clang 人员从未尝试过使用自动解析器生成器;据我所知,Clang 解析器一直是手工编码的递归下降。
事实是,C 和 C++ 相对容易使用更强大的自动生成的解析器进行解析,例如 GLR 解析器,并且您不需要任何技巧。 Elsa C++ 解析器就是一个例子。我们的 C++ 前端 是另一个(我们所有的“编译器”前端也是如此, GLR 是非常精彩的解析技术)。
我们的 C++ 前端不如 GCC 快,而且肯定比 Elsa 慢;我们很少花精力仔细调整它,因为我们还有其他更紧迫的问题(尽管它已被用于数百万行 C++ 代码)。 Elsa 可能比 GCC 慢,只是因为它更通用。鉴于当今的处理器速度,这些差异在实践中可能并不重要。
但如今广泛传播的“真正的编译器”的根源在于 10、20 年前或更早的编译器。效率低下变得更加重要,而且没有人听说过 GLR 解析器,所以人们就做他们知道怎么做的事情。 Clang 肯定是最近出现的,但民间定理在很长一段时间内仍然保留着它们的“说服力”。
你不必再这样做了。您可以非常合理地使用 GLR 和其他此类解析器作为前端,并提高编译器的可维护性。
事实是,获得与友好邻居编译器的行为相匹配的语法是很困难的。虽然几乎所有 C++ 编译器都实现了(大部分)原始标准,但它们也往往有很多暗角扩展,例如 MS 编译器中的 DLL 规范等。如果您有强大的解析引擎,您可以
花时间尝试让最终的语法符合现实,而不是试图改变语法以匹配解析器生成器的限制。
2012 年 11 月编辑:自从编写此答案以来,我们改进了 C++ 前端以处理完整的 C++11,包括 ANSI、GNU 和 MS 变体方言。虽然有很多额外的东西,但我们不必改变我们的解析引擎;我们刚刚修改了语法规则。我们确实必须改变语义分析; C++11 在语义上非常复杂,这项工作耗费了解析器运行的精力。
2015 年 2 月编辑:...现在可以处理完整的 C++14。 (请参阅从 C++ 代码获取人类可读的 AST GLR 解析一段简单的代码,而 C++ 则解析臭名昭著的“最令人烦恼的解析”)。
2017 年 4 月编辑:现在处理(草案)C++17。
There's a folk-theorem that says C is hard to parse, and C++ essentially impossible.
It isn't true.
What is true is that C and C++ are pretty hard to parse using LALR(1) parsers without hacking the parsing machinery and tangling in symbol table data. GCC in fact used to parse them, using YACC and additional hackery like this, and yes it was ugly. Now GCC uses handwritten parsers, but still with the symbol table hackery. The Clang folks never tried to use automated parser generators; AFAIK the Clang parser has always been hand-coded recursive descent.
What is true, is that C and C++ are relatively easy to parse with stronger automatically generated parsers, e.g., GLR parsers, and you don't need any hacks. The Elsa C++ parser is one example of this. Our C++ Front End is another (as are all our "compiler" front ends, GLR is pretty wonderful parsing technology).
Our C++ front end isn't as fast as GCC's, and certainly slower than Elsa; we've put little energy into tuning it carefully because we have other more pressing issues (nontheless it has been used on millions of lines of C++ code). Elsa is likely slower than GCC simply because it is more general. Given processor speeds these days, these differences might not matter a lot in practice.
But the "real compilers" that are widely distributed today have their roots in compilers of 10 or 20 years ago or more. Inefficiencies then mattered much more, and nobody had heard of GLR parsers, so people did what they knew how to do. Clang is certainly more recent, but then folk theorems retain their "persuasiveness" for a long time.
You don't have to do it that way anymore. You can very reasonably use GLR and other such parsers as front ends, with an improvement in compiler maintainability.
What is true, is that getting a grammar that matches your friendly neighborhood compiler's behavior is hard. While virtually all C++ compilers implement (most) of the original standard, they also tend have lots of dark corner extensions, e.g., DLL specifications in MS compilers, etc. If you have a strong parsing engine, you can
spend your time trying to get the final grammar to match reality, rather than trying to bend your grammar to match the limitations of your parser generator.
EDIT November 2012: Since writing this answer, we've improved our C++ front end to handle full C++11, including ANSI, GNU, and MS variant dialects. While there was lots of extra stuff, we don't have to change our parsing engine; we just revised the grammar rules. We did have to change the semantic analysis; C++11 is semantically very complicated, and this work swamps the effort to get the parser to run.
EDIT February 2015: ... now handles full C++14. (See get human readable AST from c++ code for GLR parses of a simple bit of code, and C++'s infamous "most vexing parse").
EDIT April 2017: Now handles (draft) C++17.
是的:
GCC 曾经使用过 yacc (bison) 解析器,但在 3.x 系列中的某个时刻它被手写的递归下降解析器取代:参见 http://gcc.gnu.org/wiki/New_C_Parser 获取相关补丁提交的链接。
Clang 还使用手写的递归下降解析器:请参阅 http://clang.llvm.org/features.html .
Yes:
GCC used a yacc (bison) parser once upon a time, but it was replaced with a hand-written recursive descent parser at some point in the 3.x series: see http://gcc.gnu.org/wiki/New_C_Parser for links to relevant patch submissions.
Clang also uses a hand-written recursive descent parser: see the section "A single unified parser for C, Objective C, C++ and Objective C++" near the end of http://clang.llvm.org/features.html .
Clang 的解析器是一个手写的递归下降解析器,其他几个开源和商业 C 和 C++ 前端也是如此。
Clang 使用递归下降解析器有几个原因:
总的来说,对于 C++ 编译器来说,这并不重要:C++ 的解析部分并不简单,但它仍然是较容易的部分之一,因此保持简单是值得的。语义分析——特别是名称查找、初始化、重载解析和模板实例化——比解析复杂几个数量级。如果您想要证据,请检查 Clang 的“Sema”组件(用于语义分析)及其“Parse”组件(用于解析)中的代码分布和提交情况。
Clang's parser is a hand-written recursive-descent parser, as are several other open-source and commercial C and C++ front ends.
Clang uses a recursive-descent parser for several reasons:
Overall, for a C++ compiler, it just doesn't matter much: the parsing part of C++ is non-trivial, but it's still one of the easier parts, so it pays to keep it simple. Semantic analysis---particularly name lookup, initialization, overload resolution, and template instantiation---is orders of magnitude more complicated than parsing. If you want proof, go check out the distribution of code and commits in Clang's "Sema" component (for semantic analysis) vs. its "Parse" component (for parsing).
那里的答案很奇怪!
C/C++ 语法不是上下文无关的。由于 Foo * bar,它们是上下文相关的;歧义。我们必须构建一个 typedef 列表来知道 Foo 是否是一个类型。
Ira Baxter:我不明白你的 GLR 事情有什么意义。为什么要构建一个包含歧义的解析树。解析意味着解决歧义,构建语法树。您可以在第二遍中解决这些歧义,因此这并不那么难看。对我来说,它丑陋得多......
Yacc 是一个 LR(1) 解析器生成器(或 LALR(1)),但它可以很容易地修改为上下文敏感。而且里面并没有什么丑陋的地方。 Yacc/Bison 是为了帮助解析 C 语言而创建的,所以它可能不是生成 C 解析器的最丑陋的工具......
直到 GCC 3.x,C 解析器由 yacc/bison 生成,并在期间构建了 typedefs 表解析。通过“解析中”typedef 表构建,C 语法变得本地上下文无关,并且进一步成为“本地 LR(1)”。
现在,在 Gcc 4.x 中,它是一个递归下降解析器。它与 Gcc 3.x 中的解析器完全相同,仍然是 LR(1),并且具有相同的语法规则。不同之处在于 yacc 解析器已被手工重写,移位/归约现在隐藏在调用堆栈中,并且没有像 gcc 3.x yacc 中那样的“state454 : if (nextsym == '(') goto state398”解析器,因此更容易修补、处理错误和打印更好的消息,并在解析期间执行一些后续编译步骤,但代价是“易于阅读”的代码少得多。对于 gcc 菜鸟来说,
为什么他们从 yacc 转向递归下降?因为避免 yacc 解析 C++ 是非常必要的,而且因为 GCC 梦想成为多语言编译器,即在它可以编译的不同语言之间共享最大的代码。这就是为什么 C++ 和 C 解析器以相同的方式编写,
比 C 更难解析,因为它不是像 C 那样的“本地”LR(1),甚至不是 LR(k)。
看看
func<4> 2>
这是用 4> 实例化的模板函数。 2,即func<4> 2>
必须读作
func<1>
。这绝对不是LR(1)。现在考虑,func<4> 2> 1> 3> 3> 8> 9> 8> 7> 8>
。这是递归下降可以轻松解决歧义的地方,但代价是更多的函数调用(parse_template_parameter是歧义解析器函数。如果parse_template_parameter(17tokens)失败,请重试parse_template_parameter(15tokens)、parse_template_parameter(13tokens)...直到它起作用)。
我不知道为什么不可能添加到 yacc/bison 递归子语法中,也许这将是 gcc/GNU 解析器开发的下一步?
Weird answers there!
C/C++ grammars aren't context free. They are context sensitive because of the Foo * bar; ambiguity. We have to build a list of typedefs to know if Foo is a type or not.
Ira Baxter: I don't see the point with your GLR thing. Why build a parse tree which comprises ambiguities. Parsing means solving ambiguities, building the syntax tree. You resolve these ambiguities in a second pass, so this isn't less ugly. For me it is far more ugly ...
Yacc is a LR(1) parser generator (or LALR(1)), but it can be easily modified to be context sensitive. And there is nothing ugly in it. Yacc/Bison has been created to help in parsing C language, so probably it isn't the ugliest tool to generate a C parser ...
Until GCC 3.x the C parser is generated by yacc/bison, with typedefs table built during parsing. With "in parse" typedefs table building, C grammar becomes locally context free and furthermore "locally LR(1)".
Now, in Gcc 4.x, it is a recursive descent parser. It is exactly the same parser as in Gcc 3.x, it is still LR(1), and has the same grammar rules. The difference is that the yacc parser has been hand rewritten, the shift/reduce are now hidden in the call stack, and there is no "state454 : if (nextsym == '(') goto state398" as in gcc 3.x yacc's parser, so it is easier to patch, handle errors and print nicer messages, and to perform some of the next compiling steps during parsing. At the price of much less "easy to read" code for a gcc noob.
Why did they switched from yacc to recursive descent? Because it is quite necessary to avoid yacc to parse C++, and because GCC dreams to be multi language compiler, i.e. sharing maximum of code between the different languages it can compile. This is why the C++ and the C parser are written in the same way.
C++ is harder to parse than C because it isn't "locally" LR(1) as C, it is not even LR(k).
Look at
func<4 > 2>
which is a template function instantiated with 4 > 2, i.e.func<4 > 2>
has to be read as
func<1>
. This is definitely not LR(1). Now consider,func<4 > 2 > 1 > 3 > 3 > 8 > 9 > 8 > 7 > 8>
. This is where a recursive descent can easily solve ambiguity, at the price of a few more function calls (parse_template_parameter is the ambiguous parser function. If parse_template_parameter(17tokens) failed, try again parse_template_parameter(15tokens), parse_template_parameter(13tokens)... until it works).
I don't know why it wouldn't be possible to add into yacc/bison recursive sub grammars, maybe this will be the next step in gcc/GNU parser development?
gcc 的解析器是手写的。。我怀疑 clang 也是如此。这可能有以下几个原因:
这可能不是“不是在这里发明的”综合症,而是“没有专门针对我们需要的东西进行优化,所以我们编写了自己的”。
gcc's parser is handwritten.. I suspect the same for clang. This is probably for a few reasons:
This is probably not a case of "not invented here" syndrome, but more along the lines of "there was nothing optimized specifically for what we needed, so we wrote our own".
特别是 Bison,我认为如果不模糊地解析一些东西并稍后进行第二遍,就无法处理语法。
我知道 Haskell 的 Happy 允许使用单子(即状态相关)解析器来解决 C 语法的特定问题,但我知道没有 C 解析器生成器允许用户提供状态单子。
从理论上讲,错误恢复将是有利于手写解析器的一个点,但我使用 GCC/Clang 的经验是错误消息不是特别好。
至于性能——有些说法似乎没有根据。使用解析器生成器生成大型状态机应该会产生
O(n)
的结果,我怀疑解析是许多工具的瓶颈。Bison in particular I don't think can handle the grammar without parsing some things ambiguously and doing a second pass later.
I know Haskell's Happy allows for monadic (i.e. state-dependent) parsers that can resolve the particular issue with C syntax, but I know of no C parser generators that allow a user-supplied state monad.
In theory, error recovery would be a point in favor of a handwritten parser, but my experience with GCC/Clang has been that the error messages are not particularly good.
As for performance - some of the claims seem unsubstantiated. Generating a big state machine using a parser generator should result in something that's
O(n)
and I doubt parsing is the bottleneck in much tooling.