为什么不能 C++ 使用 LR(1) 解析器进行解析?
我正在阅读有关解析器和解析器生成器的内容,并在维基百科的 LR 解析页面中找到了此声明:
许多编程语言都可以使用 LR 解析器的某些变体进行解析。 一个值得注意的例外是 C++。
为什么会这样呢? C++ 的什么特殊属性导致它无法用 LR 解析器进行解析?
使用google,我只发现C可以用LR(1)完美解析,但C++需要LR(∞)。
I was reading about parsers and parser generators and found this statement in wikipedia's LR parsing -page:
Many programming languages can be parsed using some variation of an LR parser. One notable exception is C++.
Why is it so? What particular property of C++ causes it to be impossible to parse with LR parsers?
Using google, I only found that C can be perfectly parsed with LR(1) but C++ requires LR(∞).
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(6)
LR 解析器在设计上无法处理不明确的语法规则。 (在 20 世纪 70 年代,当这些想法被提出时,理论变得更容易了)。
C 和 C++ 都允许以下语句:
它有两种不同的解析:
现在,您可能认为后者很愚蠢,应该被忽略。
大多数人都会同意你的观点; 然而,在某些情况下可能会
有副作用(例如,如果乘法超载)。 但这不是重点。
关键是有两个不同的解析,因此有一个程序
可能意味着不同的事情,具体取决于这个应该如何被解析。
编译器必须在适当的情况下接受适当的信息,并且在没有任何其他信息(例如,x 类型的知识)的情况下,必须收集两者以便稍后决定要做什么。 因此语法必须允许这样做。 这使得语法变得含糊不清。
因此纯 LR 解析无法处理这个问题。 许多其他广泛使用的解析器生成器(例如 Antlr、JavaCC、YACC 或传统的 Bison,甚至 PEG 风格的解析器)也不能以“纯粹”的方式使用。
还有很多更复杂的情况(解析模板语法需要任意前瞻,而 LALR(k) 可以前瞻最多 k 个标记),但只需要一个反例就可以击垮纯 LR(或者其他)解析。
大多数真正的 C/C++ 解析器通过使用一些来处理这个例子
一种具有额外技巧的确定性解析器:它们将解析与符号表交织在一起
集合...这样当遇到“x”时,
解析器知道 x 是否是类型,因此可以
在两个潜在的解析之间进行选择。 但是一个解析器
这样做不是上下文无关的,LR 解析器
(纯粹的等)(充其量)是上下文无关的。
人们可以作弊,并在
LR 解析器来消除歧义。 (此代码通常并不简单)。 大多数其他解析器类型
有一些方法可以在各个点添加语义检查
在解析中,可以用来做到这一点。
如果你作弊得足够多,你可以让 LR 解析器为
C 和 C++。 GCC 的人做了一段时间,但还是放弃了
适合手工编码解析,我想是因为他们想要
更好的错误诊断。
不过,还有另一种方法,既漂亮又干净
无需任何符号表即可很好地解析 C 和 C++
黑客:GLR 解析器。
这些是完整的上下文无关解析器(实际上具有无限的
展望)。 GLR 解析器只接受两种解析,
生成一棵“树”(实际上是一个主要类似于树的有向无环图)
这代表了不明确的解析。
后解析过程可以解决歧义。
我们在 C 和 C++ 前端中使用这种技术
DMS 软件再造工具包(截至 2017 年 6 月)
它们处理 MS 和 GNU 方言中的完整 C++17)。
它们已被用来处理数百万行
大型 C 和 C++ 系统的完整、精确的解析,生成带有完整源代码详细信息的 AST。 (请参阅C++ 最令人烦恼的解析的 AST。)
LR parsers can't handle ambiguous grammar rules, by design. (Made the theory easier back in the 1970s when the ideas were being worked out).
C and C++ both allow the following statement:
It has two different parses:
Now, you might think the latter is stupid and should be ignored.
Most would agree with you; however, there are cases where it might
have a side effect (e.g., if multiply is overloaded). but that isn't the point.
The point is there are two different parses, and therefore a program
can mean different things depending on how this should have been parsed.
The compiler must accept the appropriate one under the appropriate circumstances, and in the absence of any other information (e.g., knowledge of the type of x) must collect both in order to decide later what to do. Thus a grammar must allow this. And that makes the grammar ambiguous.
Thus pure LR parsing can't handle this. Nor can many other widely available parser generators, such as Antlr, JavaCC, YACC, or traditional Bison, or even PEG-style parsers, used in a "pure" way.
There are lots of more complicated cases (parsing template syntax requires arbitrary lookahead, whereas LALR(k) can look ahead at most k tokens), but only it only takes one counterexample to shoot down pure LR (or the others) parsing.
Most real C/C++ parsers handle this example by using some
kind of deterministic parser with an extra hack: they intertwine parsing with symbol table
collection... so that by the time "x" is encountered,
the parser knows if x is a type or not, and can thus
choose between the two potential parses. But a parser
that does this isn't context free, and LR parsers
(the pure ones, etc.) are (at best) context free.
One can cheat, and add per-rule reduction-time semantic checks in the
to LR parsers to do this disambiguation. (This code often isn't simple). Most of the other parser types
have some means to add semantic checks at various points
in the parsing, that can be used to do this.
And if you cheat enough, you can make LR parsers work for
C and C++. The GCC guys did for awhile, but gave it
up for hand-coded parsing, I think because they wanted
better error diagnostics.
There's another approach, though, which is nice and clean
and parses C and C++ just fine without any symbol table
hackery: GLR parsers.
These are full context free parsers (having effectively infinite
lookahead). GLR parsers simply accept both parses,
producing a "tree" (actually a directed acyclic graph that is mostly tree like)
that represents the ambiguous parse.
A post-parsing pass can resolve the ambiguities.
We use this technique in the C and C++ front ends for our
DMS Software Reengineering Tookit (as of June 2017
these handle full C++17 in MS and GNU dialects).
They have been used to process millions of lines
of large C and C++ systems, with complete, precise parses producing ASTs with complete details of the source code. (See the AST for C++'s most vexing parse.)
Lambda the Ultimate 上有一个有趣的线程,讨论了 C++ 的 LALR 语法。
它包含一个博士论文的链接,其中包括C++ 解析的讨论,其中指出:
它继续给出了一些例子(参见pdf第147页)。
例子是:
mean
比较:
mean
(逗号分隔的表达式)。
两个token序列具有相同的初始子序列但不同解析树,它依赖于最后一个元素,在消除歧义的元素之前可以有任意多个标记。
There is an interesting thread on Lambda the Ultimate that discusses the LALR grammar for C++.
It includes a link to a PhD thesis that includes a discussion of C++ parsing, which states that:
It goes on to give a number of examples (see page 147 of the pdf).
The example is:
meaning
Compare to:
meaning
(a comma-separated expression).
The two token sequences have the same initial subsequence but different parse trees, which depend on the last element. There can be arbitrarily many tokens before the disambiguating one.
问题从来不是这样定义的,但它应该很有趣:
为了让“非上下文无关”yacc 解析器可以完美地解析这个新语法,对 C++ 语法进行的最小修改集是什么? (仅使用一种“黑客”:类型名/标识符消歧,解析器通知词法分析器每个 typedef/类/结构)
我看到一些:
Type Type;
被禁止。 声明为类型名的标识符不能成为非类型名标识符(请注意,struct Type Type
没有歧义,仍然可以被允许)。有 3 种类型的
名称标记
:types
:内置类型或由于 typedef/class/struct将模板函数视为不同的标记可以解决
func<
歧义。 如果func
是模板函数名称,则<
必须是模板参数列表的开头,否则func
是函数指针并且<
是比较运算符。Type a(2);
是一个对象实例化。Type a();
和Type a(int)
是函数原型。int (k);
是完全禁止的,应该写成int k;
typedef int func_type();
和typedef int (func_type)();
被禁止。函数 typedef 必须是函数指针 typedef :
typedef int (*func_ptr_type)();
模板递归限制为 1024,否则可以将增加的最大值作为选项传递给编译器。
int a,b,c[9],*d,(*f)(), (*g)()[9], h(char);
也可能被禁止,替换为int a,b,c[9],*d;
int (*f)();
int (*g)()[9];
int h(char);
每个函数原型或函数指针声明一行。
一个高度首选的替代方案是更改糟糕的函数指针语法,
int (MyClass::*MethodPtr)(char*);
重新语法为:
int (MyClass::*)(char*) MethodPtr;
这与强制转换运算符一致
(int (MyClass::*)(char*))
typedef int type, *type_ptr;
一致code> 也可以被禁止:每个 typedef 一行。 这样就变成了typedef int 类型;
typedef int *type_ptr;
sizeof int
,sizeof char
,sizeof long long
和 co。 可以在每个源文件中声明。因此,每个使用
int
类型的源文件应以开头
#type int :signed_integer(4)
和
unsigned_integer(4)
在该#type
指令之外将被禁止这将是向许多 C++ 头文件中存在的愚蠢的
sizeof int
歧义迈出的一大步如果遇到使用不明确语法的 C++ 源代码,实现重新语法的 C++ 的编译器将移动
source。 cpp
也是一个ambiguously_syntax
文件夹,并且会在编译之前自动创建一个明确的翻译source.cpp
。如果您知道一些不明确的 C++ 语法,请添加!
The problem is never defined like this, whereas it should be interesting :
what is the smallest set of modifications to C++ grammar that would be necessary so that this new grammar could be perfectly parsed by a "non-context-free" yacc parser ? (making use only of one 'hack' : the typename/identifier disambiguation, the parser informing the lexer of every typedef/class/struct)
I see a few ones:
Type Type;
is forbidden. An identifier declared as a typename cannot become a non-typename identifier (note thatstruct Type Type
is not ambiguous and could be still allowed).There are 3 types of
names tokens
:types
: builtin-type or because of a typedef/class/structConsidering template-functions as different tokens solves the
func<
ambiguity. Iffunc
is a template-function name, then<
must be the beginning of a template parameter list, otherwisefunc
is a function pointer and<
is the comparison operator.Type a(2);
is an object instantiation.Type a();
andType a(int)
are function prototypes.int (k);
is completely forbidden, should be writtenint k;
typedef int func_type();
andtypedef int (func_type)();
are forbidden.A function typedef must be a function pointer typedef :
typedef int (*func_ptr_type)();
template recursion is limited to 1024, otherwise an increased maximum could be passed as an option to the compiler.
int a,b,c[9],*d,(*f)(), (*g)()[9], h(char);
could be forbidden too, replaced byint a,b,c[9],*d;
int (*f)();
int (*g)()[9];
int h(char);
one line per function prototype or function pointer declaration.
An highly preferred alternative would be to change the awful function pointer syntax,
int (MyClass::*MethodPtr)(char*);
being resyntaxed as:
int (MyClass::*)(char*) MethodPtr;
this being coherent with the cast operator
(int (MyClass::*)(char*))
typedef int type, *type_ptr;
could be forbidden too : one line per typedef. Thus it would becometypedef int type;
typedef int *type_ptr;
sizeof int
,sizeof char
,sizeof long long
and co. could be declared in each source file.Thus, each source file making use of the type
int
should begin with#type int : signed_integer(4)
and
unsigned_integer(4)
would be forbidden outside of that#type
directivethis would be a big step into the stupid
sizeof int
ambiguity present in so many C++ headersThe compiler implementing the resyntaxed C++ would, if encountering a C++ source making use of ambiguous syntax, move
source.cpp
too anambiguous_syntax
folder, and would create automatically an unambiguous translatedsource.cpp
before compiling it.Please add your ambiguous C++ syntaxes if you know some!
正如您在我的此处的答案中看到的,C++ 包含无法由 LL 或 LR 解析器确定性解析的语法,因为类型解析阶段(通常是解析后)改变了操作的顺序,因此改变了 AST 的基本形状(通常是解析后)预计由第一阶段解析提供)。
As you can see in my answer here, C++ contains syntax that cannot be deterministically parsed by an LL or LR parser due to the type resolution stage (typically post-parsing) changing the order of operations, and therefore the fundamental shape of the AST (typically expected to be provided by a first-stage parse).
我认为你已经非常接近答案了。
LR(1) 意味着从左到右解析只需要一个标记来向前查找上下文,而 LR(∞) 意味着无限向前查找。 也就是说,解析器必须知道即将发生的所有事情才能弄清楚它现在在哪里。
I think you are pretty close to the answer.
LR(1) means that parsing from left to right needs only one token to look-ahead for the context, whereas LR(∞) means an infinite look-ahead. That is, the parser would have to know everything that was coming in order to figure out where it is now.
C++ 中的“typedef”问题可以使用 LALR(1) 解析器进行解析,该解析器在解析时构建符号表(不是纯 LALR 解析器)。 “模板”问题可能无法用这种方法解决。 这种 LALR(1) 解析器的优点是语法(如下所示)是 LALR(1) 语法(没有歧义)。
可以毫无问题地解析以下输入:
LRSTAR 解析器生成器 读取上述语法符号并生成一个处理“的解析器” typedef”问题在解析树或 AST 中没有歧义。 (披露:我是 LRSTAR 的创建者。)
The "typedef" problem in C++ can be parsed with an LALR(1) parser that builds a symbol table while parsing (not a pure LALR parser). The "template" problem probably cannot be solved with this method. The advantage of this kind of LALR(1) parser is that the grammar (shown below) is an LALR(1) grammar (no ambiguity).
The following input can be parsed without a problem:
The LRSTAR parser generator reads the above grammar notation and generates a parser that handles the "typedef" problem without ambiguity in the parse tree or AST. (Disclosure: I am the guy who created LRSTAR.)