为什么不能 C++ 使用 LR(1) 解析器进行解析?

发布于 2024-07-07 20:10:22 字数 239 浏览 17 评论 0原文

我正在阅读有关解析器和解析器生成器的内容,并在维基百科的 LR 解析页面中找到了此声明:

许多编程语言都可以使用 LR 解析器的某些变体进行解析。 一个值得注意的例外是 C++。

为什么会这样呢? C++ 的什么特殊属性导致它无法用 LR 解析器进行解析?

使用google,我只发现C可以用LR(1)完美解析,但C++需要LR(∞)。

I was reading about parsers and parser generators and found this statement in wikipedia's LR parsing -page:

Many programming languages can be parsed using some variation of an LR parser. One notable exception is C++.

Why is it so? What particular property of C++ causes it to be impossible to parse with LR parsers?

Using google, I only found that C can be perfectly parsed with LR(1) but C++ requires LR(∞).

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(6

枕梦 2024-07-14 20:10:22

LR 解析器在设计上无法处理不明确的语法规则。 (在 20 世纪 70 年代,当这些想法被提出时,理论变得更容易了)。

C 和 C++ 都允许以下语句:

x * y ;

它有两种不同的解析:

  1. 它可以是 y 的声明,作为指向类型 x 的指针
  2. 它可以是 x 和 y 的乘法,丢弃答案。

现在,您可能认为后者很愚蠢,应该被忽略。
大多数人都会同意你的观点; 然而,在某些情况下可能会
有副作用(例如,如果乘法超载)。 但这不是重点。
关键是有两个不同的解析,因此有一个程序
可能意味着不同的事情,具体取决于这个应该如何被解析。

编译器必须在适当的情况下接受适当的信息,并且在没有任何其他信息(例如,x 类型的知识)的情况下,必须收集两者以便稍后决定要做什么。 因此语法必须允许这样做。 这使得语法变得含糊不清。

因此纯 LR 解析无法处理这个问题。 许多其他广泛使用的解析器生成器(例如 Antlr、JavaCC、YACC 或传统的 Bison,甚至 PEG 风格的解析器)也不能以“纯粹”的方式使用。

还有很多更复杂的情况(解析模板语法需要任意前瞻,而 LALR(k) 可以前瞻最多 k 个标记),但只需要一个反例就可以击垮 LR(或者其他)解析。

大多数真正的 C/C++ 解析器通过使用一些来处理这个例子
一种具有额外技巧的确定性解析器:它们将解析与符号表交织在一起
集合...这样当遇到“x”时,
解析器知道 x 是否是类型,因此可以
在两个潜在的解析之间进行选择。 但是一个解析器
这样做不是上下文无关的,LR 解析器
(纯粹的等)(充其量)是上下文无关的。

人们可以作弊,并在
LR 解析器来消除歧义。 (此代码通常并不简单)。 大多数其他解析器类型
有一些方法可以在各个点添加语义检查
在解析中,可以用来做到这一点。

如果你作弊得足够多,你可以让 LR 解析器为
C 和 C++。 GCC 的人做了一段时间,但还是放弃了
适合手工编码解析,我想是因为他们想要
更好的错误诊断。

不过,还有另一种方法,既漂亮又干净
无需任何符号表即可很好地解析 C 和 C++
黑客:GLR 解析器
这些是完整的上下文无关解析器(实际上具有无限的
展望)。 GLR 解析器只接受两种解析,
生成一棵“树”(实际上是一个主要类似于树的有向无环图)
这代表了不明确的解析。
后解析过程可以解决歧义。

我们在 C 和 C++ 前端中使用这种技术
DMS 软件再造工具包(截至 2017 年 6 月)
它们处理 MS 和 GNU 方言中的完整 C++17)。
它们已被用来处理数百万行
大型 C 和 C++ 系统的完整、精确的解析,生成带有完整源代码详细信息的 AST。 (请参阅C++ 最令人烦恼的解析的 AST。

LR parsers can't handle ambiguous grammar rules, by design. (Made the theory easier back in the 1970s when the ideas were being worked out).

C and C++ both allow the following statement:

x * y ;

It has two different parses:

  1. It can be the declaration of y, as pointer to type x
  2. It can be a multiply of x and y, throwing away the answer.

Now, you might think the latter is stupid and should be ignored.
Most would agree with you; however, there are cases where it might
have a side effect (e.g., if multiply is overloaded). but that isn't the point.
The point is there are two different parses, and therefore a program
can mean different things depending on how this should have been parsed.

The compiler must accept the appropriate one under the appropriate circumstances, and in the absence of any other information (e.g., knowledge of the type of x) must collect both in order to decide later what to do. Thus a grammar must allow this. And that makes the grammar ambiguous.

Thus pure LR parsing can't handle this. Nor can many other widely available parser generators, such as Antlr, JavaCC, YACC, or traditional Bison, or even PEG-style parsers, used in a "pure" way.

There are lots of more complicated cases (parsing template syntax requires arbitrary lookahead, whereas LALR(k) can look ahead at most k tokens), but only it only takes one counterexample to shoot down pure LR (or the others) parsing.

Most real C/C++ parsers handle this example by using some
kind of deterministic parser with an extra hack: they intertwine parsing with symbol table
collection... so that by the time "x" is encountered,
the parser knows if x is a type or not, and can thus
choose between the two potential parses. But a parser
that does this isn't context free, and LR parsers
(the pure ones, etc.) are (at best) context free.

One can cheat, and add per-rule reduction-time semantic checks in the
to LR parsers to do this disambiguation. (This code often isn't simple). Most of the other parser types
have some means to add semantic checks at various points
in the parsing, that can be used to do this.

And if you cheat enough, you can make LR parsers work for
C and C++. The GCC guys did for awhile, but gave it
up for hand-coded parsing, I think because they wanted
better error diagnostics.

There's another approach, though, which is nice and clean
and parses C and C++ just fine without any symbol table
hackery: GLR parsers.
These are full context free parsers (having effectively infinite
lookahead). GLR parsers simply accept both parses,
producing a "tree" (actually a directed acyclic graph that is mostly tree like)
that represents the ambiguous parse.
A post-parsing pass can resolve the ambiguities.

We use this technique in the C and C++ front ends for our
DMS Software Reengineering Tookit (as of June 2017
these handle full C++17 in MS and GNU dialects).
They have been used to process millions of lines
of large C and C++ systems, with complete, precise parses producing ASTs with complete details of the source code. (See the AST for C++'s most vexing parse.)

一直在等你来 2024-07-14 20:10:22

Lambda the Ultimate 上有一个有趣的线程,讨论了 C++ 的 LALR 语法

它包含一个博士论文的链接,其中包括C++ 解析的讨论,其中指出:

“C++ 语法是不明确的,
依赖于上下文并且可能
需要无限的前瞻来解决
一些含糊之处”。

它继续给出了一些例子(参见pdf第147页)。

例子是:

int(x), y, *const z;

mean

int x;
int y;
int *const z;

比较:

int(x), y, new int;

mean

(int(x)), (y), (new int));

(逗号分隔的表达式)。

两个token序列具有相同的初始子序列但不同解析树,它依赖于最后一个元素,在消除歧义的元素之前可以有任意多个标记。

There is an interesting thread on Lambda the Ultimate that discusses the LALR grammar for C++.

It includes a link to a PhD thesis that includes a discussion of C++ parsing, which states that:

"C++ grammar is ambiguous,
context-dependent and potentially
requires infinite lookahead to resolve
some ambiguities".

It goes on to give a number of examples (see page 147 of the pdf).

The example is:

int(x), y, *const z;

meaning

int x;
int y;
int *const z;

Compare to:

int(x), y, new int;

meaning

(int(x)), (y), (new int));

(a comma-separated expression).

The two token sequences have the same initial subsequence but different parse trees, which depend on the last element. There can be arbitrarily many tokens before the disambiguating one.

一袭水袖舞倾城 2024-07-14 20:10:22

问题从来不是这样定义的,但它应该很有趣:

为了让“非上下文无关”yacc 解析器可以完美地解析这个新语法,对 C++ 语法进行的最小修改集是什么? (仅使用一种“黑客”:类型名/标识符消歧,解析器通知词法分析器每个 typedef/类/结构)

我看到一些:

  1. Type Type; 被禁止。 声明为类型名的标识符不能成为非类型名标识符(请注意,struct Type Type 没有歧义,仍然可以被允许)。

    有 3 种类型的名称标记

    • types :内置类型或由于 typedef/class/struct
    • 模板函数
    • 标识符:函数/方法和变量/对象

    将模板函数视为不同的标记可以解决 func< 歧义。 如果 func 是模板函数名称,则 < 必须是模板参数列表的开头,否则 func 是函数指针并且< 是比较运算符。

  2. Type a(2); 是一个对象实例化。
    Type a();Type a(int) 是函数原型。

  3. int (k);是完全禁止的,应该写成int k;

  4. typedef int func_type();
    typedef int (func_type)(); 被禁止。

    函数 typedef 必须是函数指针 typedef : typedef int (*func_ptr_type)();

  5. 模板递归限制为 1024,否则可以将增加的最大值作为选项传递给编译器。

  6. int a,b,c[9],*d,(*f)(), (*g)()[9], h(char); 也可能被禁止,替换为int a,b,c[9],*d;
    int (*f)();

    int (*g)()[9];

    int h(char);

    每个函数原型或函数指针声明一行。

    一个高度首选的替代方案是更改糟糕的函数指针语法,

    int (MyClass::*MethodPtr)(char*);

    重新语法为:

    int (MyClass::*)(char*) MethodPtr;

    这与强制转换运算符一致 (int (MyClass::*)(char*))

  7. typedef int type, *type_ptr; 一致code> 也可以被禁止:每个 typedef 一行。 这样就变成了

    typedef int 类型;

    typedef int *type_ptr;

  8. sizeof int, sizeof char, sizeof long long 和 co。 可以在每个源文件中声明。
    因此,每个使用 int 类型的源文件应以

    开头

    #type int :signed_integer(4)

    unsigned_integer(4) 在该 #type 指令之外将被禁止
    这将是向许多 C++ 头文件中存在的愚蠢的 sizeof int 歧义迈出的一大步

如果遇到使用不明确语法的 C++ 源代码,实现重新语法的 C++ 的编译器将移动 source。 cpp 也是一个 ambiguously_syntax 文件夹,并且会在编译之前自动创建一个明确的翻译 source.cpp

如果您知道一些不明确的 C++ 语法,请添加!

The problem is never defined like this, whereas it should be interesting :

what is the smallest set of modifications to C++ grammar that would be necessary so that this new grammar could be perfectly parsed by a "non-context-free" yacc parser ? (making use only of one 'hack' : the typename/identifier disambiguation, the parser informing the lexer of every typedef/class/struct)

I see a few ones:

  1. Type Type; is forbidden. An identifier declared as a typename cannot become a non-typename identifier (note that struct Type Type is not ambiguous and could be still allowed).

    There are 3 types of names tokens :

    • types : builtin-type or because of a typedef/class/struct
    • template-functions
    • identifiers : functions/methods and variables/objects

    Considering template-functions as different tokens solves the func< ambiguity. If func is a template-function name, then < must be the beginning of a template parameter list, otherwise func is a function pointer and < is the comparison operator.

  2. Type a(2); is an object instantiation.
    Type a(); and Type a(int) are function prototypes.

  3. int (k); is completely forbidden, should be written int k;

  4. typedef int func_type(); and
    typedef int (func_type)(); are forbidden.

    A function typedef must be a function pointer typedef : typedef int (*func_ptr_type)();

  5. template recursion is limited to 1024, otherwise an increased maximum could be passed as an option to the compiler.

  6. int a,b,c[9],*d,(*f)(), (*g)()[9], h(char); could be forbidden too, replaced by int a,b,c[9],*d;
    int (*f)();

    int (*g)()[9];

    int h(char);

    one line per function prototype or function pointer declaration.

    An highly preferred alternative would be to change the awful function pointer syntax,

    int (MyClass::*MethodPtr)(char*);

    being resyntaxed as:

    int (MyClass::*)(char*) MethodPtr;

    this being coherent with the cast operator (int (MyClass::*)(char*))

  7. typedef int type, *type_ptr; could be forbidden too : one line per typedef. Thus it would become

    typedef int type;

    typedef int *type_ptr;

  8. sizeof int, sizeof char, sizeof long long and co. could be declared in each source file.
    Thus, each source file making use of the type int should begin with

    #type int : signed_integer(4)

    and unsigned_integer(4) would be forbidden outside of that #type directive
    this would be a big step into the stupid sizeof int ambiguity present in so many C++ headers

The compiler implementing the resyntaxed C++ would, if encountering a C++ source making use of ambiguous syntax, move source.cpp too an ambiguous_syntax folder, and would create automatically an unambiguous translated source.cpp before compiling it.

Please add your ambiguous C++ syntaxes if you know some!

时光礼记 2024-07-14 20:10:22

正如您在我的此处的答案中看到的,C++ 包含无法由 LL 或 LR 解析器确定性解析的语法,因为类型解析阶段(通常是解析后)改变了操作的顺序,因此改变了 AST 的基本形状(通常是解析后)预计由第一阶段解析提供)。

As you can see in my answer here, C++ contains syntax that cannot be deterministically parsed by an LL or LR parser due to the type resolution stage (typically post-parsing) changing the order of operations, and therefore the fundamental shape of the AST (typically expected to be provided by a first-stage parse).

烦人精 2024-07-14 20:10:22

我认为你已经非常接近答案了。

LR(1) 意味着从左到右解析只需要一个标记来向前查找上下文,而 LR(∞) 意味着无限向前查找。 也就是说,解析器必须知道即将发生的所有事情才能弄清楚它现在在哪里。

I think you are pretty close to the answer.

LR(1) means that parsing from left to right needs only one token to look-ahead for the context, whereas LR(∞) means an infinite look-ahead. That is, the parser would have to know everything that was coming in order to figure out where it is now.

断肠人 2024-07-14 20:10:22

C++ 中的“typedef”问题可以使用 LALR(1) 解析器进行解析,该解析器在解析时构建符号表(不是纯 LALR 解析器)。 “模板”问题可能无法用这种方法解决。 这种 LALR(1) 解析器的优点是语法(如下所示)是 LALR(1) 语法(没有歧义)。

/* C Typedef Solution. */

/* Terminal Declarations. */

   <identifier> => lookup();  /* Symbol table lookup. */

/* Rules. */

   Goal        -> [Declaration]... <eof>               +> goal_

   Declaration -> Type... VarList ';'                  +> decl_
               -> typedef Type... TypeVarList ';'      +> typedecl_

   VarList     -> Var /','...     
   TypeVarList -> TypeVar /','...

   Var         -> [Ptr]... Identifier 
   TypeVar     -> [Ptr]... TypeIdentifier                               

   Identifier     -> <identifier>       +> identifier_(1)      
   TypeIdentifier -> <identifier>      =+> typedefidentifier_(1,{typedef})

// The above line will assign {typedef} to the <identifier>,  
// because {typedef} is the second argument of the action typeidentifier_(). 
// This handles the context-sensitive feature of the C++ language.

   Ptr          -> '*'                  +> ptr_

   Type         -> char                 +> type_(1)
                -> int                  +> type_(1)
                -> short                +> type_(1)
                -> unsigned             +> type_(1)
                -> {typedef}            +> type_(1)

/* End Of Grammar. */

可以毫无问题地解析以下输入:

 typedef int x;
 x * y;

 typedef unsigned int uint, *uintptr;
 uint    a, b, c;
 uintptr p, q, r;

LRSTAR 解析器生成器 读取上述语法符号并生成一个处理“的解析器” typedef”问题在解析树或 AST 中没有歧义。 (披露:我是 LRSTAR 的创建者。)

The "typedef" problem in C++ can be parsed with an LALR(1) parser that builds a symbol table while parsing (not a pure LALR parser). The "template" problem probably cannot be solved with this method. The advantage of this kind of LALR(1) parser is that the grammar (shown below) is an LALR(1) grammar (no ambiguity).

/* C Typedef Solution. */

/* Terminal Declarations. */

   <identifier> => lookup();  /* Symbol table lookup. */

/* Rules. */

   Goal        -> [Declaration]... <eof>               +> goal_

   Declaration -> Type... VarList ';'                  +> decl_
               -> typedef Type... TypeVarList ';'      +> typedecl_

   VarList     -> Var /','...     
   TypeVarList -> TypeVar /','...

   Var         -> [Ptr]... Identifier 
   TypeVar     -> [Ptr]... TypeIdentifier                               

   Identifier     -> <identifier>       +> identifier_(1)      
   TypeIdentifier -> <identifier>      =+> typedefidentifier_(1,{typedef})

// The above line will assign {typedef} to the <identifier>,  
// because {typedef} is the second argument of the action typeidentifier_(). 
// This handles the context-sensitive feature of the C++ language.

   Ptr          -> '*'                  +> ptr_

   Type         -> char                 +> type_(1)
                -> int                  +> type_(1)
                -> short                +> type_(1)
                -> unsigned             +> type_(1)
                -> {typedef}            +> type_(1)

/* End Of Grammar. */

The following input can be parsed without a problem:

 typedef int x;
 x * y;

 typedef unsigned int uint, *uintptr;
 uint    a, b, c;
 uintptr p, q, r;

The LRSTAR parser generator reads the above grammar notation and generates a parser that handles the "typedef" problem without ambiguity in the parse tree or AST. (Disclosure: I am the guy who created LRSTAR.)

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文