Are the C# and Java grammars LALR(x)?

Posted 2024-12-19 17:25:44

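I wonder whether the C# and Java grammars are LALR(x). If so, what is the value of x?

Edit:

After accepting the answer, I think it is better to rephrase the question this way:

Is there any LALR(x) parser that can parse the current releases of Java (version 7) or C# (version 4)? If so, what is the value of x?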

时光匆匆的小流年 2024-12-26 17:25:45

You can't ask this question without first designating a specific grammar for a language: some grammars for it may be LALR(x), and some may not.

Perhaps you mean the Java grammar as published in recent Java specifications. Do you mean for Java 7?

I'm not sure you can designate a specific grammar for C#, at least not one from Microsoft, especially for C# 4.0; I don't believe they have published a grammar.

I can tell you that I don't think C# can be LALR(x), because it has some elements which look like identifiers but can be keywords in certain contexts. This requires the lexer to know what the parser is expecting in order to decide whether an identifier-like token is a keyword or just an identifier. Thus there has to be feedback from the parser to the lexer, or the lexer has to produce both tokens and pass them to the parser to decide which it wants. LALR parsers are defined on token streams without any feedback, and where every input token has only one interpretation.
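To make the feedback idea concrete, here is a minimal Python sketch. Everything in it is hypothetical (the `FeedbackLexer` class, the `expected` parameter, the keyword set); it is not how any particular C# front end works. It only shows a lexer that cannot classify a word until the parser tells it which token kinds are acceptable at that point.

```python
import re

# Assumed set of contextual keywords, for illustration only.
CONTEXTUAL_KEYWORDS = {"enum", "var", "yield"}

# Group 1: an identifier-like word. Group 2: any other single character.
TOKEN_RE = re.compile(r"\s*(?:([A-Za-z_]\w*)|(\S))")


class FeedbackLexer:
    """Lexer that needs the parser's expectations to classify a word."""

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def next_token(self, expected):
        """Return (kind, lexeme); `expected` is the set of kinds the parser accepts now."""
        match = TOKEN_RE.match(self.text, self.pos)
        if match is None:
            return ("EOF", "")
        self.pos = match.end()
        word, punct = match.groups()
        if word is None:
            return (punct, punct)        # punctuation stands for itself
        if word in CONTEXTUAL_KEYWORDS and word in expected:
            return (word, word)          # keyword interpretation
        return ("IDENTIFIER", word)      # identifier interpretation


lexer = FeedbackLexer("enum enum ;")
print(lexer.next_token({"enum", "IDENTIFIER"}))  # ('enum', 'enum')
print(lexer.next_token({"IDENTIFIER"}))          # ('IDENTIFIER', 'enum')
```

The same characters come back as two different token kinds depending on what the caller passes in, which is exactly the dependency a plain LALR setup (one fixed token stream, one interpretation per token) does not allow.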

I don't think Java is LALR(x) either, from Java 1.5 on, when enum was introduced as a special type with its own keyword. This is because, for Java 1.5 compilers to process existing Java 1.4 programs that used enum as a variable name, enum must be treated as a keyword in some contexts and as a variable name in others. So a Java 1.5 parser has the same issues as C# does.

As a practical matter, no real languages are LALR(1) [first-edition Java may be an exception], and anybody building a real parser (especially an LALR one) has to make some kind of hack to get around this. (GCC famously parsed C++ with an LALR parser plus an awful symbol table hack for a long time, so that it could tell the difference between an identifier used as a variable and an identifier used as a typedef instance. It now has some kind of hand-implemented recursive descent parser, but I think the awful hack remains.) So I'm not sure of the value of an answer to your question.
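For readers who have not seen a symbol table hack, the rough idea can be sketched in Python as follows. This is not GCC's code; the function name `lex_with_symbol_table`, the `builtin_types` parameter, and the crude typedef detection are all assumptions made for the illustration. A real compiler would update the table from the parser and respect scopes.

```python
import re

WORD_RE = re.compile(r"[A-Za-z_]\w*|\S")
NAME_RE = re.compile(r"[A-Za-z_]\w*")


def lex_with_symbol_table(source, builtin_types=("int", "char")):
    """Tag each name as TYPE_NAME or IDENTIFIER by consulting a table of type names.

    Simplification: the table is updated right here in the lexer when a
    `typedef ... name ;` pattern goes by, instead of by parser actions.
    """
    known_types = set(builtin_types)
    tokens = []
    in_typedef = False
    last_name = None
    for word in WORD_RE.findall(source):
        if word == "typedef":
            in_typedef = True
            tokens.append(("typedef", word))
        elif word == ";":
            if in_typedef and last_name is not None:
                known_types.add(last_name)   # the declared name is a type from now on
            in_typedef = False
            last_name = None
            tokens.append((";", word))
        elif NAME_RE.fullmatch(word):
            kind = "TYPE_NAME" if word in known_types else "IDENTIFIER"
            tokens.append((kind, word))
            last_name = word
        else:
            tokens.append((word, word))      # operators and punctuation
    return tokens


for token in lex_with_symbol_table("typedef int T; T * x; y * x;"):
    print(token)
```

In the sample input, `T * x` starts with a `TYPE_NAME` token (so a parser can read it as a pointer declaration) while `y * x` starts with an `IDENTIFIER` (a multiplication). The two statements are textually the same shape; only the table lookup tells them apart.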

The C# 4.0 and Java 7 members of our family of language front ends both parse their languages with a GLR parser, extended with the feedback capability and with the ability to process two interpretations of the same token. GLR makes the question of LALR(x) moot, and the feedback and multiple interpretations let us handle many languages that would be outside of pure GLR's capability, too.

EDIT: After a bit of thought, there might be a truly ugly way to make both grammars handle their keywords-in-context. Let's use Java's enum as an example. There realistically has to be a grammar rule:

  type = 'enum' '{'  enum_members '}' ;

But we also need to allow 'enum' as an identifier. We can do that by replacing the terminal token identifier with a nonterminal:

  identifier = IDENTIFIER | 'enum' ;

and insist that IDENTIFIERs are the terminals produced by the lexer. Now at least the lexer does not have to decide how to treat enum; the parser does. But your designated grammar would have to be shaped like this in order to even have a chance of being LALR(x).
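The shape of such a grammar can be tried out with a small Python sketch. It uses a tiny hand-written parser rather than an LALR generator, so it says nothing about whether the real grammar is LALR(x); it only shows the lexer always emitting `enum` as a keyword token and the `identifier` nonterminal accepting it anyway. The `declaration` rule and the `'int'` alternative are assumptions added to make the demo self-contained.

```python
import re

TOKEN_RE = re.compile(r"[A-Za-z_]\w*|[{},;]")
KEYWORDS = {"enum", "int"}
PUNCTUATION = set("{},;")


def tokenize(source):
    """Context-free lexer: 'enum' is always the keyword token, never IDENTIFIER."""
    tokens = []
    for word in TOKEN_RE.findall(source):
        if word in KEYWORDS or word in PUNCTUATION:
            tokens.append((word, word))
        else:
            tokens.append(("IDENTIFIER", word))
    tokens.append(("EOF", ""))
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos][0]

    def take(self, kind):
        actual, lexeme = self.tokens[self.pos]
        if actual != kind:
            raise SyntaxError(f"expected {kind}, found {actual}")
        self.pos += 1
        return lexeme

    def identifier(self):
        # identifier = IDENTIFIER | 'enum' ;
        if self.peek() == "enum":
            return self.take("enum")
        return self.take("IDENTIFIER")

    def type_(self):
        # type = 'enum' '{' enum_members '}' | 'int' ;   ('int' is an assumed extra)
        if self.peek() == "int":
            return self.take("int")
        self.take("enum")
        self.take("{")
        members = [self.identifier()]
        while self.peek() == ",":
            self.take(",")
            members.append(self.identifier())
        self.take("}")
        return ("enum", members)

    def declaration(self):
        # declaration = type identifier ';' ;   (assumed rule for the demo)
        declared_type = self.type_()
        name = self.identifier()
        self.take(";")
        return (declared_type, name)


print(Parser(tokenize("int enum;")).declaration())         # ('int', 'enum')
print(Parser(tokenize("enum { A } enum;")).declaration())  # (('enum', ['A']), 'enum')
```

Both inputs use `enum` as a variable name, and the second also uses it as the type keyword, without the lexer ever consulting the parser.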

Our parsers used to do this to allow some keywords to be used sometimes as identifiers. We changed our parsing engine as described earlier, and don't do this any more.