ANTLR，表达式语法问题

发布于 2024-10-21 10:36:52 字数 2949 浏览 0 评论 0原文

我最近开始使用 ANTLR。我目前正在尝试使用 +、-、* 和 array[index] 对表达式语法进行编码还有一些构造。

这是所需的语法：

Exp -> Exp (+ | - | * | < | &&) Exp
     | Exp [ Exp ]
     | -Exp
     | ( Exp )
     | Exp.length
     | true
     | false
     | Id
     | this
     | ! Exp

我首先将其重构为 AndExp、SumExp、ProdExp 等来解析优先级。大致是这样的：

Exp        -> AndExp
AndExp     -> CmpExp (&& CmpExp)*
CmpExp     -> SumExp (< SumExp)*
SumExp     -> ProdExp ((+|-) ProdExp)*
ProdExp    -> UnaryExp (Times UnaryExp)*
UnaryExp   -> Minus* PrimaryExp
PrimaryExp -> Exp.length
            | Exp [ Exp ]
            | ! Exp
            | true
            | false
            | Id
            | this

然后我意识到这使用了左递归，而 ANTLR 不喜欢这样。我继续消除左递归，最后我得到了这个语法：

grammar test;

options {
    language=Java;
    output=AST;
    backtrack=true;
}

start      : expression;

expression : andExp;
andExp     : cmpExp (And^ cmpExp)*;
cmpExp     : sumExp (LessThan^ sumExp)*;
sumExp     : prodExp ((Plus | Minus)^ prodExp)*;
prodExp    : unaryExp (Times^ unaryExp)*;
unaryExp   : Minus* primaryExp;

primaryExp : INT                   primaryPrime
           | 'true'                primaryPrime
           | 'false'               primaryPrime
           | 'this'                primaryPrime
           | ID                    primaryPrime
           | '!' expression        primaryPrime
           | '('! expression ')'!  primaryPrime
           ;


primaryPrime
           : '[' expression ']'             primaryPrime
           | '.' ID '(' exprList ')'        primaryPrime
           | '.length'                      primaryPrime
           | 'new' 'int' '[' expression ']' primaryPrime
           | 'new' ID '(' ')'               primaryPrime
           |
           ;


exprList   :
           | expression (',' expression)*;

LessThan   : '<';
Plus       : '+';
Minus      : '-';
Times      : '*';
And        : '&&';
Not        : '!';
INT        :    '0' | ('1'..'9')('0'..'9')*;
ID         :    ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*;
WS         : ('\t' | ' ' | '\r' | '\n'| '\u000C') { $channel=HIDDEN; } ;

Q1：对于这种类型的语法来说，回溯是“必需的”（除非我激活它，否则我无法通过 ANTLR 获得它）还是我错过了一些简单的东西？
Q2：当添加几个^和->时^(TOKEN ...) 构造来刷新树，我遇到了以下恼人的情况（因为 primaryPrime 是由于左因数分解引起的）：
```
primaryPrime
    : '['表达式']'primaryPrime -> ^(ARRAY_EXPR 表达式)
//...
```
这会将数组[索引]变成
<前><代码>数组 ARRAY_EXPR 数组指数
虽然我真的很想要
<前><代码>ARRAY_EXPR 大批指数
解决这个问题的最佳方法是什么？我走在正确的轨道上吗？还是应该采用其他方法？

原文

I've recently started using ANTLR. I'm currently trying to encode an expression grammar with +, -, * and array[index] and a few more constructs.

This is the desired grammar:

Exp -> Exp (+ | - | * | < | &&) Exp
     | Exp [ Exp ]
     | -Exp
     | ( Exp )
     | Exp.length
     | true
     | false
     | Id
     | this
     | ! Exp

I first refactored this into AndExp, SumExp, ProdExp and so on to resolve precedence. Roughly like this:

Exp        -> AndExp
AndExp     -> CmpExp (&& CmpExp)*
CmpExp     -> SumExp (< SumExp)*
SumExp     -> ProdExp ((+|-) ProdExp)*
ProdExp    -> UnaryExp (Times UnaryExp)*
UnaryExp   -> Minus* PrimaryExp
PrimaryExp -> Exp.length
            | Exp [ Exp ]
            | ! Exp
            | true
            | false
            | Id
            | this

I then realized that this uses left-recursion, and that ANTLR doesn't like that. I went on to eliminate the left-recursion and I ended up with this grammar:

grammar test;

options {
    language=Java;
    output=AST;
    backtrack=true;
}

start      : expression;

expression : andExp;
andExp     : cmpExp (And^ cmpExp)*;
cmpExp     : sumExp (LessThan^ sumExp)*;
sumExp     : prodExp ((Plus | Minus)^ prodExp)*;
prodExp    : unaryExp (Times^ unaryExp)*;
unaryExp   : Minus* primaryExp;

primaryExp : INT                   primaryPrime
           | 'true'                primaryPrime
           | 'false'               primaryPrime
           | 'this'                primaryPrime
           | ID                    primaryPrime
           | '!' expression        primaryPrime
           | '('! expression ')'!  primaryPrime
           ;


primaryPrime
           : '[' expression ']'             primaryPrime
           | '.' ID '(' exprList ')'        primaryPrime
           | '.length'                      primaryPrime
           | 'new' 'int' '[' expression ']' primaryPrime
           | 'new' ID '(' ')'               primaryPrime
           |
           ;


exprList   :
           | expression (',' expression)*;

LessThan   : '<';
Plus       : '+';
Minus      : '-';
Times      : '*';
And        : '&&';
Not        : '!';
INT        :    '0' | ('1'..'9')('0'..'9')*;
ID         :    ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*;
WS         : ('\t' | ' ' | '\r' | '\n'| '\u000C') { $channel=HIDDEN; } ;

Q1: Is backtracking "required" for this type of grammar (I can't get it through ANTLR unless I activate it) or am I missing something simple?
Q2: When adding a few ^ and -> ^(TOKEN ...) constructs to brush up the tree, I ran into the following annoying situation (because of the primaryPrime which is due to the left factoring):
```
primaryPrime
    : '[' expression ']' primaryPrime -> ^(ARRAY_EXPR expression)
//...
```
This turns an array[index] into
```
array
ARRAY_EXPR
    index
```
while I really want
```
ARRAY_EXPR
    array
    index
```
What is the best way to solve this? Am I on the right track here, or should I go with some other approach all together?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

安静 2024-10-28 10:36:52

A1

不，（尚）不需要回溯。但如果您确实需要一些回溯，建议不要立即使用 backtrack=true 而是在需要回溯的规则之前使用谓词。通过使用 backtrack=true，您可以对所有规则启用回溯，而可能只有一两个规则需要回溯。但是，如果您的语言相对较小，backtrack=true 比手动混合谓词更容易，并且可能不会对性能产生太大影响。但如果你能避免它们，那就去做吧。

您有几个与空字符串匹配的解析器规则，这导致了问题。通常，您最好让规则匹配某些内容，并使规则可选。所以代替：

foo : bar baz ;
bar : 'bar' ;
baz : 'baz' | /* epsilon */ ;

做

foo : bar baz? ;
bar : 'bar' ;
baz : 'baz' ;

代替。

另外，如果是 true、false 等保留关键字，请勿将它们混合在解析器规则中：始终在词法分析器规则的顶部显式定义它们。 Lexer 规则从上到下开始匹配，因此（至少）在可能匹配它们的规则（例如 ID）之前定义它们是最安全的。我通常将它们作为第一个词法分析器规则。

A2

您可以通过将参数传递给解析器规则来做到这一点，尽管这会使您的语法（有点）可读性较差。

您的语法与我的评论：

grammar test;

options {
  output=AST;
}

tokens {
  ARRAY_EXPR;
}

start      : expression;

expression : andExp;
andExp     : cmpExp (And^ cmpExp)*;
cmpExp     : sumExp (LessThan^ sumExp)*;
sumExp     : prodExp ((Plus | Minus)^ prodExp)*;
prodExp    : unaryExp (Times^ unaryExp)*;
unaryExp   :  '-' primaryExp
           |  '!' primaryExp // negation is really a `unaryExp`
           |  primaryExp
           ;

primaryExp : INT                  primaryPrime[null]?
           | 'true'               primaryPrime[null]?
           | 'false'              primaryPrime[null]?
           | 'this'               primaryPrime[null]?
           | (ID -> ID)           (primaryPrime[new CommonTree($ID)] -> primaryPrime)?
           | '('! expression ')'! primaryPrime[null]?
           ;

// removed the matching of 'epsilon'
primaryPrime [CommonTree parent]
           : '[' expression ']'             primaryPrime[null]? -> ^(ARRAY_EXPR {parent} expression primaryPrime?)
           | '.' ID '(' exprList? ')'       primaryPrime[null]?
           | '.length'                      primaryPrime[null]?
           | 'new' 'int' '[' expression ']' primaryPrime[null]?
           | 'new' ID '(' ')'               primaryPrime[null]?
           ;

// removed the matching of 'epsilon' 
exprList   : expression (',' expression)*;

// be sure to create explicit tokens for keywords!
True       : 'true';
False      : 'false';
This       : 'this';
LessThan   : '<';
Plus       : '+';
Minus      : '-';
Times      : '*';
And        : '&&';
Not        : '!';
INT        : '0' | ('1'..'9')('0'..'9')*;
ID         : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*;
WS         : ('\t' | ' ' | '\r' | '\n'| '\u000C') { $channel=HIDDEN; } ;

将把输入 "array[2*3]" 解析为以下 AST：

在此处输入图像描述

正如您通过运行以下测试类可以看到的：

import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;

public class Main {
  public static void main(String[] args) throws Exception {
    String source = "array[2*3]";
    testLexer lexer = new testLexer(new ANTLRStringStream(source));
    CommonTokenStream tokens = new CommonTokenStream(lexer);
    testParser parser = new testParser(tokens);
    testParser.start_return returnValue = parser.start();
    CommonTree tree = (CommonTree)returnValue.getTree();
    DOTTreeGenerator gen = new DOTTreeGenerator();
    StringTemplate st = gen.toDOT(tree);
    System.out.println(st);
  }
}

A1

No, backtracking is not (yet) required. But if you do need some backtracking, it's advisable to not use backtrack=true right away but use predicate before the rules that do need backtracking. By using backtrack=true, you're enabling backtracking on all of your rules, while it's probably only one or two needing backtracking. But, if your language will be relatively small, backtrack=true is easier than mixing in predicates by hand, and probably won't have a big impact on performance. But if you can avoid them, do so.

You have a couple of parser rules that match empty strings, which are causing the problems. You'd usually better let rules match something, and make the rule optional. So instead of:

foo : bar baz ;
bar : 'bar' ;
baz : 'baz' | /* epsilon */ ;

foo : bar baz? ;
bar : 'bar' ;
baz : 'baz' ;

instead.

Also, in case of reserved keywords like true, false etc., don't mix them in your parser rules: always explicitly define them at the top of your lexer rules. Lexer rules are matched starting from top to bottom, so it safest to define them (at least) before rules like ID that could possible match them as well. I usually put them as first lexer rules.

A2

You could do that by passing parameters to your parser rules, although that makes your grammar (a bit) less readable.

Your grammar with my comments:

grammar test;

options {
  output=AST;
}

tokens {
  ARRAY_EXPR;
}

start      : expression;

expression : andExp;
andExp     : cmpExp (And^ cmpExp)*;
cmpExp     : sumExp (LessThan^ sumExp)*;
sumExp     : prodExp ((Plus | Minus)^ prodExp)*;
prodExp    : unaryExp (Times^ unaryExp)*;
unaryExp   :  '-' primaryExp
           |  '!' primaryExp // negation is really a `unaryExp`
           |  primaryExp
           ;

primaryExp : INT                  primaryPrime[null]?
           | 'true'               primaryPrime[null]?
           | 'false'              primaryPrime[null]?
           | 'this'               primaryPrime[null]?
           | (ID -> ID)           (primaryPrime[new CommonTree($ID)] -> primaryPrime)?
           | '('! expression ')'! primaryPrime[null]?
           ;

// removed the matching of 'epsilon'
primaryPrime [CommonTree parent]
           : '[' expression ']'             primaryPrime[null]? -> ^(ARRAY_EXPR {parent} expression primaryPrime?)
           | '.' ID '(' exprList? ')'       primaryPrime[null]?
           | '.length'                      primaryPrime[null]?
           | 'new' 'int' '[' expression ']' primaryPrime[null]?
           | 'new' ID '(' ')'               primaryPrime[null]?
           ;

// removed the matching of 'epsilon' 
exprList   : expression (',' expression)*;

// be sure to create explicit tokens for keywords!
True       : 'true';
False      : 'false';
This       : 'this';
LessThan   : '<';
Plus       : '+';
Minus      : '-';
Times      : '*';
And        : '&&';
Not        : '!';
INT        : '0' | ('1'..'9')('0'..'9')*;
ID         : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*;
WS         : ('\t' | ' ' | '\r' | '\n'| '\u000C') { $channel=HIDDEN; } ;

will parse the input "array[2*3]" into the following AST:

enter image description here

as you can see by running the following test class:

import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;

public class Main {
  public static void main(String[] args) throws Exception {
    String source = "array[2*3]";
    testLexer lexer = new testLexer(new ANTLRStringStream(source));
    CommonTokenStream tokens = new CommonTokenStream(lexer);
    testParser parser = new testParser(tokens);
    testParser.start_return returnValue = parser.start();
    CommonTree tree = (CommonTree)returnValue.getTree();
    DOTTreeGenerator gen = new DOTTreeGenerator();
    StringTemplate st = gen.toDOT(tree);
    System.out.println(st);
  }
}

回复收藏 0 原文

~没有更多了~