ANTLR,表达式语法问题
我最近开始使用 ANTLR。我目前正在尝试使用 +
、-
、*
和 array[index]
对表达式语法进行编码还有一些构造。
这是所需的语法:
Exp -> Exp (+ | - | * | < | &&) Exp
| Exp [ Exp ]
| -Exp
| ( Exp )
| Exp.length
| true
| false
| Id
| this
| ! Exp
我首先将其重构为 AndExp
、SumExp
、ProdExp
等来解析优先级。大致是这样的:
Exp -> AndExp
AndExp -> CmpExp (&& CmpExp)*
CmpExp -> SumExp (< SumExp)*
SumExp -> ProdExp ((+|-) ProdExp)*
ProdExp -> UnaryExp (Times UnaryExp)*
UnaryExp -> Minus* PrimaryExp
PrimaryExp -> Exp.length
| Exp [ Exp ]
| ! Exp
| true
| false
| Id
| this
然后我意识到这使用了左递归,而 ANTLR 不喜欢这样。我继续消除左递归,最后我得到了这个语法:
grammar test;
options {
language=Java;
output=AST;
backtrack=true;
}
start : expression;
expression : andExp;
andExp : cmpExp (And^ cmpExp)*;
cmpExp : sumExp (LessThan^ sumExp)*;
sumExp : prodExp ((Plus | Minus)^ prodExp)*;
prodExp : unaryExp (Times^ unaryExp)*;
unaryExp : Minus* primaryExp;
primaryExp : INT primaryPrime
| 'true' primaryPrime
| 'false' primaryPrime
| 'this' primaryPrime
| ID primaryPrime
| '!' expression primaryPrime
| '('! expression ')'! primaryPrime
;
primaryPrime
: '[' expression ']' primaryPrime
| '.' ID '(' exprList ')' primaryPrime
| '.length' primaryPrime
| 'new' 'int' '[' expression ']' primaryPrime
| 'new' ID '(' ')' primaryPrime
|
;
exprList :
| expression (',' expression)*;
LessThan : '<';
Plus : '+';
Minus : '-';
Times : '*';
And : '&&';
Not : '!';
INT : '0' | ('1'..'9')('0'..'9')*;
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*;
WS : ('\t' | ' ' | '\r' | '\n'| '\u000C') { $channel=HIDDEN; } ;
Q1:对于这种类型的语法来说,回溯是“必需的”(除非我激活它,否则我无法通过 ANTLR 获得它)还是我错过了一些简单的东西?
Q2:当添加几个
^
和->时^(TOKEN ...)
构造来刷新树,我遇到了以下恼人的情况(因为primaryPrime
是由于左因数分解引起的):primaryPrime : '['表达式']'primaryPrime -> ^(ARRAY_EXPR 表达式) //...
这会将
<前><代码>数组 ARRAY_EXPR 数组 指数数组[索引]
变成虽然我真的很想要
<前><代码>ARRAY_EXPR 大批 指数解决这个问题的最佳方法是什么?我走在正确的轨道上吗?还是应该采用其他方法?
I've recently started using ANTLR. I'm currently trying to encode an expression grammar with +
, -
, *
and array[index]
and a few more constructs.
This is the desired grammar:
Exp -> Exp (+ | - | * | < | &&) Exp
| Exp [ Exp ]
| -Exp
| ( Exp )
| Exp.length
| true
| false
| Id
| this
| ! Exp
I first refactored this into AndExp
, SumExp
, ProdExp
and so on to resolve precedence. Roughly like this:
Exp -> AndExp
AndExp -> CmpExp (&& CmpExp)*
CmpExp -> SumExp (< SumExp)*
SumExp -> ProdExp ((+|-) ProdExp)*
ProdExp -> UnaryExp (Times UnaryExp)*
UnaryExp -> Minus* PrimaryExp
PrimaryExp -> Exp.length
| Exp [ Exp ]
| ! Exp
| true
| false
| Id
| this
I then realized that this uses left-recursion, and that ANTLR doesn't like that. I went on to eliminate the left-recursion and I ended up with this grammar:
grammar test;
options {
language=Java;
output=AST;
backtrack=true;
}
start : expression;
expression : andExp;
andExp : cmpExp (And^ cmpExp)*;
cmpExp : sumExp (LessThan^ sumExp)*;
sumExp : prodExp ((Plus | Minus)^ prodExp)*;
prodExp : unaryExp (Times^ unaryExp)*;
unaryExp : Minus* primaryExp;
primaryExp : INT primaryPrime
| 'true' primaryPrime
| 'false' primaryPrime
| 'this' primaryPrime
| ID primaryPrime
| '!' expression primaryPrime
| '('! expression ')'! primaryPrime
;
primaryPrime
: '[' expression ']' primaryPrime
| '.' ID '(' exprList ')' primaryPrime
| '.length' primaryPrime
| 'new' 'int' '[' expression ']' primaryPrime
| 'new' ID '(' ')' primaryPrime
|
;
exprList :
| expression (',' expression)*;
LessThan : '<';
Plus : '+';
Minus : '-';
Times : '*';
And : '&&';
Not : '!';
INT : '0' | ('1'..'9')('0'..'9')*;
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*;
WS : ('\t' | ' ' | '\r' | '\n'| '\u000C') { $channel=HIDDEN; } ;
Q1: Is backtracking "required" for this type of grammar (I can't get it through ANTLR unless I activate it) or am I missing something simple?
Q2: When adding a few
^
and-> ^(TOKEN ...)
constructs to brush up the tree, I ran into the following annoying situation (because of theprimaryPrime
which is due to the left factoring):primaryPrime : '[' expression ']' primaryPrime -> ^(ARRAY_EXPR expression) //...
This turns an
array[index]
intoarray ARRAY_EXPR index
while I really want
ARRAY_EXPR array index
What is the best way to solve this? Am I on the right track here, or should I go with some other approach all together?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
A1
不,(尚)不需要回溯。但如果您确实需要一些回溯,建议不要立即使用
backtrack=true
而是在需要回溯的规则之前使用谓词。通过使用backtrack=true
,您可以对所有规则启用回溯,而可能只有一两个规则需要回溯。但是,如果您的语言相对较小,backtrack=true
比手动混合谓词更容易,并且可能不会对性能产生太大影响。但如果你能避免它们,那就去做吧。您有几个与空字符串匹配的解析器规则,这导致了问题。通常,您最好让规则匹配某些内容,并使规则可选。所以代替:
做
代替。
另外,如果是
true
、false
等保留关键字,请勿将它们混合在解析器规则中:始终在词法分析器规则的顶部显式定义它们。 Lexer 规则从上到下开始匹配,因此(至少)在可能匹配它们的规则(例如ID
)之前定义它们是最安全的。我通常将它们作为第一个词法分析器规则。A2
您可以通过将参数传递给解析器规则来做到这一点,尽管这会使您的语法(有点)可读性较差。
您的语法与我的评论:
将把输入
"array[2*3]"
解析为以下 AST:正如您通过运行以下测试类可以看到的:
A1
No, backtracking is not (yet) required. But if you do need some backtracking, it's advisable to not use
backtrack=true
right away but use predicate before the rules that do need backtracking. By usingbacktrack=true
, you're enabling backtracking on all of your rules, while it's probably only one or two needing backtracking. But, if your language will be relatively small,backtrack=true
is easier than mixing in predicates by hand, and probably won't have a big impact on performance. But if you can avoid them, do so.You have a couple of parser rules that match empty strings, which are causing the problems. You'd usually better let rules match something, and make the rule optional. So instead of:
do
instead.
Also, in case of reserved keywords like
true
,false
etc., don't mix them in your parser rules: always explicitly define them at the top of your lexer rules. Lexer rules are matched starting from top to bottom, so it safest to define them (at least) before rules likeID
that could possible match them as well. I usually put them as first lexer rules.A2
You could do that by passing parameters to your parser rules, although that makes your grammar (a bit) less readable.
Your grammar with my comments:
will parse the input
"array[2*3]"
into the following AST:as you can see by running the following test class: