ANTLR 空白问题(不是典型问题)
考虑这个简短的 SmallC 程序:
#include "lib"
main() {
int bob;
}
如果我在 ANTLWorks 中以及使用解释器时指定行结尾 -> ,我的 ANTLR 语法就可以很好地识别它。 “麦克(CR)”。如果我将行结束选项设置为 Unix (LF),则语法会抛出 NoViableAltException,并且在包含语句结束后不会识别任何内容。如果我在包含末尾添加换行符,此错误就会消失。我使用的计算机是 Mac,因此我认为将行结尾设置为 Mac 格式是有意义的。因此,我转而使用 Linux 机器 - 并得到了同样的结果。如果我在 ANTLRWorks 解释器框中键入任何内容,并且不选择行结尾 Mac (CR),则会遇到有关空白行不足的问题,如上面的情况,此外,每个语句块的最后一个语句需要一个分号后面的额外空格(即 bob 之后;上面)。
当我在要解析的代码输入文件上运行 Java 版本的语法时,这些错误再次出现...
可能是什么问题?我会理解问题是否存在太多新行,其格式可能是解析器不理解/没有被我的空白规则捕获。但在这种情况下,这是一个缺乏新线路的问题。
我的空白声明如下:
WS : ( '\t' | ' ' | '\r' | '\n' )+ { $channel = HIDDEN; } ;
或者,这可能是由于歧义问题造成的吗?
这是完整的语法文件(请随意忽略前几个块,它们会覆盖 ANTLR 的默认错误处理机制:
grammar SmallC;
options {
output = AST ; // Set output mode to AST
}
tokens {
DIV = '/' ;
MINUS = '-' ;
MOD = '%' ;
MULT = '*' ;
PLUS = '+' ;
RETURN = 'return' ;
WHILE = 'while' ;
// The following are empty tokens used in AST generation
ARGS ;
CHAR ;
DECLS ;
ELSE ;
EXPR ;
IF ;
INT ;
INCLUDES ;
MAIN ;
PROCEDURES ;
PROGRAM ;
RETURNTYPE ;
STMTS ;
TYPEIDENT ;
}
@members {
// Force error throwing, and make sure we don't try to recover from invalid input.
// The exceptions are handled in the FrontEnd class, and gracefully end the
// compilation routine after displaying an error message.
protected void mismatch(IntStream input, int ttype, BitSet follow) throws RecognitionException {
throw new MismatchedTokenException(ttype, input);
}
public Object recoverFromMismatchedSet(IntStream input, RecognitionException e, BitSet follow)throws RecognitionException {
throw e;
}
protected Object recoverFromMismatchedToken(IntStream input, int ttype, BitSet follow) throws RecognitionException {
throw new MissingTokenException(ttype, input, null);
}
// We override getErrorMessage() to include information about the specific
// grammar rule in which the error happened, using a stack of nested rules.
Stack paraphrases = new Stack();
public String getErrorMessage(RecognitionException e, String[] tokenNames) {
String msg = super.getErrorMessage(e, tokenNames);
if ( paraphrases.size()>0 ) {
String paraphrase = (String)paraphrases.peek();
msg = msg+" "+paraphrase;
}
return msg;
}
// We override displayRecognitionError() to specify a clearer error message,
// and to include the error type (ie. class of the exception that was thrown)
// for the user's reference. The idea here is to come as close as possible
// to Java's exception output.
public void displayRecognitionError(String[] tokenNames, RecognitionException e)
{
String exType;
String hdr;
if (e instanceof UnwantedTokenException) {
exType = "UnwantedTokenException";
} else if (e instanceof MissingTokenException) {
exType = "MissingTokenException";
} else if (e instanceof MismatchedTokenException) {
exType = "MismatchedTokenException";
} else if (e instanceof MismatchedTreeNodeException) {
exType = "MismatchedTreeNodeException";
} else if (e instanceof NoViableAltException) {
exType = "NoViableAltException";
} else if (e instanceof EarlyExitException) {
exType = "EarlyExitException";
} else if (e instanceof MismatchedSetException) {
exType = "MismatchedSetException";
} else if (e instanceof MismatchedNotSetException) {
exType = "MismatchedNotSetException";
} else if (e instanceof FailedPredicateException) {
exType = "FailedPredicateException";
} else {
exType = "Unknown";
}
if ( getSourceName()!=null ) {
hdr = "Exception of type " + exType + " encountered in " + getSourceName() + " at line " + e.line + ", char " + e.charPositionInLine + ": ";
} else {
hdr = "Exception of type " + exType + " encountered at line " + e.line + ", char " + e.charPositionInLine + ": ";
}
String msg = getErrorMessage(e, tokenNames);
emitErrorMessage(hdr + msg + ".");
}
}
// Force the parser not to try to guess tokens and resume on faulty input,
// but rather display the error, and throw an exception for the program
// to quit gracefully.
@rulecatch {
catch (RecognitionException e) {
reportError(e);
throw e;
}
}
/*------------------------------------------------------------------
* PARSER RULES
*
* Many of these make use of ANTLR's rewrite rules to allow us to
* specify the roots of AST sub-trees, and to allow us to do away
* with certain insignificant literals (like parantheses and commas
* in lists) and to add empty tokens to disambiguate the tree
* construction
*
* The @init and @after definitions populate the paraphrase
* stack to allow us to specify which grammar rule we are in when
* errors are found.
*------------------------------------------------------------------*/
args
@init { paraphrases.push("in these procedure arguments"); }
@after { paraphrases.pop(); }
: ( typeident ( ',' typeident )* )? -> ^( ARGS ( typeident ( typeident )* )? )? ;
body
@init { paraphrases.push("in this procedure body"); }
@after { paraphrases.pop(); }
: '{'! decls stmtlist '}'! ;
decls
@init { paraphrases.push("in these declarations"); }
@after { paraphrases.pop(); }
: ( typeident ';' )* -> ^( DECLS ( typeident )* )? ;
exp
@init { paraphrases.push("in this expression"); }
@after { paraphrases.pop(); }
: lexp ( ( '>' | '<' | '>=' | '<=' | '!=' | '==' )^ lexp )? ;
factor : '(' lexp ')'
| ( MINUS )? ( IDENT | NUMBER )
| CHARACTER
| IDENT '(' ( IDENT ( ',' IDENT )* )? ')' ;
lexp : term ( ( PLUS | MINUS )^ term )* ;
includes
@init { paraphrases.push("in the include statements"); }
@after { paraphrases.pop(); }
: ( '#include' STRING )* -> ^( INCLUDES ( STRING )* )? ;
main
@init { paraphrases.push("in the main method"); }
@after { paraphrases.pop(); }
: 'main' '(' ')' body -> ^( MAIN body ) ;
procedure
@init { paraphrases.push("in this procedure"); }
@after { paraphrases.pop(); }
: ( proc_return_char | proc_return_int )? IDENT^ '('! args ')'! body ;
procedures : ( procedure )* -> ^( PROCEDURES ( procedure)* )? ;
proc_return_char
: 'char' -> ^( RETURNTYPE CHAR ) ;
proc_return_int : 'int' -> ^( RETURNTYPE INT ) ;
// We hard-code the regex (\n)* to fix a bug whereby a program would be accepted
// if it had 0 or more than 1 new lines before EOF but not if it had exactly 1,
// and not if it had 0 new lines between components of the following rule.
program : includes decls procedures main EOF ;
stmt
@init { paraphrases.push("in this statement"); }
@after { paraphrases.pop(); }
: '{'! stmtlist '}'!
| WHILE '(' exp ')' s=stmt -> ^( WHILE ^( EXPR exp ) $s )
| 'if' '(' exp ')' s=stmt ( options {greedy=true;} : 'else' s2=stmt )? -> ^( IF ^( EXPR exp ) $s ^( ELSE $s2 )? )
| IDENT '='^ lexp ';'!
| ( 'read' | 'output' | 'readc' | 'outputc' )^ '('! IDENT ')'! ';'!
| 'print'^ '('! STRING ( options {greedy=true;} : ')'! ';'! )
| RETURN ( lexp )? ';' -> ^( RETURN ( lexp )? )
| IDENT^ '('! ( IDENT ( ','! IDENT )* )? ')'! ';'!;
stmtlist : ( stmt )* -> ^( STMTS ( stmt )* )? ;
term : factor ( ( MULT | DIV | MOD )^ factor )* ;
// We divide typeident into two grammar rules depending on whether the
// ident is of type 'char' or 'int', to allow us to implement different
// rewrite rules in each case.
typeident : typeident_char | typeident_int ;
typeident_char : 'char' s2=IDENT -> ^( CHAR $s2 ) ;
typeident_int : 'int' s2=IDENT -> ^( INT $s2 ) ;
/*------------------------------------------------------------------
* LEXER RULES
*------------------------------------------------------------------*/
// Must come before CHARACTER to avoid ambiguity ('i' matches both IDENT and CHARACTER)
IDENT : ( LCASE_ALPHA | UCASE_ALPHA | '_' ) ( LCASE_ALPHA | UCASE_ALPHA | DIGIT | '_' )* ;
CHARACTER : PRINTABLE_CHAR
| '\n' | '\t' | EOF ;
NUMBER : ( DIGIT )+ ;
STRING : '\"' ( ~( '"' | '\n' | '\r' | 't' ) )* '\"' ;
WS : ( '\t' | ' ' | '\r' | '\n' | '\u000C' )+ { $channel = HIDDEN; } ;
fragment
DIGIT : '0'..'9' ;
fragment
LCASE_ALPHA : 'a'..'z' ;
fragment
NONALPHA_CHAR : '`' | '~' | '!' | '@' | '#' | '$' | '%' | '^' | '&' | '*' | '(' | ')' | '-'
| '_' | '+' | '=' | '{' | '[' | '}' | ']' | '|' | '\\' | ';' | ':' | '\''
| '\\"' | '<' | ',' | '>' | '.' | '?' | '/' ;
fragment
PRINTABLE_CHAR : LCASE_ALPHA | UCASE_ALPHA | DIGIT | NONALPHA_CHAR ;
fragment
UCASE_ALPHA : 'A'..'Z' ;
Consider this short SmallC program:
#include "lib"
main() {
int bob;
}
My ANTLR grammar picks it up fine if I specify, in ANTLWorks and when using the Interpreter, line endings -> "Mac (CR)". If I set the line endings option to Unix (LF), the grammar throws a NoViableAltException and does not recognize anything after the end of the include statement. This error disappears if I add a newline at the end of include. The computer I'm using for this is a Mac, so I figured that it made sense to have to set the line endings to Mac format. So instead, I switch to a Linux box - and get the same thing. If I type anything in the ANTLRWorks Interpreter box, and if I don't select line endings Mac (CR), I get issues about insufficient blank lines as was the case above and, in addition, the last statement of each statement block requires an extra space following the semicolon (ie. after bob; above).
These bugs show up again when I run a Java version of my grammar on a code input file that I want to parse...
What could possibly be the issue? I'd understand if the issue was the presence of TOO many new lines, in a format that perhaps the parser didn't understand / weren't caught by my whitespace rule. But in this case, it's an issue of lacking new lines.
My white space declaration is as follows:
WS : ( '\t' | ' ' | '\r' | '\n' )+ { $channel = HIDDEN; } ;
Alternatively, could this be due to an ambiguity issue?
Here is the full grammar file (feel free to ignore the first few blocks, which override ANTLR's default error handling mechanisms:
grammar SmallC;
options {
output = AST ; // Set output mode to AST
}
tokens {
DIV = '/' ;
MINUS = '-' ;
MOD = '%' ;
MULT = '*' ;
PLUS = '+' ;
RETURN = 'return' ;
WHILE = 'while' ;
// The following are empty tokens used in AST generation
ARGS ;
CHAR ;
DECLS ;
ELSE ;
EXPR ;
IF ;
INT ;
INCLUDES ;
MAIN ;
PROCEDURES ;
PROGRAM ;
RETURNTYPE ;
STMTS ;
TYPEIDENT ;
}
@members {
// Force error throwing, and make sure we don't try to recover from invalid input.
// The exceptions are handled in the FrontEnd class, and gracefully end the
// compilation routine after displaying an error message.
protected void mismatch(IntStream input, int ttype, BitSet follow) throws RecognitionException {
throw new MismatchedTokenException(ttype, input);
}
public Object recoverFromMismatchedSet(IntStream input, RecognitionException e, BitSet follow)throws RecognitionException {
throw e;
}
protected Object recoverFromMismatchedToken(IntStream input, int ttype, BitSet follow) throws RecognitionException {
throw new MissingTokenException(ttype, input, null);
}
// We override getErrorMessage() to include information about the specific
// grammar rule in which the error happened, using a stack of nested rules.
Stack paraphrases = new Stack();
public String getErrorMessage(RecognitionException e, String[] tokenNames) {
String msg = super.getErrorMessage(e, tokenNames);
if ( paraphrases.size()>0 ) {
String paraphrase = (String)paraphrases.peek();
msg = msg+" "+paraphrase;
}
return msg;
}
// We override displayRecognitionError() to specify a clearer error message,
// and to include the error type (ie. class of the exception that was thrown)
// for the user's reference. The idea here is to come as close as possible
// to Java's exception output.
public void displayRecognitionError(String[] tokenNames, RecognitionException e)
{
String exType;
String hdr;
if (e instanceof UnwantedTokenException) {
exType = "UnwantedTokenException";
} else if (e instanceof MissingTokenException) {
exType = "MissingTokenException";
} else if (e instanceof MismatchedTokenException) {
exType = "MismatchedTokenException";
} else if (e instanceof MismatchedTreeNodeException) {
exType = "MismatchedTreeNodeException";
} else if (e instanceof NoViableAltException) {
exType = "NoViableAltException";
} else if (e instanceof EarlyExitException) {
exType = "EarlyExitException";
} else if (e instanceof MismatchedSetException) {
exType = "MismatchedSetException";
} else if (e instanceof MismatchedNotSetException) {
exType = "MismatchedNotSetException";
} else if (e instanceof FailedPredicateException) {
exType = "FailedPredicateException";
} else {
exType = "Unknown";
}
if ( getSourceName()!=null ) {
hdr = "Exception of type " + exType + " encountered in " + getSourceName() + " at line " + e.line + ", char " + e.charPositionInLine + ": ";
} else {
hdr = "Exception of type " + exType + " encountered at line " + e.line + ", char " + e.charPositionInLine + ": ";
}
String msg = getErrorMessage(e, tokenNames);
emitErrorMessage(hdr + msg + ".");
}
}
// Force the parser not to try to guess tokens and resume on faulty input,
// but rather display the error, and throw an exception for the program
// to quit gracefully.
@rulecatch {
catch (RecognitionException e) {
reportError(e);
throw e;
}
}
/*------------------------------------------------------------------
* PARSER RULES
*
* Many of these make use of ANTLR's rewrite rules to allow us to
* specify the roots of AST sub-trees, and to allow us to do away
* with certain insignificant literals (like parantheses and commas
* in lists) and to add empty tokens to disambiguate the tree
* construction
*
* The @init and @after definitions populate the paraphrase
* stack to allow us to specify which grammar rule we are in when
* errors are found.
*------------------------------------------------------------------*/
args
@init { paraphrases.push("in these procedure arguments"); }
@after { paraphrases.pop(); }
: ( typeident ( ',' typeident )* )? -> ^( ARGS ( typeident ( typeident )* )? )? ;
body
@init { paraphrases.push("in this procedure body"); }
@after { paraphrases.pop(); }
: '{'! decls stmtlist '}'! ;
decls
@init { paraphrases.push("in these declarations"); }
@after { paraphrases.pop(); }
: ( typeident ';' )* -> ^( DECLS ( typeident )* )? ;
exp
@init { paraphrases.push("in this expression"); }
@after { paraphrases.pop(); }
: lexp ( ( '>' | '<' | '>=' | '<=' | '!=' | '==' )^ lexp )? ;
factor : '(' lexp ')'
| ( MINUS )? ( IDENT | NUMBER )
| CHARACTER
| IDENT '(' ( IDENT ( ',' IDENT )* )? ')' ;
lexp : term ( ( PLUS | MINUS )^ term )* ;
includes
@init { paraphrases.push("in the include statements"); }
@after { paraphrases.pop(); }
: ( '#include' STRING )* -> ^( INCLUDES ( STRING )* )? ;
main
@init { paraphrases.push("in the main method"); }
@after { paraphrases.pop(); }
: 'main' '(' ')' body -> ^( MAIN body ) ;
procedure
@init { paraphrases.push("in this procedure"); }
@after { paraphrases.pop(); }
: ( proc_return_char | proc_return_int )? IDENT^ '('! args ')'! body ;
procedures : ( procedure )* -> ^( PROCEDURES ( procedure)* )? ;
proc_return_char
: 'char' -> ^( RETURNTYPE CHAR ) ;
proc_return_int : 'int' -> ^( RETURNTYPE INT ) ;
// We hard-code the regex (\n)* to fix a bug whereby a program would be accepted
// if it had 0 or more than 1 new lines before EOF but not if it had exactly 1,
// and not if it had 0 new lines between components of the following rule.
program : includes decls procedures main EOF ;
stmt
@init { paraphrases.push("in this statement"); }
@after { paraphrases.pop(); }
: '{'! stmtlist '}'!
| WHILE '(' exp ')' s=stmt -> ^( WHILE ^( EXPR exp ) $s )
| 'if' '(' exp ')' s=stmt ( options {greedy=true;} : 'else' s2=stmt )? -> ^( IF ^( EXPR exp ) $s ^( ELSE $s2 )? )
| IDENT '='^ lexp ';'!
| ( 'read' | 'output' | 'readc' | 'outputc' )^ '('! IDENT ')'! ';'!
| 'print'^ '('! STRING ( options {greedy=true;} : ')'! ';'! )
| RETURN ( lexp )? ';' -> ^( RETURN ( lexp )? )
| IDENT^ '('! ( IDENT ( ','! IDENT )* )? ')'! ';'!;
stmtlist : ( stmt )* -> ^( STMTS ( stmt )* )? ;
term : factor ( ( MULT | DIV | MOD )^ factor )* ;
// We divide typeident into two grammar rules depending on whether the
// ident is of type 'char' or 'int', to allow us to implement different
// rewrite rules in each case.
typeident : typeident_char | typeident_int ;
typeident_char : 'char' s2=IDENT -> ^( CHAR $s2 ) ;
typeident_int : 'int' s2=IDENT -> ^( INT $s2 ) ;
/*------------------------------------------------------------------
* LEXER RULES
*------------------------------------------------------------------*/
// Must come before CHARACTER to avoid ambiguity ('i' matches both IDENT and CHARACTER)
IDENT : ( LCASE_ALPHA | UCASE_ALPHA | '_' ) ( LCASE_ALPHA | UCASE_ALPHA | DIGIT | '_' )* ;
CHARACTER : PRINTABLE_CHAR
| '\n' | '\t' | EOF ;
NUMBER : ( DIGIT )+ ;
STRING : '\"' ( ~( '"' | '\n' | '\r' | 't' ) )* '\"' ;
WS : ( '\t' | ' ' | '\r' | '\n' | '\u000C' )+ { $channel = HIDDEN; } ;
fragment
DIGIT : '0'..'9' ;
fragment
LCASE_ALPHA : 'a'..'z' ;
fragment
NONALPHA_CHAR : '`' | '~' | '!' | '@' | '#' | '
| '%' | '^' | '&' | '*' | '(' | ')' | '-'
| '_' | '+' | '=' | '{' | '[' | '}' | ']' | '|' | '\\' | ';' | ':' | '\''
| '\\"' | '<' | ',' | '>' | '.' | '?' | '/' ;
fragment
PRINTABLE_CHAR : LCASE_ALPHA | UCASE_ALPHA | DIGIT | NONALPHA_CHAR ;
fragment
UCASE_ALPHA : 'A'..'Z' ;
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
从命令行,我确实收到警告:
但这不会阻止词法分析器/解析器的生成。
无论如何,问题是:ANTLR 的词法分析器尝试匹配它在文件中遇到的第一个词法分析器规则,如果它无法匹配所述标记,则会向下滴到下一个词法分析器规则。现在,您已在
WS
规则之前定义了CHARACTER
规则,它们都与字符\n
匹配。这就是为什么它在 Linux 下不起作用,因为\n
被标记为CHARACTER
。如果您在CHARACTER
规则之前定义WS
规则,则一切正常:运行测试类:
生成以下 AST:
没有任何错误消息。
但您应该修复语法警告,并从
CHARACTER
规则中删除\n
,因为它永远无法在CHARACTER
规则中匹配。另一件事:您在解析器规则中混合了相当多的关键字,而没有在词法分析器规则中明确定义它们。由于先来先服务的词法分析器规则,这很棘手:您不希望
'if'
被意外地标记为IDENT
。最好这样做:From the command line, I do get a warning:
but that won't stop the lexer/parser from being generated.
Anyway, the problem: ANTLR's lexer tries to match the first lexer rule it encounters in the file, and if it can't match said token, it trickles down to the next lexer rule. Now you have defined the
CHARACTER
rule before theWS
rule, which both match the character\n
. That is why it didn't work under Linux since the\n
was tokenized as aCHARACTER
. If you define theWS
rule before theCHARACTER
rule, it all works properly:Running the test class:
produces the following AST:
without any error messages.
But you should fix the grammar warning, and remove
\n
from theCHARACTER
rule since it can never be matched in theCHARACTER
rule.One other thing: you've mixed quite a few keywords inside your parser rules without defining them in your lexer rules explicitly. That is tricky because of the first-come-first-serve lexer rules: you don't want
'if'
to be accidentally being tokenized as anIDENT
. Better do it like this: