ANTLR whitespace question (and not the typical one)

Consider this short SmallC program:

#include "lib"
main() {
    int bob;
}

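My ANTLR grammar recognizes this fine if, in ANTLRWorks and when using the interpreter, I set the line endings to "Mac (CR)". If I set the line endings option to Unix (LF), the grammar throws a NoViableAltException and doesn't recognize anything after the end of the include statement. The error goes away if I add a newline at the end of the include.

The computer I'm using is a Mac, so I figured it made sense that I had to set the line endings to Mac format. So I switched to a Linux box, and got the same result. If I type anything in the ANTLRWorks interpreter box without selecting the Mac (CR) line endings, I get problems with insufficient blank lines, as in the case above. In addition, the last statement of each statement block requires an extra space following the semicolon (i.e. after bob; above).

These errors show up again when I run a Java version of my grammar on a code input file I want to parse...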



WS      :   ( '\t' | ' ' | '\r' | '\n' )+   { $channel = HIDDEN; } ;


Here is the full grammar file (feel free to ignore the first few blocks, which override ANTLR's default error handling mechanisms):

grammar SmallC;

options {
    output = AST ;  // Set output mode to AST
}

tokens {
    DIV = '/' ;
    MINUS   = '-' ;
    MOD = '%' ;
    MULT    = '*' ;
    PLUS    = '+' ;
    RETURN  = 'return' ;
    WHILE   = 'while' ;

    // The following are empty tokens used in AST generation
    ARGS ;
    CHAR ;
    DECLS ;
    ELSE ;
    EXPR ;
    IF ;
    INCLUDES ;
    INT ;
    MAIN ;
    PROCEDURES ;
    RETURNTYPE ;
    STMTS ;
}

@members { 
// Force error throwing, and make sure we don't try to recover from invalid input.
// The exceptions are handled in the FrontEnd class, and gracefully end the
// compilation routine after displaying an error message.
protected void mismatch(IntStream input, int ttype, BitSet follow) throws RecognitionException {
    throw new MismatchedTokenException(ttype, input);
}

public Object recoverFromMismatchedSet(IntStream input, RecognitionException e, BitSet follow) throws RecognitionException {
    throw e;
}

protected Object recoverFromMismatchedToken(IntStream input, int ttype, BitSet follow) throws RecognitionException {
    throw new MissingTokenException(ttype, input, null);
}

// We override getErrorMessage() to include information about the specific
// grammar rule in which the error happened, using a stack of nested rules.
Stack paraphrases = new Stack();
public String getErrorMessage(RecognitionException e, String[] tokenNames) {
    String msg = super.getErrorMessage(e, tokenNames);
    if ( paraphrases.size()>0 ) {
        String paraphrase = (String)paraphrases.peek();
        msg = msg+" "+paraphrase;
    }
    return msg;
}

// We override displayRecognitionError() to specify a clearer error message,
// and to include the error type (ie. class of the exception that was thrown)
// for the user's reference. The idea here is to come as close as possible
// to Java's exception output.
public void displayRecognitionError(String[] tokenNames, RecognitionException e) {
    String exType;
    String hdr;
    if (e instanceof UnwantedTokenException) {
        exType = "UnwantedTokenException";
    } else if (e instanceof MissingTokenException) {
        exType = "MissingTokenException";
    } else if (e instanceof MismatchedTokenException) {
        exType = "MismatchedTokenException";
    } else if (e instanceof MismatchedTreeNodeException) {
        exType = "MismatchedTreeNodeException";
    } else if (e instanceof NoViableAltException) {
        exType = "NoViableAltException";
    } else if (e instanceof EarlyExitException) {
        exType = "EarlyExitException";
    } else if (e instanceof MismatchedSetException) {
        exType = "MismatchedSetException";
    } else if (e instanceof MismatchedNotSetException) {
        exType = "MismatchedNotSetException";
    } else if (e instanceof FailedPredicateException) {
        exType = "FailedPredicateException";
    } else {
        exType = "Unknown";
    }

    if ( getSourceName()!=null ) {
        hdr = "Exception of type " + exType + " encountered in " + getSourceName() + " at line " + e.line + ", char " + e.charPositionInLine + ": ";
    } else {
        hdr = "Exception of type " + exType + " encountered at line " + e.line + ", char " + e.charPositionInLine + ": ";
    }
    String msg = getErrorMessage(e, tokenNames);
    emitErrorMessage(hdr + msg + ".");
}
}

// Force the parser not to try to guess tokens and resume on faulty input,
// but rather display the error, and throw an exception for the program
// to quit gracefully.
@rulecatch {
catch (RecognitionException e) {
    throw e;
}
}

/*
 * Many of these make use of ANTLR's rewrite rules to allow us to
 * specify the roots of AST sub-trees, and to allow us to do away
 * with certain insignificant literals (like parentheses and commas
 * in lists) and to add empty tokens to disambiguate the tree
 * construction.
 * The @init and @after definitions populate the paraphrase
 * stack to allow us to specify which grammar rule we are in when
 * errors are found.
 */

args
@init { paraphrases.push("in these procedure arguments"); }
@after { paraphrases.pop(); }
        :   ( typeident ( ',' typeident )* )?   ->  ^( ARGS ( typeident ( typeident )* )? )? ;

body
@init { paraphrases.push("in this procedure body"); }
@after { paraphrases.pop(); }
        :   '{'! decls stmtlist '}'! ;

decls
@init { paraphrases.push("in these declarations"); }
@after { paraphrases.pop(); }
        :   ( typeident ';' )*  ->  ^( DECLS ( typeident )* )? ;

exp
@init { paraphrases.push("in this expression"); }
@after { paraphrases.pop(); }
        :   lexp ( ( '>' | '<' | '>=' | '<=' | '!=' | '==' )^ lexp )? ;

factor      :   '(' lexp ')'
        |   ( MINUS )? ( IDENT | NUMBER ) 
        |   CHARACTER
        |   IDENT '(' ( IDENT ( ',' IDENT )* )? ')' ;

lexp        :   term ( ( PLUS | MINUS )^ term )* ;

includes
@init { paraphrases.push("in the include statements"); }
@after { paraphrases.pop(); }
        :   ( '#include' STRING )*  ->  ^( INCLUDES ( STRING )* )? ;

main
@init { paraphrases.push("in the main method"); }
@after { paraphrases.pop(); }
        :   'main' '(' ')' body ->  ^( MAIN body ) ;

procedure
@init { paraphrases.push("in this procedure"); }
@after { paraphrases.pop(); }
        :   ( proc_return_char | proc_return_int )? IDENT^ '('! args ')'! body ;

procedures  :   ( procedure )*  ->  ^( PROCEDURES ( procedure)* )? ;

proc_return_char
        :   'char'  ->  ^( RETURNTYPE CHAR ) ;

proc_return_int :   'int'   ->  ^( RETURNTYPE INT ) ;

// We hard-code the regex (\n)* to fix a bug whereby a program would be accepted
// if it had 0 or more than 1 new lines before EOF but not if it had exactly 1,
// and not if it had 0 new lines between components of the following rule.
program     :   includes decls procedures main EOF ;

stmt
@init { paraphrases.push("in this statement"); }
@after { paraphrases.pop(); }
        :   '{'! stmtlist '}'!
        |   WHILE '(' exp ')' s=stmt    ->  ^( WHILE ^( EXPR exp ) $s )
        |   'if' '(' exp ')' s=stmt ( options {greedy=true;} : 'else' s2=stmt )?    ->  ^( IF ^( EXPR exp ) $s ^( ELSE $s2 )? )
        |   IDENT '='^ lexp ';'! 
        |   ( 'read' | 'output' | 'readc' | 'outputc' )^ '('! IDENT ')'! ';'!
        |   'print'^ '('! STRING ( options {greedy=true;} : ')'! ';'! )
        |   RETURN ( lexp )? ';'    ->  ^( RETURN ( lexp )? ) 
        |   IDENT^ '('! ( IDENT ( ','! IDENT )* )? ')'! ';'!;

stmtlist    :   ( stmt )*   ->  ^( STMTS ( stmt )* )? ;

term        :   factor ( ( MULT | DIV | MOD )^ factor )* ;

// We divide typeident into two grammar rules depending on whether the
// ident is of type 'char' or 'int', to allow us to implement different
// rewrite rules in each case.
typeident   :   typeident_char | typeident_int ;

typeident_char  :   'char' s2=IDENT ->  ^( CHAR $s2 ) ;

typeident_int   :   'int' s2=IDENT  ->  ^( INT $s2 ) ;


// Must come before CHARACTER to avoid ambiguity ('i' matches both IDENT and CHARACTER)

CHARACTER   :   PRINTABLE_CHAR
        |   '\n' | '\t' | EOF ;

NUMBER      :   ( DIGIT )+ ;

STRING      :   '\"' ( ~( '"' | '\n' | '\r' | 't' ) )* '\"' ;

WS      :   ( '\t' | ' ' | '\r' | '\n' | '\u000C' )+    { $channel = HIDDEN; } ;

DIGIT       :   '0'..'9' ;

LCASE_ALPHA :   'a'..'z' ;

NONALPHA_CHAR   :   '`' | '~' | '!' | '@' | '#' | '$' | '%' | '^' | '&' | '*' | '(' | ')' | '-'
        |   '_' | '+' | '=' | '{' | '[' | '}' | ']' | '|' | '\\' | ';' | ':' | '\''
        |   '\\"' | '<' | ',' | '>' | '.' | '?' | '/' ; 

UCASE_ALPHA :   'A'..'Z' ;

From the command line, I do get a warning:

java -cp antlr-3.2.jar org.antlr.Tool SmallC.g 
warning(200): SmallC.g:182:37: Decision can match input such as "'else'" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input

but that won't stop the lexer/parser from being generated.

Anyway, the problem: ANTLR's lexer tries to match the first lexer rule it encounters in the file, and if it can't match said token, it trickles down to the next lexer rule. Now you have defined the CHARACTER rule before the WS rule, which both match the character \n. That is why it didn't work under Linux since the \n was tokenized as a CHARACTER. If you define the WS rule before the CHARACTER rule, it all works properly:

// other rules ...

WS
  :  ('\t' | ' ' | '\r' | '\n' | '\u000C')+ { $channel = HIDDEN; }
  ;

CHARACTER
  :  PRINTABLE_CHAR | '\n' | '\t' | EOF
  ;

// other rules ...
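
To make the effect of rule order concrete, here is a toy Java model of the "first matching rule wins" behaviour described above. It is only a sketch under stated assumptions: it is not how ANTLR's generated lexer is actually implemented, the names RuleOrderDemo, firstMatch and rules are made up for illustration, and the CHARACTER stand-in leaves out PRINTABLE_CHAR because that rule is not shown here.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy model of "first matching rule wins"; NOT ANTLR's real lexer algorithm. */
final class RuleOrderDemo {

    /** Returns the name of the first rule, in declaration order, that matches at the start of input. */
    static String firstMatch(Map<String, Pattern> rules, String input) {
        for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
            Matcher m = rule.getValue().matcher(input);
            if (m.lookingAt() && m.end() > 0) {
                return rule.getKey();
            }
        }
        return "<no viable alternative>";
    }

    /** Builds the two hypothetical rules in either order. */
    static Map<String, Pattern> rules(boolean wsFirst) {
        // Mirrors WS: tab, space, CR, LF, form feed, one or more times.
        Pattern ws = Pattern.compile("[\\t \\r\\n\\f]+");
        // Simplified stand-in for CHARACTER: only the '\n' | '\t' part.
        Pattern character = Pattern.compile("[\\n\\t]");

        Map<String, Pattern> rules = new LinkedHashMap<>();
        if (wsFirst) {
            rules.put("WS", ws);
            rules.put("CHARACTER", character);
        } else {
            rules.put("CHARACTER", character);
            rules.put("WS", ws);
        }
        return rules;
    }

    public static void main(String[] args) {
        // LF line ending, CHARACTER declared first: the newline is claimed by CHARACTER.
        System.out.println(firstMatch(rules(false), "\nmain() {"));
        // LF line ending, WS declared first: the newline is hidden as whitespace.
        System.out.println(firstMatch(rules(true), "\nmain() {"));
        // CR line ending: '\r' is not part of CHARACTER, so WS wins in either order.
        System.out.println(firstMatch(rules(false), "\rmain() {"));
    }
}
```

In this model a leading \n goes to CHARACTER when CHARACTER is declared first, while a leading \r always falls through to WS. That would be consistent with the Mac (CR) setting appearing to work while Unix (LF) fails.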

Running the test class:

import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;

public class Main {
    public static void main(String[] args) throws Exception {
        String source = 
                "#include \"lib\"\n" + 
                "main() {\n" + 
                "   int bob;\n" + 
                "}\n";
        ANTLRStringStream in = new ANTLRStringStream(source);
        SmallCLexer lexer = new SmallCLexer(in);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        SmallCParser parser = new SmallCParser(tokens);
        SmallCParser.program_return returnValue = parser.program();
        CommonTree tree = (CommonTree)returnValue.getTree();
        DOTTreeGenerator gen = new DOTTreeGenerator();
        StringTemplate st = gen.toDOT(tree);
        // Print the DOT description of the AST
        System.out.println(st);
    }
}

produces the following AST:

[image: AST for the sample program]

without any error messages.

But you should fix the grammar warning, and remove \n from the CHARACTER rule since it can never be matched in the CHARACTER rule.

One other thing: you've mixed quite a few keywords into your parser rules without defining them explicitly in your lexer rules. That is tricky because lexer rules are first-come, first-served: you don't want 'if' to be accidentally tokenized as an IDENT. Better do it like this:

IF : 'if';
IDENT : 'a'..'z' ... ; // After the `IF` rule! 
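
The same ordering concern can be sketched with a small Java tokenizer that tries rules in declaration order. This is an illustrative toy, not ANTLR's real matching algorithm: KeywordOrderDemo, tokenize and rules are hypothetical names, and the IDENT pattern ([a-z]+) is an assumption, since the full IDENT rule is elided above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy ordered-rule tokenizer; NOT ANTLR's real lexer algorithm. */
final class KeywordOrderDemo {

    /** Splits input into token names, trying the rules in declaration order at each position. */
    static List<String> tokenize(Map<String, Pattern> rules, String input) {
        List<String> names = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            if (Character.isWhitespace(input.charAt(pos))) {
                pos++; // whitespace is simply skipped in this toy
                continue;
            }
            boolean matched = false;
            for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                if (m.lookingAt() && m.end() > pos) {
                    names.add(rule.getKey());
                    pos = m.end();
                    matched = true;
                    break; // the first rule that matches wins
                }
            }
            if (!matched) {
                throw new IllegalArgumentException("no rule matches at index " + pos);
            }
        }
        return names;
    }

    /** Builds the IF and (assumed) IDENT rules in either order. */
    static Map<String, Pattern> rules(boolean keywordFirst) {
        Pattern ifKeyword = Pattern.compile("if\\b"); // the literal keyword
        Pattern ident = Pattern.compile("[a-z]+");    // assumed shape of IDENT
        Map<String, Pattern> rules = new LinkedHashMap<>();
        if (keywordFirst) {
            rules.put("IF", ifKeyword);
            rules.put("IDENT", ident);
        } else {
            rules.put("IDENT", ident);
            rules.put("IF", ifKeyword);
        }
        return rules;
    }

    public static void main(String[] args) {
        System.out.println(tokenize(rules(true), "if bob"));  // IF declared before IDENT
        System.out.println(tokenize(rules(false), "if bob")); // IDENT declared before IF
    }
}
```

With IF declared first, the toy yields an IF token followed by an IDENT. With IDENT first, both words come out as IDENT, which is the accident the explicit keyword rule is meant to prevent.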
