ANTLR 实现类似Python的缩进相关语法的最简单方法是什么？

发布于 2024-12-23 00:39:10 字数 376 浏览 4 评论 0原文

我正在尝试实现类似Python的缩进相关语法。

源代码示例：

ABC QWE
  CDE EFG
  EFG CDE
    ABC 
  QWE ZXC

正如我所见，我需要的是实现两个标记 INDENT 和 DEDENT，所以我可以编写如下内容：

grammar mygrammar;
text: (ID | block)+;
block: INDENT (ID|block)+ DEDENT;
INDENT: ????;
DEDENT: ????;

有没有简单的方法可以使用 ANTLR 来实现这一点？

（如果可能的话，我更愿意使用标准 ANTLR 词法分析器。）

原文

I am trying realize python like indent-depending grammar.

Source example:

ABC QWE
  CDE EFG
  EFG CDE
    ABC 
  QWE ZXC

As i see, what i need is to realize two tokens INDENT and DEDENT, so i could write something like:

grammar mygrammar;
text: (ID | block)+;
block: INDENT (ID|block)+ DEDENT;
INDENT: ????;
DEDENT: ????;

Is there any simple way to realize this using ANTLR?

(I'd prefer, if it's possible, to use standard ANTLR lexer.)

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

夜访吸血鬼 2024-12-30 00:39:10

我不知道最简单的处理方法是什么，但以下是一个相对简单的方法。每当您在词法分析器中匹配换行符时，可以选择匹配一个或多个空格。如果换行符后面有空格，则将这些空格的长度与当前的缩进大小进行比较。如果它大于当前的缩进大小，则发出一个 Indent 标记，如果它小于当前的缩进大小，则发出一个 Dedent 标记，如果相同，则不发出什么都不做。

您还需要在文件末尾发出多个 Dedent 标记，以使每个 Indent 都有一个匹配的 Dedent 标记。

为了使其正常工作，您必须在输入源文件中添加前导和尾随换行符！

ANTRL3

一个快速演示：

grammar PyEsque;

options {
  output=AST;
}

tokens {
  BLOCK;
}

@lexer::members {

  private int previousIndents = -1;
  private int indentLevel = 0;
  java.util.Queue<Token> tokens = new java.util.LinkedList<Token>();

  @Override
  public void emit(Token t) {
    state.token = t;
    tokens.offer(t);
  }

  @Override
  public Token nextToken() {
    super.nextToken();
    return tokens.isEmpty() ? Token.EOF_TOKEN : tokens.poll();
  }

  private void jump(int ttype) {
    indentLevel += (ttype == Dedent ? -1 : 1);
    emit(new CommonToken(ttype, "level=" + indentLevel));
  }
}

parse
 : block EOF -> block
 ;

block
 : Indent block_atoms Dedent -> ^(BLOCK block_atoms)
 ;

block_atoms
 :  (Id | block)+
 ;

NewLine
 : NL SP?
   {
     int n = $SP.text == null ? 0 : $SP.text.length();
     if(n > previousIndents) {
       jump(Indent);
       previousIndents = n;
     }
     else if(n < previousIndents) {
       jump(Dedent);
       previousIndents = n;
     }
     else if(input.LA(1) == EOF) {
       while(indentLevel > 0) {
         jump(Dedent);
       }
     }
     else {
       skip();
     }
   }
 ;

Id
 : ('a'..'z' | 'A'..'Z')+
 ;

SpaceChars
 : SP {skip();}
 ;

fragment NL     : '\r'? '\n' | '\r';
fragment SP     : (' ' | '\t')+;
fragment Indent : ;
fragment Dedent : ;

您可以使用以下类测试解析器：

import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;

public class Main {
  public static void main(String[] args) throws Exception {
    PyEsqueLexer lexer = new PyEsqueLexer(new ANTLRFileStream("in.txt"));
    PyEsqueParser parser = new PyEsqueParser(new CommonTokenStream(lexer));
    CommonTree tree = (CommonTree)parser.parse().getTree();
    DOTTreeGenerator gen = new DOTTreeGenerator();
    StringTemplate st = gen.toDOT(tree);
    System.out.println(st);
  }
}

如果您现在将以下内容放入名为 in.txt 的文件中：

AAA AAAAA
  BBB BB B
  BB BBBBB BB
    CCCCCC C CC
  BB BBBBBB
    C CCC
      DDD DD D
      DDD D DDD

（请注意前导和尾随换行符！）

然后您将看到与以下 AST 相对应的输出：

在此处输入图像描述

请注意，我的演示不会连续产生足够的凹痕，比如从 ccc 到 aaa 的缩进（需要 2 个缩进标记）：

aaa
  bbb
    ccc
aaa

您需要调整 else if(n < previousIndents) { .. . } 可能会根据 n 和 previousIndents 之间的差异发出超过 1 个 dedent 标记。在我的脑海中，这可能看起来像这样：

 else if(n < previousIndents) {
   // Note: assuming indent-size is 2. Jumping from previousIndents=6 
   // to n=2 will result in emitting 2 `Dedent` tokens
   int numDedents = (previousIndents - n) / 2; 
   while(numDedents-- > 0) {
     jump(Dedent);
   }
   previousIndents = n;
 }

ANTLR4

对于 ANTLR4，执行如下操作：

grammar Python3;

tokens { INDENT, DEDENT }

@lexer::members {
  // A queue where extra tokens are pushed on (see the NEWLINE lexer rule).
  private java.util.LinkedList<Token> tokens = new java.util.LinkedList<>();
  // The stack that keeps track of the indentation level.
  private java.util.Stack<Integer> indents = new java.util.Stack<>();
  // The amount of opened braces, brackets and parenthesis.
  private int opened = 0;
  // The most recently produced token.
  private Token lastToken = null;
  @Override
  public void emit(Token t) {
    super.setToken(t);
    tokens.offer(t);
  }

  @Override
  public Token nextToken() {
    // Check if the end-of-file is ahead and there are still some DEDENTS expected.
    if (_input.LA(1) == EOF && !this.indents.isEmpty()) {
      // Remove any trailing EOF tokens from our buffer.
      for (int i = tokens.size() - 1; i >= 0; i--) {
        if (tokens.get(i).getType() == EOF) {
          tokens.remove(i);
        }
      }

      // First emit an extra line break that serves as the end of the statement.
      this.emit(commonToken(Python3Parser.NEWLINE, "\n"));

      // Now emit as much DEDENT tokens as needed.
      while (!indents.isEmpty()) {
        this.emit(createDedent());
        indents.pop();
      }

      // Put the EOF back on the token stream.
      this.emit(commonToken(Python3Parser.EOF, "<EOF>"));
    }

    Token next = super.nextToken();

    if (next.getChannel() == Token.DEFAULT_CHANNEL) {
      // Keep track of the last token on the default channel.
      this.lastToken = next;
    }

    return tokens.isEmpty() ? next : tokens.poll();
  }

  private Token createDedent() {
    CommonToken dedent = commonToken(Python3Parser.DEDENT, "");
    dedent.setLine(this.lastToken.getLine());
    return dedent;
  }

  private CommonToken commonToken(int type, String text) {
    int stop = this.getCharIndex() - 1;
    int start = text.isEmpty() ? stop : stop - text.length() + 1;
    return new CommonToken(this._tokenFactorySourcePair, type, DEFAULT_TOKEN_CHANNEL, start, stop);
  }

  // Calculates the indentation of the provided spaces, taking the
  // following rules into account:
  //
  // "Tabs are replaced (from left to right) by one to eight spaces
  //  such that the total number of characters up to and including
  //  the replacement is a multiple of eight [...]"
  //
  //  -- https://docs.python.org/3.1/reference/lexical_analysis.html#indentation
  static int getIndentationCount(String spaces) {
    int count = 0;
    for (char ch : spaces.toCharArray()) {
      switch (ch) {
        case '\t':
          count += 8 - (count % 8);
          break;
        default:
          // A normal space char.
          count++;
      }
    }

    return count;
  }

  boolean atStartOfInput() {
    return super.getCharPositionInLine() == 0 && super.getLine() == 1;
  }
}

single_input
 : NEWLINE
 | simple_stmt
 | compound_stmt NEWLINE
 ;

// more parser rules

NEWLINE
 : ( {atStartOfInput()}?   SPACES
   | ( '\r'? '\n' | '\r' ) SPACES?
   )
   {
     String newLine = getText().replaceAll("[^\r\n]+", "");
     String spaces = getText().replaceAll("[\r\n]+", "");
     int next = _input.LA(1);
     if (opened > 0 || next == '\r' || next == '\n' || next == '#') {
       // If we're inside a list or on a blank line, ignore all indents, 
       // dedents and line breaks.
       skip();
     }
     else {
       emit(commonToken(NEWLINE, newLine));
       int indent = getIndentationCount(spaces);
       int previous = indents.isEmpty() ? 0 : indents.peek();
       if (indent == previous) {
         // skip indents of the same size as the present indent-size
         skip();
       }
       else if (indent > previous) {
         indents.push(indent);
         emit(commonToken(Python3Parser.INDENT, spaces));
       }
       else {
         // Possibly emit more than 1 DEDENT token.
         while(!indents.isEmpty() && indents.peek() > indent) {
           this.emit(createDedent());
           indents.pop();
         }
       }
     }
   }
 ;

// more lexer rules

取自： https://github.com/antlr/grammars-v4/blob/master/python3/Python3.g4

I don't know what the easiest way to handle it is, but the following is a relatively easy way. Whenever you match a line break in your lexer, optionally match one or more spaces. If there are spaces after the line break, compare the length of these spaces with the current indent-size. If it's more than the current indent size, emit an Indent token, if it's less than the current indent-size, emit a Dedent token and if it's the same, don't do anything.

You'll also want to emit a number of Dedent tokens at the end of the file to let every Indent have a matching Dedent token.

For this to work properly, you must add a leading and trailing line break to your input source file!

ANTRL3

A quick demo:

grammar PyEsque;

options {
  output=AST;
}

tokens {
  BLOCK;
}

@lexer::members {

  private int previousIndents = -1;
  private int indentLevel = 0;
  java.util.Queue<Token> tokens = new java.util.LinkedList<Token>();

  @Override
  public void emit(Token t) {
    state.token = t;
    tokens.offer(t);
  }

  @Override
  public Token nextToken() {
    super.nextToken();
    return tokens.isEmpty() ? Token.EOF_TOKEN : tokens.poll();
  }

  private void jump(int ttype) {
    indentLevel += (ttype == Dedent ? -1 : 1);
    emit(new CommonToken(ttype, "level=" + indentLevel));
  }
}

parse
 : block EOF -> block
 ;

block
 : Indent block_atoms Dedent -> ^(BLOCK block_atoms)
 ;

block_atoms
 :  (Id | block)+
 ;

NewLine
 : NL SP?
   {
     int n = $SP.text == null ? 0 : $SP.text.length();
     if(n > previousIndents) {
       jump(Indent);
       previousIndents = n;
     }
     else if(n < previousIndents) {
       jump(Dedent);
       previousIndents = n;
     }
     else if(input.LA(1) == EOF) {
       while(indentLevel > 0) {
         jump(Dedent);
       }
     }
     else {
       skip();
     }
   }
 ;

Id
 : ('a'..'z' | 'A'..'Z')+
 ;

SpaceChars
 : SP {skip();}
 ;

fragment NL     : '\r'? '\n' | '\r';
fragment SP     : (' ' | '\t')+;
fragment Indent : ;
fragment Dedent : ;

You can test the parser with the class:

import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;

public class Main {
  public static void main(String[] args) throws Exception {
    PyEsqueLexer lexer = new PyEsqueLexer(new ANTLRFileStream("in.txt"));
    PyEsqueParser parser = new PyEsqueParser(new CommonTokenStream(lexer));
    CommonTree tree = (CommonTree)parser.parse().getTree();
    DOTTreeGenerator gen = new DOTTreeGenerator();
    StringTemplate st = gen.toDOT(tree);
    System.out.println(st);
  }
}

If you now put the following in a file called in.txt:

AAA AAAAA
  BBB BB B
  BB BBBBB BB
    CCCCCC C CC
  BB BBBBBB
    C CCC
      DDD DD D
      DDD D DDD

(Note the leading and trailing line breaks!)

then you'll see output that corresponds to the following AST:

enter image description here

Note that my demo wouldn't produce enough dedents in succession, like dedenting from ccc to aaa (2 dedent tokens are needed):

aaa
  bbb
    ccc
aaa

You would need to adjust the code inside else if(n < previousIndents) { ... } to possibly emit more than 1 dedent token based on the difference between n and previousIndents. Off the top of my head, that could look like this:

 else if(n < previousIndents) {
   // Note: assuming indent-size is 2. Jumping from previousIndents=6 
   // to n=2 will result in emitting 2 `Dedent` tokens
   int numDedents = (previousIndents - n) / 2; 
   while(numDedents-- > 0) {
     jump(Dedent);
   }
   previousIndents = n;
 }

ANTLR4

For ANTLR4, do something like this:

grammar Python3;

tokens { INDENT, DEDENT }

@lexer::members {
  // A queue where extra tokens are pushed on (see the NEWLINE lexer rule).
  private java.util.LinkedList<Token> tokens = new java.util.LinkedList<>();
  // The stack that keeps track of the indentation level.
  private java.util.Stack<Integer> indents = new java.util.Stack<>();
  // The amount of opened braces, brackets and parenthesis.
  private int opened = 0;
  // The most recently produced token.
  private Token lastToken = null;
  @Override
  public void emit(Token t) {
    super.setToken(t);
    tokens.offer(t);
  }

  @Override
  public Token nextToken() {
    // Check if the end-of-file is ahead and there are still some DEDENTS expected.
    if (_input.LA(1) == EOF && !this.indents.isEmpty()) {
      // Remove any trailing EOF tokens from our buffer.
      for (int i = tokens.size() - 1; i >= 0; i--) {
        if (tokens.get(i).getType() == EOF) {
          tokens.remove(i);
        }
      }

      // First emit an extra line break that serves as the end of the statement.
      this.emit(commonToken(Python3Parser.NEWLINE, "\n"));

      // Now emit as much DEDENT tokens as needed.
      while (!indents.isEmpty()) {
        this.emit(createDedent());
        indents.pop();
      }

      // Put the EOF back on the token stream.
      this.emit(commonToken(Python3Parser.EOF, "<EOF>"));
    }

    Token next = super.nextToken();

    if (next.getChannel() == Token.DEFAULT_CHANNEL) {
      // Keep track of the last token on the default channel.
      this.lastToken = next;
    }

    return tokens.isEmpty() ? next : tokens.poll();
  }

  private Token createDedent() {
    CommonToken dedent = commonToken(Python3Parser.DEDENT, "");
    dedent.setLine(this.lastToken.getLine());
    return dedent;
  }

  private CommonToken commonToken(int type, String text) {
    int stop = this.getCharIndex() - 1;
    int start = text.isEmpty() ? stop : stop - text.length() + 1;
    return new CommonToken(this._tokenFactorySourcePair, type, DEFAULT_TOKEN_CHANNEL, start, stop);
  }

  // Calculates the indentation of the provided spaces, taking the
  // following rules into account:
  //
  // "Tabs are replaced (from left to right) by one to eight spaces
  //  such that the total number of characters up to and including
  //  the replacement is a multiple of eight [...]"
  //
  //  -- https://docs.python.org/3.1/reference/lexical_analysis.html#indentation
  static int getIndentationCount(String spaces) {
    int count = 0;
    for (char ch : spaces.toCharArray()) {
      switch (ch) {
        case '\t':
          count += 8 - (count % 8);
          break;
        default:
          // A normal space char.
          count++;
      }
    }

    return count;
  }

  boolean atStartOfInput() {
    return super.getCharPositionInLine() == 0 && super.getLine() == 1;
  }
}

single_input
 : NEWLINE
 | simple_stmt
 | compound_stmt NEWLINE
 ;

// more parser rules

NEWLINE
 : ( {atStartOfInput()}?   SPACES
   | ( '\r'? '\n' | '\r' ) SPACES?
   )
   {
     String newLine = getText().replaceAll("[^\r\n]+", "");
     String spaces = getText().replaceAll("[\r\n]+", "");
     int next = _input.LA(1);
     if (opened > 0 || next == '\r' || next == '\n' || next == '#') {
       // If we're inside a list or on a blank line, ignore all indents, 
       // dedents and line breaks.
       skip();
     }
     else {
       emit(commonToken(NEWLINE, newLine));
       int indent = getIndentationCount(spaces);
       int previous = indents.isEmpty() ? 0 : indents.peek();
       if (indent == previous) {
         // skip indents of the same size as the present indent-size
         skip();
       }
       else if (indent > previous) {
         indents.push(indent);
         emit(commonToken(Python3Parser.INDENT, spaces));
       }
       else {
         // Possibly emit more than 1 DEDENT token.
         while(!indents.isEmpty() && indents.peek() > indent) {
           this.emit(createDedent());
           indents.pop();
         }
       }
     }
   }
 ;

// more lexer rules

Taken from: https://github.com/antlr/grammars-v4/blob/master/python3/Python3.g4

回复收藏 0 原文

傻比既视感 2024-12-30 00:39:10

ANTLR v4 有一个开源库 antlr-denter这有助于为您解析缩进和缩进。查看其 README 了解如何使用它。

由于它是一个库，而不是复制并粘贴到语法中的代码片段，因此它的缩进处理可以与语法的其余部分分开更新。

回复收藏 0 原文

我们的影子 2024-12-30 00:39:10

有一个相对简单的方法来执行此 ANTLR，我将其作为实验编写： DentLexer.g4。该解决方案与本页提到的由 Kiers 和 Shavit 编写的其他解决方案不同。它仅通过重写 Lexer 的 nextToken() 方法与运行时集成。它通过检查标记来完成工作：(1) NEWLINE 标记触发“跟踪缩进”阶段的开始； (2) 空白和注释，都设置为通道 HIDDEN，在该阶段分别被计数和忽略； (3) 任何非HIDDEN令牌都会结束该阶段。因此，控制缩进逻辑只是设置令牌通道的简单问题。

本页提到的两种解决方案都需要 NEWLINE 标记来获取所有后续空格，但这样做无法处理中断该空格的多行注释。相反，Dent 将 NEWLINE 和空白标记分开，并且可以处理多行注释。

你的语法将像下面这样设置。请注意，NEWLINE 和 WS 词法分析器规则具有控制 pendingDent 状态并使用 indentCount 变量跟踪缩进级别的操作。

grammar MyGrammar;

tokens { INDENT, DEDENT }

@lexer::members {
    // override of nextToken(), see Dent.g4 grammar on github
    // https://github.com/wevrem/wry/blob/master/grammars/Dent.g4
}

script : ( NEWLINE | statement )* EOF ;

statement
    :   simpleStatement
    |   blockStatements
    ;

simpleStatement : LEGIT+ NEWLINE ;

blockStatements : LEGIT+ NEWLINE INDENT statement+ DEDENT ;

NEWLINE : ( '\r'? '\n' | '\r' ) {
    if (pendingDent) { setChannel(HIDDEN); }
    pendingDent = true;
    indentCount = 0;
    initialIndentToken = null;
} ;

WS : [ \t]+ {
    setChannel(HIDDEN);
    if (pendingDent) { indentCount += getText().length(); }
} ;

BlockComment : '/*' ( BlockComment | . )*? '*/' -> channel(HIDDEN) ;   // allow nesting comments
LineComment : '//' ~[\r\n]* -> channel(HIDDEN) ;

LEGIT : ~[ \t\r\n]+ ~[\r\n]*;   // Replace with your language-specific rules...

There is a relatively simple way to do this ANTLR, which I wrote as an experiment: DentLexer.g4. This solution is different from the others mentioned on this page that were written by Kiers and Shavit. It integrates with the runtime solely via an override of the Lexer's nextToken() method. It does its work by examining tokens: (1) a NEWLINE token triggers the start of a "keep track of indentation" phase; (2) whitespace and comments, both set to channel HIDDEN, are counted and ignored, respectively, during that phase; and, (3) any non-HIDDEN token ends the phase. Thus controlling the indentation logic is a simple matter of setting a token's channel.

Both of the solutions mentioned on this page require a NEWLINE token to also grab all the subsequent whitespace, but in doing so can't handle multi-line comments interrupting that whitespace. Dent, instead, keeps NEWLINE and whitespace tokens separate and can handle multi-line comments.

Your grammar would be set up something like below. Note that the NEWLINE and WS lexer rules have actions that control the pendingDent state and keep track of indentation level with the indentCount variable.

grammar MyGrammar;

tokens { INDENT, DEDENT }

@lexer::members {
    // override of nextToken(), see Dent.g4 grammar on github
    // https://github.com/wevrem/wry/blob/master/grammars/Dent.g4
}

script : ( NEWLINE | statement )* EOF ;

statement
    :   simpleStatement
    |   blockStatements
    ;

simpleStatement : LEGIT+ NEWLINE ;

blockStatements : LEGIT+ NEWLINE INDENT statement+ DEDENT ;

NEWLINE : ( '\r'? '\n' | '\r' ) {
    if (pendingDent) { setChannel(HIDDEN); }
    pendingDent = true;
    indentCount = 0;
    initialIndentToken = null;
} ;

WS : [ \t]+ {
    setChannel(HIDDEN);
    if (pendingDent) { indentCount += getText().length(); }
} ;

BlockComment : '/*' ( BlockComment | . )*? '*/' -> channel(HIDDEN) ;   // allow nesting comments
LineComment : '//' ~[\r\n]* -> channel(HIDDEN) ;

LEGIT : ~[ \t\r\n]+ ~[\r\n]*;   // Replace with your language-specific rules...

回复收藏 0 原文

烛影斜 2024-12-30 00:39:10

你看过Python ANTLR语法吗？

编辑：添加了用于创建 INDENT/DEDENT 标记的伪 Python 代码

UNKNOWN_TOKEN = 0
INDENT_TOKEN = 1
DEDENT_TOKEN = 2

# filestream has already been processed so that each character is a newline and
# every tab outside of quotations is converted to 8 spaces.
def GetIndentationTokens(filestream):
    # Stores (indentation_token, line, character_index)
    indentation_record = list()
    line = 0
    character_index = 0
    column = 0
    counting_whitespace = true
    indentations = list()
    for c in filestream:
        if IsNewLine(c):
            character_index = 0
            column = 0
            line += 1
            counting_whitespace = true
        elif c != ' ' and counting_whitespace:
            counting_whitespace = false
            if(len(indentations) == 0):
                indentation_record.append((token, line, character_index))
            else:
                while(len(indentations) > 0 and indentations[-1] != column:
                    if(column < indentations[-1]):
                        indentations.pop()
                        indentation_record.append((
                            DEDENT, line, character_index))
                    elif(column > indentations[-1]):
                        indentations.append(column)
                        indentation_record.append((
                            INDENT, line, character_index))

        if not IsNewLine(c):
            column += 1

        character_index += 1
    while(len(indentations) > 0):
        indentations.pop()
        indentation_record.append((DEDENT_TOKEN, line, character_index))
    return indentation_record

Have you looked at the Python ANTLR grammar?

Edit: Added psuedo Python code for creating INDENT/DEDENT tokens

UNKNOWN_TOKEN = 0
INDENT_TOKEN = 1
DEDENT_TOKEN = 2

# filestream has already been processed so that each character is a newline and
# every tab outside of quotations is converted to 8 spaces.
def GetIndentationTokens(filestream):
    # Stores (indentation_token, line, character_index)
    indentation_record = list()
    line = 0
    character_index = 0
    column = 0
    counting_whitespace = true
    indentations = list()
    for c in filestream:
        if IsNewLine(c):
            character_index = 0
            column = 0
            line += 1
            counting_whitespace = true
        elif c != ' ' and counting_whitespace:
            counting_whitespace = false
            if(len(indentations) == 0):
                indentation_record.append((token, line, character_index))
            else:
                while(len(indentations) > 0 and indentations[-1] != column:
                    if(column < indentations[-1]):
                        indentations.pop()
                        indentation_record.append((
                            DEDENT, line, character_index))
                    elif(column > indentations[-1]):
                        indentations.append(column)
                        indentation_record.append((
                            INDENT, line, character_index))

        if not IsNewLine(c):
            column += 1

        character_index += 1
    while(len(indentations) > 0):
        indentations.pop()
        indentation_record.append((DEDENT_TOKEN, line, character_index))
    return indentation_record

回复收藏 0 原文

~没有更多了~