Interactive ANTLR

Posted on 2024-10-19 03:19:17

I'm trying to write a simple interactive (using System.in as source) language using antlr, and I have a few problems with it. The examples I've found on the web are all using a per line cycle, e.g.:

while(readline)
  result = parse(line)
  doStuff(result)

But what if I'm writing something like Pascal/SMTP/etc., with a "the first line must look like X" requirement? I know it can be checked in doStuff, but I think logically it is part of the syntax.

Or what if a command is split into multiple lines? I can try

while(readline)
  lines.add(line)
  try
    result = parse(lines)
    lines = []
    doStuff(result)
  catch
    nop

But with this I'm also hiding real errors.

Or I could reparse all lines every time, but:

it will be slow
there are instructions I don't want to run twice

Can this be done with ANTLR, or if not, with something else?


伪装你 2024-10-26 03:19:17

Dutow wrote:
Or I could reparse all lines every time, but:
it will be slow
there are instructions I don't want to run twice
Can this be done with ANTLR, or if not, with something else?

Yes, ANTLR can do this. Perhaps not out of the box, but with a bit of custom code, it sure is possible. You also don't need to re-parse the entire token stream for it.

Let's say you want to parse a very simple language line by line, where each line is either a program declaration, a uses declaration, or a statement.

It should always start with a program declaration, followed by zero or more uses declarations followed by zero or more statements. uses declarations cannot come after statements and there can't be more than one program declaration.

For simplicity, a statement is just a simple assignment: a = 4 or b = a.

An ANTLR grammar for such a language could look like this:

grammar REPL;

parse
  :  programDeclaration EOF
  |  usesDeclaration EOF
  |  statement EOF
  ;

programDeclaration
  :  PROGRAM ID
  ;

usesDeclaration
  :  USES idList
  ;

statement
  :  ID '=' (INT | ID)
  ;

idList
  :  ID (',' ID)*
  ;

PROGRAM : 'program';
USES    : 'uses';
ID      : ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*;
INT     : '0'..'9'+;
SPACE   : (' ' | '\t' | '\r' | '\n') {skip();};

We'll need to add a couple of checks, of course. Also, by default a parser takes a token stream in its constructor, but since we're planning to trickle tokens into the parser line by line, we'll need to create a new constructor in our parser. You can add custom members to your lexer or parser classes by putting them in a @lexer::members { ... } or @parser::members { ... } section respectively. We'll also add a couple of boolean flags to keep track of whether the program declaration has already happened and whether uses declarations are still allowed. Finally, we'll add a process(String source) method which, for each new line, creates a lexer whose token stream gets fed to the parser.

All of that would look like:

@parser::members {

  boolean programDeclDone;
  boolean usesDeclAllowed;

  public REPLParser() {
    super(null);
    programDeclDone = false;
    usesDeclAllowed = true;
  }

  public void process(String source) throws Exception {
    ANTLRStringStream in = new ANTLRStringStream(source);
    REPLLexer lexer = new REPLLexer(in);
    CommonTokenStream tokens = new CommonTokenStream(lexer);
    super.setTokenStream(tokens);
    this.parse(); // the entry point of our parser
  } 
}

Now, inside our grammar, we're going to use a couple of semantic predicates ({...}?) to check that declarations are parsed in the correct order. After parsing a certain declaration or statement, we'll want to flip certain boolean flags to allow or disallow declarations from then on. The flipping of these flags is done in each rule's @after { ... } section, which gets executed (not surprisingly) after the tokens from that parser rule are matched.

Your final grammar file now looks like this (including some System.out.println calls for debugging purposes):

grammar REPL;

@parser::members {

  boolean programDeclDone;
  boolean usesDeclAllowed;

  public REPLParser() {
    super(null);
    programDeclDone = false;
    usesDeclAllowed = true;
  }

  public void process(String source) throws Exception {
    ANTLRStringStream in = new ANTLRStringStream(source);
    REPLLexer lexer = new REPLLexer(in);
    CommonTokenStream tokens = new CommonTokenStream(lexer);
    super.setTokenStream(tokens);
    this.parse();
  } 
}

parse
  :  programDeclaration EOF
  |  {programDeclDone}? (usesDeclaration | statement) EOF
  ;

programDeclaration
@after{
  programDeclDone = true;
}
  :  {!programDeclDone}? PROGRAM ID {System.out.println("\t\t\t program <- " + $ID.text);}
  ;

usesDeclaration
  :  {usesDeclAllowed}? USES idList {System.out.println("\t\t\t uses <- " + $idList.text);}
  ;

statement
@after{
  usesDeclAllowed = false; 
}
  :  left=ID '=' right=(INT | ID) {System.out.println("\t\t\t " + $left.text + " <- " + $right.text);}
  ;

idList
  :  ID (',' ID)*
  ;

PROGRAM : 'program';
USES    : 'uses';
ID      : ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*;
INT     : '0'..'9'+;
SPACE   : (' ' | '\t' | '\r' | '\n') {skip();};

It can be tested with the following class:

import org.antlr.runtime.*;
import java.util.Scanner;

public class Main {
    public static void main(String[] args) throws Exception {
        Scanner keyboard = new Scanner(System.in);
        REPLParser parser = new REPLParser();
        while(true) {
            System.out.print("\n> ");
            String input = keyboard.nextLine();
            if(input.equals("quit")) {
                break;
            }
            parser.process(input);
        }
        System.out.println("\nBye!");
    }
}

To run this test class, do the following:

# generate a lexer and parser:
java -cp antlr-3.2.jar org.antlr.Tool REPL.g

# compile all .java source files:
javac -cp antlr-3.2.jar *.java

# run the main class on Windows:
java -cp .;antlr-3.2.jar Main 
# or on Linux/Mac:
java -cp .:antlr-3.2.jar Main

As you can see, you can only declare a program once:

> program A
                         program <- A

> program B
line 1:0 rule programDeclaration failed predicate: {!programDeclDone}?

uses cannot come after statements:

> program X
                         program <- X

> uses a,b,c
                         uses <- a,b,c

> a = 666
                         a <- 666

> uses d,e
line 1:0 rule usesDeclaration failed predicate: {usesDeclAllowed}?

and you must start with a program declaration:

> uses foo
line 1:0 rule parse failed predicate: {programDeclDone}?
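
The ordering rules enforced by the two flags can also be mirrored in plain Java, which may be handy for unit-testing the logic without generating a parser. This is only a sketch: the class name DeclarationOrder and the accept method are hypothetical, and lines are classified by their first word instead of being tokenized by ANTLR.

```java
// Hypothetical stand-alone mirror of the programDeclDone / usesDeclAllowed flags.
// It does not use ANTLR; it only reproduces the ordering checks.
class DeclarationOrder {
    private boolean programDeclDone = false;
    private boolean usesDeclAllowed = true;

    /** Returns null if the line is accepted, otherwise a short reason. */
    String accept(String line) {
        String first = line.trim().split("\\s+")[0];
        if (first.equals("program")) {
            if (programDeclDone) {
                return "program already declared";
            }
            programDeclDone = true;
            return null;
        }
        if (!programDeclDone) {
            return "program declaration expected first";
        }
        if (first.equals("uses")) {
            return usesDeclAllowed ? null : "uses not allowed after a statement";
        }
        // Assumption: anything else is treated as a statement.
        usesDeclAllowed = false;
        return null;
    }

    public static void main(String[] args) {
        DeclarationOrder order = new DeclarationOrder();
        String[] lines = {"program X", "uses a,b,c", "a = 666", "uses d,e"};
        for (String line : lines) {
            String problem = order.accept(line);
            System.out.println(line + " : " + (problem == null ? "ok" : problem));
        }
    }
}
```

The flags change only when a line is accepted, which corresponds to the @after sections running only after a rule has matched.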


装纯掩盖桑 2024-10-26 03:19:17

Here's an example of how to parse input from System.in without first manually parsing it one line at a time and without making major compromises in the grammar. I'm using ANTLR 3.4. ANTLR 4 may have addressed this problem already. I'm still using ANTLR 3, though, and maybe someone else with this problem still is too.

Before getting into the solution, here are the hurdles I ran into that keep this seemingly trivial problem from being easy to solve:

The built-in ANTLR classes that derive from CharStream consume the entire stream of data up-front. Obviously an interactive mode (or any other indeterminate-length stream source) can't provide all the data.
The built-in BufferedTokenStream and derived class(es) will not end on a skipped or off-channel token. In an interactive setting, this means that the current statement can't end (and therefore can't execute) until the first token of the next statement or EOF has been consumed when using one of these classes.
The end of the statement itself may be indeterminate until the next statement begins.

Consider a simple example:

statement: 'verb' 'noun' ('and' 'noun')*
         ;
WS: //etc...

Interactively parsing a single statement (and only a single statement) isn't possible. Either the next statement has to be started (that is, hitting "verb" in the input), or the grammar has to be modified to mark the end of the statement, e.g. with a ';'.
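
To illustrate the terminator option, here is a minimal sketch that collects input lines until a ';' is seen and only then hands a complete statement on for parsing. The class StatementBuffer and its methods are hypothetical, and the sketch assumes that ';' never appears inside a string literal or comment.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: buffers interactive input until a ';' terminator arrives.
class StatementBuffer {
    private final StringBuilder pending = new StringBuilder();

    /** Feeds one input line; returns every statement completed by it. */
    List<String> feed(String line) {
        List<String> complete = new ArrayList<>();
        pending.append(line).append('\n');
        int end;
        while ((end = pending.indexOf(";")) >= 0) {
            complete.add(pending.substring(0, end).trim());
            pending.delete(0, end + 1); // drop the statement and its terminator
        }
        return complete;
    }

    /** True if some text is waiting for its terminator. */
    boolean hasPartialStatement() {
        return pending.toString().trim().length() > 0;
    }

    public static void main(String[] args) {
        StatementBuffer buffer = new StatementBuffer();
        System.out.println(buffer.feed("verb noun"));      // nothing complete yet
        System.out.println(buffer.feed("and noun; verb")); // first statement is complete
        System.out.println(buffer.hasPartialStatement());  // "verb" is still pending
    }
}
```

Because the parser would only ever see complete statements, a syntax error reported for one of them is a real error and not a sign of missing input.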

I haven't found a way to manage a multi-channel lexer with my solution. It doesn't hurt me since I can replace my $channel = HIDDEN with skip(), but it's still a limitation worth mentioning.
A grammar may need a new rule to simplify interactive parsing.

For example, my grammar's normal entry point is this rule:

script    
    : statement* EOF -> ^(STMTS statement*) 
    ;

My interactive session can't start at the script rule because it won't end until EOF. But it can't start at statement either because STMTS might be used by my tree parser.

So I introduced the following rule specifically for an interactive session:

interactive
    : statement -> ^(STMTS statement)
    ;

In my case, there are no "first line" rules, so I can't say how easy or hard it would be to do something similar for them. It may be a matter of making a rule like the following and executing it at the beginning of the interactive session:

interactive_start
    : first_line
    ;

The code behind a grammar (e.g., code that tracks symbols) may have been written under the assumption that the lifespan of the input and the lifespan of the parser object would effectively be the same. For my solution, that assumption doesn't hold. The parser gets replaced after each statement, so the new parser must be able to pick up the symbol tracking (or whatever) where the last one left off. This is a typical separation-of-concerns problem so I don't think there's much else to say about it.
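
One way to keep such state alive is to hold it in an object that is created once per session and passed to every new parser. The sketch below uses stand-in classes only: SessionSymbols and OneShotParser are hypothetical names, and OneShotParser imitates a generated parser by handling simple "name = value" assignments by hand.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical session state that outlives any single parser instance.
class SessionSymbols {
    private final Map<String, Integer> values = new HashMap<>();

    void define(String name, int value) {
        values.put(name, value);
    }

    Integer lookup(String name) {
        return values.get(name);
    }
}

// Stand-in for a generated parser; a real one would also receive the token stream.
class OneShotParser {
    private final SessionSymbols symbols;

    OneShotParser(SessionSymbols symbols) {
        this.symbols = symbols;
    }

    /** Handles "name = int" or "name = otherName"; returns the assigned value. */
    int assign(String statement) {
        String[] parts = statement.split("=");
        String target = parts[0].trim();
        String source = parts[1].trim();
        Integer value = source.matches("\\d+")
                ? Integer.valueOf(source)
                : symbols.lookup(source);
        if (value == null) {
            throw new IllegalArgumentException("undefined: " + source);
        }
        symbols.define(target, value);
        return value;
    }
}

class SessionDemo {
    public static void main(String[] args) {
        SessionSymbols symbols = new SessionSymbols(); // created once per session
        for (String line : new String[] {"a = 4", "b = a"}) {
            OneShotParser parser = new OneShotParser(symbols); // fresh parser per statement
            System.out.println(line + " -> " + parser.assign(line));
        }
    }
}
```

The second parser can resolve a only because the symbol table belongs to the session, not to the parser that first saw the assignment.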

The first problem mentioned, the limitations of the built-in CharStream classes, was my only major hang-up. ANTLRStringStream has all the workings that I need, so I derived my own CharStream class off of it. The base class's data member is assumed to have all the past characters read, so I needed to override all the methods that access it. Then I changed the direct read to a call to (new method) dataAt to manage reading from the stream. That's basically all there is to this. Please note that the code here may have unnoticed problems and does no real error handling.

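import java.io.IOException;
import java.io.InputStream;
import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.CharStream;

public class MyInputStream extends ANTLRStringStream {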
    private InputStream in;

    public MyInputStream(InputStream in) {
        super(new char[0], 0);
        this.in = in;
    }

    @Override
    // copied almost verbatim from ANTLRStringStream
    public void consume() {
        if (p < n) {
            charPositionInLine++;
            if (dataAt(p) == '\n') {
                line++;
                charPositionInLine = 0;
            }
            p++;
        }
    }

    @Override
    // copied almost verbatim from ANTLRStringStream
    public int LA(int i) {
        if (i == 0) {
            return 0; // undefined
        }
        if (i < 0) {
            i++; // e.g., translate LA(-1) to use offset i=0; then data[p+0-1]
            if ((p + i - 1) < 0) {
                return CharStream.EOF; // invalid; no char before first char
            }
        }

        // Read ahead
        return dataAt(p + i - 1);
    }

    @Override
    public String substring(int start, int stop) {
        if (stop >= n) {
            //Read ahead.
            dataAt(stop);
        }
        return new String(data, start, stop - start + 1);
    }

    private int dataAt(int i) {
        ensureRead(i);

        if (i < n) {
            return data[i];
        } else {
            // Nothing to read at that point.
            return CharStream.EOF;
        }
    }

    private void ensureRead(int i) {
        if (i < n) {
            // The data has been read.
            return;
        }

        int distance = i - n + 1;

        ensureCapacity(n + distance);

        // Crude way to copy from the byte stream into the char array.
        for (int pos = 0; pos < distance; ++pos) {
            int read;
            try {
                read = in.read();
            } catch (IOException e) {
                // TODO handle this better.
                throw new RuntimeException(e);
            }

            if (read < 0) {
                break;
            } else {
                data[n++] = (char) read;
            }
        }
    }

    private void ensureCapacity(int capacity) {
        if (capacity > n) {
            char[] newData = new char[capacity];
            System.arraycopy(data, 0, newData, 0, n);
            data = newData;
        }
    }
}
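
The read-on-demand idea in dataAt and ensureRead can be shown in isolation, without the ANTLR base class. The following sketch is an assumption-laden variation, not part of the solution above: LazyCharBuffer is a hypothetical name, and it reads from a java.io.Reader (for example new InputStreamReader(System.in, StandardCharsets.UTF_8)) so that bytes are decoded into characters instead of being cast one by one.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical stand-alone buffer that reads characters only when they are asked for.
class LazyCharBuffer {
    static final int EOF = -1;

    private final Reader in;
    private char[] data = new char[16];
    private int n = 0; // number of characters read so far

    LazyCharBuffer(Reader in) {
        this.in = in;
    }

    /** Returns the character at index i, reading ahead only as far as needed. */
    int charAt(int i) {
        while (i >= n) {
            int c;
            try {
                c = in.read(); // blocks until a character is available or the stream ends
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            if (c < 0) {
                return EOF; // nothing to read at that position
            }
            if (n == data.length) {
                data = Arrays.copyOf(data, n * 2); // grow geometrically
            }
            data[n++] = (char) c;
        }
        return data[i];
    }

    int charsRead() {
        return n;
    }

    public static void main(String[] args) {
        LazyCharBuffer buffer = new LazyCharBuffer(new StringReader("verb noun"));
        System.out.println((char) buffer.charAt(0));
        System.out.println(buffer.charsRead());
    }
}
```

Doubling the array avoids reallocating on every read. Whether the wrapped Reader itself reads ahead from the underlying stream depends on the Reader used, so that part would need checking in a real interactive setup.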

Launching an interactive session is similar to the boilerplate parsing code, except that UnbufferedTokenStream is used and the parsing takes place in a loop:

    MyLexer lex = new MyLexer(new MyInputStream(System.in));
    TokenStream tokens = new UnbufferedTokenStream(lex);

    //Handle "first line" parser rule(s) here.

    while (true) {
        MyParser parser = new MyParser(tokens);
        //Set up the parser here.

        MyParser.interactive_return r = parser.interactive();

        //Do something with the return value.
        //Break on some meaningful condition.
    }

Still with me? Okay, well that's it. :)
