How do I modify token text in a CommonTokenStream with ANTLR?

Posted 2024-08-21 14:27:48

I'm trying to learn ANTLR and at the same time use it for a current project.

I've gotten to the point where I can run the lexer on a chunk of code and output it to a CommonTokenStream. This is working fine, and I've verified that the source text is being broken up into the appropriate tokens.

Now, I would like to be able to modify the text of certain tokens in this stream, and display the now modified source code.

For example I've tried:

import org.antlr.runtime.*;
import java.util.*;

public class LexerTest
{
    public static final int IDENTIFIER_TYPE = 4;

    public static void main(String[] args)
    {
        String input = "public static void main(String[] args) { int myVar = 0; }";
        CharStream cs = new ANTLRStringStream(input);

        JavaLexer lexer = new JavaLexer(cs);
        CommonTokenStream tokens = new CommonTokenStream();
        tokens.setTokenSource(lexer);

        int size = tokens.size();
        for(int i = 0; i < size; i++)
        {
            Token token = (Token) tokens.get(i);
            if(token.getType() == IDENTIFIER_TYPE)
            {
                token.setText("V");
            }
        }
        System.out.println(tokens.toString());
    }  
}

I'm trying to set the text of every identifier token to the string literal "V".

  1. Why are my changes to the token's text not reflected when I call tokens.toString()?

  2. How am I supposed to know the various token type IDs? I stepped through with my debugger and saw that the ID for the IDENTIFIER tokens was "4" (hence my constant at the top). But how would I have known that otherwise? Is there some other way of mapping token type IDs to token names?


EDIT:

One thing that is important to me is that the tokens keep their original start and end character positions. That is, I don't want them to reflect their new positions after the variable names are changed to "V". This is so I know where the tokens were in the original source text.
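On the second question above, a note in passing: when ANTLR generates a lexer it normally also writes a `.tokens` file next to it containing `Name=number` lines, and the generated lexer class usually exposes each type as an `int` constant named after the lexer rule, so the value does not have to be found with a debugger. The sketch below only parses text in that `Name=number` form into a lookup map. The class name `TokenTypeNames` and the sample content are made up for illustration; the real names and numbers depend on your grammar.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class TokenTypeNames {
    /** Parses ".tokens"-style lines (Name=number) into a type-to-name map. */
    static Map<Integer, String> parse(String tokensFileContent) {
        Map<Integer, String> names = new LinkedHashMap<>();
        for (String line : tokensFileContent.split("\\R")) {
            // Use the last '=' because literal entries such as '='=9 contain '=' themselves.
            int eq = line.lastIndexOf('=');
            if (eq <= 0) {
                continue; // blank or malformed line
            }
            String name = line.substring(0, eq).trim();
            int type = Integer.parseInt(line.substring(eq + 1).trim());
            names.putIfAbsent(type, name); // keep the first name seen for a type
        }
        return names;
    }

    public static void main(String[] args) {
        // Hypothetical content; a real file is produced by running ANTLR on the grammar.
        String sample = "Identifier=4\nIntegerLiteral=5\n'='=9\n";
        Map<Integer, String> names = parse(sample);
        System.out.println(names.get(4)); // prints Identifier
    }
}
```

In real code, comparing against the generated constant (something like `JavaLexer.Identifier`, assuming the rule has that name) is less fragile than hard-coding 4, since the numbering can change when the grammar changes.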

Comments (4)

長街聽風 2024-08-28 14:27:48

ANTLR has a way to do this in its grammar file.

Let's say you're parsing a string consisting of numbers and strings delimited by commas. A grammar would look like this:

grammar Foo;

parse
  :  value ( ',' value )* EOF
  ;

value
  :  Number
  |  String
  ;

String
  :  '"' ( ~( '"' | '\\' ) | '\\\\' | '\\"' )* '"'
  ;

Number
  :  '0'..'9'+
  ;

Space
  :  ( ' ' | '\t' ) {skip();}
  ;

This should all look familiar to you. Let's say you want to wrap square brackets around all integer values. Here's how to do that:

grammar Foo;

options {output=template; rewrite=true;} 

parse
  :  value ( ',' value )* EOF
  ;

value
  :  n=Number -> template(num={$n.text}) "[<num>]" 
  |  String
  ;

String
  :  '"' ( ~( '"' | '\\' ) | '\\\\' | '\\"' )* '"'
  ;

Number
  :  '0'..'9'+
  ;

Space
  :  ( ' ' | '\t' ) {skip();}
  ;

As you see, I've added some options at the top, and added a rewrite rule (everything after the ->) after the Number in the value parser rule.

Now to test it all, compile and run this class:

import org.antlr.runtime.*;

public class FooTest {
  public static void main(String[] args) throws Exception {
    String text = "12, \"34\", 56, \"a\\\"b\", 78";
    System.out.println("parsing: "+text);
    ANTLRStringStream in = new ANTLRStringStream(text);
    FooLexer lexer = new FooLexer(in);
    CommonTokenStream tokens = new TokenRewriteStream(lexer); // Note: a TokenRewriteStream!
    FooParser parser = new FooParser(tokens);
    parser.parse();
    System.out.println("tokens: "+tokens.toString());
  }
}

which produces:

parsing: 12, "34", 56, "a\"b", 78
tokens: [12],"34",[56],"a\"b",[78]

梦忆晨望 2024-08-28 14:27:48

In ANTLR 4 there is a new facility using parse tree listeners and TokenStreamRewriter (note the name difference) that can be used to observe or transform trees. (The replies suggesting TokenRewriteStream apply to ANTLR 3 and will not work with ANTLR 4.)

In ANTLR 4 an XXXBaseListener class is generated for you with callbacks for entering and exiting each non-terminal node in the grammar (e.g. enterClassDeclaration()).

You can use the Listener in two ways:

  1. As an observer: simply override the methods to produce arbitrary output related to the input text, e.g. override enterClassDeclaration() and output a line for each class declared in your program.

  2. As a transformer, using TokenStreamRewriter to modify the original text as it passes through. To do this you use the rewriter to modify tokens (insert, delete, replace) in the callback methods, and you use the rewriter at the end to output the modified text.

See the following files from the ANTLR 4 book for an example of how to do transformations:

https://github.com/mquinn/ANTLR4/blob/master/book_code/tour/InsertSerialIDListener.java

and

https://github.com/mquinn/ANTLR4/blob/master/book_code/tour/InsertSerialID.java

慕巷 2024-08-28 14:27:48

The other example given, changing the text in the lexer, works well if you want to replace the text globally in all situations. Often, however, you only want to replace a token's text in certain situations.

Using the TokenRewriteStream gives you the flexibility of changing the text only in certain contexts.

This can be done using a subclass of the token stream class you were using. Instead of using the CommonTokenStream class you can use the TokenRewriteStream.

So you'd have the TokenRewriteStream consume the lexer's output and then you'd run your parser.

In your grammar you'd typically do the replacement like this:

/** Convert "int foo() {...}" into "float foo();" */
function
:
{
    RefTokenWithIndex t(LT(1));  // copy the location of the token you want to replace
    engine.replace(t, "float");
}
type id:ID LPAREN (formalParameter (COMMA formalParameter)*)? RPAREN
    block[true]
;

Here the text of the matched int token is replaced with float. The token's location information is preserved; only the text it is rendered as has changed.

To inspect your token stream afterwards, use the same code as before.
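To make the "location is preserved" point concrete, here is a minimal sketch of the general idea behind a rewrite stream: the original tokens are never mutated, replacements are recorded by token index, and the modified text is only assembled when it is rendered. This is a simplified stand-in and not ANTLR's implementation; `RewriteSketch`, `Tok` and their methods are hypothetical names used only for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class RewriteSketch {
    /** An immutable token: its text plus its offsets in the original input. */
    static final class Tok {
        final String text;
        final int start; // index of the first character in the original input
        final int stop;  // index of the last character in the original input

        Tok(String text, int start, int stop) {
            this.text = text;
            this.start = start;
            this.stop = stop;
        }
    }

    private final List<Tok> tokens;
    private final Map<Integer, String> replacements = new HashMap<>();

    RewriteSketch(List<Tok> tokens) {
        this.tokens = List.copyOf(tokens);
    }

    /** Records a replacement for one token without touching the token itself. */
    void replace(int tokenIndex, String newText) {
        replacements.put(tokenIndex, newText);
    }

    /** Builds the modified text by applying the recorded replacements. */
    String render() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.size(); i++) {
            sb.append(replacements.getOrDefault(i, tokens.get(i).text));
        }
        return sb.toString();
    }

    Tok original(int tokenIndex) {
        return tokens.get(tokenIndex);
    }

    public static void main(String[] args) {
        // Tokens for the input "int foo", including the whitespace token.
        RewriteSketch rs = new RewriteSketch(List.of(
                new Tok("int", 0, 2), new Tok(" ", 3, 3), new Tok("foo", 4, 6)));
        rs.replace(0, "float");
        System.out.println(rs.render());          // prints float foo
        System.out.println(rs.original(2).start); // prints 4
    }
}
```

Because the replacement lives beside the token rather than inside it, the offsets of `foo` still refer to the original source even though the rendered text is longer.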

终难遇 2024-08-28 14:27:48

I've used the sample Java grammar to create an ANTLR script to process an R.java file and rewrite all the hex values in a decompiled Android app with values of the form R.string.*, R.id.*, R.layout.* and so forth.

The key is using TokenStreamRewriter to process the tokens and then output the result.

The project (Python) is called RestoreR

The modified ANTLR listener for rewriting

I first parse the R.java file with one listener to build a mapping from integer to string, and then replace the hex values as I parse the program's Java files with a different listener that contains a rewriter instance.

# Imports from the antlr4 Python runtime; JavaParser is generated from the Java grammar.
from antlr4 import ParseTreeListener
from antlr4.TokenStreamRewriter import TokenStreamRewriter

class RValueReplacementListener(ParseTreeListener):
    replacements = 0
    r_mapping = {}
    rewriter = None

    def __init__(self, tokens):
        self.rewriter = TokenStreamRewriter(tokens)

    # Code removed for the sake of brevity

    # Enter a parse tree produced by JavaParser#integerLiteral.
    def enterIntegerLiteral(self, ctx:JavaParser.IntegerLiteralContext):
        hex_literal = ctx.HEX_LITERAL()
        if hex_literal is not None:
            int_literal = int(hex_literal.getText(), 16)
            if int_literal in self.r_mapping:
                # print('Replace: ' + ctx.getText() + ' with ' + self.r_mapping[int_literal])
                self.rewriter.replaceSingleToken(ctx.start, self.r_mapping[int_literal])
                self.replacements += 1
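For readers who want to see the two-pass idea without the generated parser, here is a rough Java sketch of the same flow: pass one reads R.java text and builds an integer-to-name mapping, pass two replaces known hex literals in another source string. It deliberately swaps the parse-tree listeners for regular expressions, so it is only an approximation: unlike the token-based approach above, it would also rewrite a matching hex literal inside a string or comment. `RMappingSketch` and the sample inputs are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RMappingSketch {
    private static final Pattern CLASS_DECL = Pattern.compile("class\\s+(\\w+)\\s*\\{");
    private static final Pattern FIELD_DECL =
            Pattern.compile("int\\s+(\\w+)\\s*=\\s*0x([0-9a-fA-F]{1,8})\\s*;");
    private static final Pattern HEX = Pattern.compile("\\b0x([0-9a-fA-F]{1,8})\\b");

    /** Pass 1: map each resource id to a name such as R.string.app_name. */
    static Map<Integer, String> readMapping(String rJava) {
        Map<Integer, String> mapping = new HashMap<>();
        String currentClass = null;
        for (String line : rJava.split("\\R")) {
            Matcher c = CLASS_DECL.matcher(line);
            if (c.find()) {
                currentClass = c.group(1); // innermost class declared so far
                continue;
            }
            Matcher f = FIELD_DECL.matcher(line);
            if (currentClass != null && f.find()) {
                int value = Integer.parseUnsignedInt(f.group(2), 16);
                mapping.put(value, "R." + currentClass + "." + f.group(1));
            }
        }
        return mapping;
    }

    /** Pass 2: replace every hex literal that has an entry in the mapping. */
    static String replaceHex(String source, Map<Integer, String> mapping) {
        Matcher m = HEX.matcher(source);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            int value = Integer.parseUnsignedInt(m.group(1), 16);
            String replacement = mapping.getOrDefault(value, m.group());
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String rJava = "public final class R {\n"
                + "  public static final class string {\n"
                + "    public static final int app_name=0x7f040000;\n"
                + "  }\n"
                + "}\n";
        Map<Integer, String> mapping = readMapping(rJava);
        System.out.println(replaceHex("setTitle(0x7f040000); int mask = 0xff;", mapping));
    }
}
```

Hex values with no entry in the mapping, such as the `0xff` mask in the sample, are left as they are.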