Parsing a string with ANTLR4

Posted on 2025-01-16 17:18:32

Example: (CHGA/B234A/B231

String:
        a) Designator: 3 LETTERS
        b) Message number (OPTIONAL): 1 to 4 LETTERS, followed by A SLASH (/) followed by 1 to 4 LETTERS, followed by 3 NUMBERS indicating the serial number.
        c) Reference data (OPTIONAL): 1 to 4 LETTERS, followed by A SLASH (/) followed by 1 to 4 LETTERS, followed by 3 NUMBERS indicating the serial number.

Expected result:
 CHG
 A/B234
 A/B231

In grammar file:

/*
 * Parser Rules
 */

tipo3: designador idmensaje? idmensaje?;
designador: PARENTHESIS CHG;
idmensaje: LETTER4 SLASH LETTER4 DIGIT3;

/*
 * Lexer Rules
 */

CHG     : 'CHG' ;

fragment DIGIT      : [0-9] ;
fragment LETTER     : [a-zA-Z] ;

SLASH               : '/' ;
PARENTHESIS         : '(' ;

DIGIT3              : DIGIT DIGIT DIGIT ;
LETTER4             : LETTER LETTER? LETTER? LETTER? ;

But when I test the tipo3 rule, it gives me the following message:

line 1:1 missing 'CHG' at 'CHGA'

How can I parse that string in ANTLR4?

牵强ㄟ 2025-01-23 17:18:32

When you're confused about why a certain parser rule is not being matched, always start with the lexer. Dump the tokens your lexer produces to stdout. Here's how you can do that:

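// I've placed your grammar in a file called T.g4 (hence the name `TLexer`)
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.Token;

String source = "(CHGA/B234A/B231";
TLexer lexer = new TLexer(CharStreams.fromString(source));
CommonTokenStream stream = new CommonTokenStream(lexer);
stream.fill(); // force the lexer to tokenize the whole input

// Print each token's symbolic name next to the text it matched
for (Token t : stream.getTokens()) {
  System.out.printf("%-20s `%s`%n",
      TLexer.VOCABULARY.getSymbolicName(t.getType()),
      t.getText().replace("\n", "\\n"));
}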

If you run the Java code above, this will be printed:

PARENTHESIS          `(`
LETTER4              `CHGA`
SLASH                `/`
LETTER4              `B`
DIGIT3               `234`
LETTER4              `A`
SLASH                `/`
LETTER4              `B`
DIGIT3               `231`
EOF                  `<EOF>`

As you can see, CHGA becomes a single LETTER4 token, not a CHG token followed by a LETTER4 token. Try changing the rule to LETTER4 : LETTER; and re-test: you will now get the expected tokens for this input.

In your current grammar, CHGA will always become a single LETTER4 token. This is how ANTLR's lexer works: it tries to match as many characters as possible for a single rule, and when two rules match the same number of characters, the one defined first wins. You cannot change this behavior.
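To make the longest-match behavior concrete, here is a minimal sketch in plain Java that imitates it with java.util.regex. It is not ANTLR's actual implementation, and the class name MaximalMunchDemo and its rule tables are hypothetical, made up for illustration. At each position it tries every rule, keeps the longest match, and breaks ties in favor of the rule listed first.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MaximalMunchDemo {
    // Rules in definition order as {token name, regex}; mirrors the question's lexer.
    public static final String[][] ORIGINAL = {
        {"CHG", "CHG"},
        {"SLASH", "/"},
        {"PARENTHESIS", "\\("},
        {"DIGIT3", "[0-9]{3}"},
        {"LETTER4", "[a-zA-Z]{1,4}"},
    };

    // Same rules, except LETTER4 matches exactly one letter.
    public static final String[][] SINGLE_LETTER = {
        {"CHG", "CHG"},
        {"SLASH", "/"},
        {"PARENTHESIS", "\\("},
        {"DIGIT3", "[0-9]{3}"},
        {"LETTER4", "[a-zA-Z]"},
    };

    public static List<String> tokenize(String input, String[][] rules) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestName = null;
            int bestLen = 0;
            for (String[] rule : rules) {
                Matcher m = Pattern.compile(rule[1]).matcher(input);
                m.region(pos, input.length());
                // lookingAt() anchors the match at the start of the region.
                // The strict '>' keeps the earlier rule when lengths are equal.
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    bestLen = m.end() - pos;
                    bestName = rule[0];
                }
            }
            if (bestName == null) {
                throw new IllegalArgumentException("no rule matches at index " + pos);
            }
            tokens.add(bestName + ":" + input.substring(pos, pos + bestLen));
            pos += bestLen;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // LETTER4 matches 4 characters at "CHGA", CHG only 3, so LETTER4 wins.
        System.out.println(tokenize("(CHGA/B234", ORIGINAL));
        // With a one-letter LETTER4, CHG is the longer match and wins.
        System.out.println(tokenize("(CHGA/B234", SINGLE_LETTER));
    }
}
```

With the original rules the second token comes out as LETTER4:CHGA; with the single-letter variant it comes out as CHG:CHG followed by LETTER4:A, which is the split the parser rule designador needs.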

What you could do is move the construction of the multi-letter rule to the parser instead of the lexer:

tipo3       : designador idmensaje? idmensaje?;
designador  : PARENTHESIS CHG;
idmensaje   : letter4 SLASH letter4 DIGIT3;
letter4     : LETTER LETTER? LETTER? LETTER?
            | CHG
            ;

CHG         : 'CHG' ;
LETTER      : [a-zA-Z] ;
SLASH       : '/';
PARENTHESIS : '(';
DIGIT3      : DIGIT DIGIT DIGIT;

fragment DIGIT : [0-9];

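With this grammar, the example input parses into a designador node for (CHG followed by two idmensaje nodes, A/B234 and A/B231.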
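For readers without the ANTLR tooling at hand, the revised grammar can be approximated by a small hand-written lexer and recursive-descent parser. This is only a sketch under that assumption: the class Tipo3Sketch and its helper methods are hypothetical and are not what ANTLR generates, but each method follows one rule of the grammar above.

```java
import java.util.ArrayList;
import java.util.List;

public class Tipo3Sketch {
    private final List<String[]> tokens; // each token is a {type, text} pair
    private int pos = 0;

    public Tipo3Sketch(String input) { this.tokens = lex(input); }

    // Lexer step: mirrors CHG, LETTER, SLASH, PARENTHESIS and DIGIT3.
    static List<String[]> lex(String s) {
        List<String[]> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (s.startsWith("CHG", i)) {          // longer match beats LETTER
                out.add(new String[] {"CHG", "CHG"});
                i += 3;
            } else if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) {
                out.add(new String[] {"LETTER", String.valueOf(c)});
                i++;
            } else if (c == '/') {
                out.add(new String[] {"SLASH", "/"});
                i++;
            } else if (c == '(') {
                out.add(new String[] {"PARENTHESIS", "("});
                i++;
            } else if (s.length() - i >= 3 && s.substring(i, i + 3).matches("[0-9]{3}")) {
                out.add(new String[] {"DIGIT3", s.substring(i, i + 3)});
                i += 3;
            } else {
                throw new IllegalArgumentException("unexpected input at index " + i);
            }
        }
        return out;
    }

    private boolean peek(String type) {
        return pos < tokens.size() && tokens.get(pos)[0].equals(type);
    }

    private String expect(String type) {
        if (!peek(type)) {
            throw new IllegalStateException("expected " + type + " at token " + pos);
        }
        return tokens.get(pos++)[1];
    }

    // tipo3 : designador idmensaje? idmensaje? ;
    public List<String> tipo3() {
        List<String> parts = new ArrayList<>();
        parts.add(designador());
        for (int k = 0; k < 2 && (peek("LETTER") || peek("CHG")); k++) {
            parts.add(idmensaje());
        }
        return parts;
    }

    // designador : PARENTHESIS CHG ;
    private String designador() {
        expect("PARENTHESIS");
        return expect("CHG");
    }

    // idmensaje : letter4 SLASH letter4 DIGIT3 ;
    private String idmensaje() {
        return letter4() + expect("SLASH") + letter4() + expect("DIGIT3");
    }

    // letter4 : LETTER LETTER? LETTER? LETTER? | CHG ;
    private String letter4() {
        if (peek("CHG")) return expect("CHG");
        StringBuilder sb = new StringBuilder(expect("LETTER"));
        for (int k = 0; k < 3 && peek("LETTER"); k++) sb.append(expect("LETTER"));
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(new Tipo3Sketch("(CHGA/B234A/B231").tipo3());
    }
}
```

Running main prints [CHG, A/B234, A/B231], the three parts asked for in the question. The lexer only ever produces single letters (or CHG), and the grouping into one-to-four-letter runs happens in letter4, which is the point of moving that rule into the parser.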
