antlr

Antlr lexer tokens that match similar strings: what if the greedy lexer makes a mistake?

Posted 2024-12-26 00:44:41

It seems that sometimes the Antlr lexer makes a bad choice on which rule to use when tokenizing a stream of characters... I'm trying to figure out how to help Antlr make the obvious-to-a-human right choice. I want to parse text like this:

d/dt(x)=a
a=d/dt
d=3
dt=4

This is an unfortunate syntax used by an existing language that I'm trying to write a parser for. The "d/dt(x)" represents the left-hand side of a differential equation. Ignore the lingo if you must; just know that it is not "d" divided by "dt". However, the second occurrence of "d/dt" really is "d" divided by "dt".

Here's my grammar:

grammar diffeq_grammar;

program :   (statement? NEWLINE)*;

statement
    :   diffeq
    |   assignment;

diffeq  :   DDT ID ')' '=' ID;

assignment
    :   ID '=' NUMBER
    |   ID '=' ID '/' ID
    ;

DDT :   'd/dt(';
ID  :   'a'..'z'+;
NUMBER  :   '0'..'9'+;
NEWLINE :   '\r\n'|'\r'|'\n';

When using this grammar the lexer grabs the first "d/dt(" and turns it to the token DDT. Perfect! Now later the lexer sees the second "d" followed by a "/" and says "hmmm, I can match this as an ID and a '/' or I can be greedy and match DDT". The lexer chooses to be greedy... but little does it know, there is no "(" a few characters later in the input stream. When the lexer looks for the missing "(" it throws a MismatchedTokenException!
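The failure described above can be pictured with a toy tokenizer in plain Java. This is only a sketch of the "commit early, never fall back" behaviour, not ANTLR's generated code; the class name GreedyLexerSketch, the token labels, and the choice of IllegalStateException are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class GreedyLexerSketch {
    static final String DDT = "d/dt(";

    /** Toy tokenizer: commits to DDT as soon as it sees "d/", with no fallback to ID. */
    static List<String> tokenize(String src) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (src.startsWith("d/", i)) {
                // Decision already made after two characters; the rest must match.
                if (!src.startsWith(DDT, i)) {
                    throw new IllegalStateException("mismatched input at index " + i);
                }
                tokens.add("DDT");
                i += DDT.length();
            } else if (c >= 'a' && c <= 'z') {
                int start = i;
                while (i < src.length() && src.charAt(i) >= 'a' && src.charAt(i) <= 'z') {
                    i++;
                }
                tokens.add("ID:" + src.substring(start, i));
            } else if (c >= '0' && c <= '9') {
                int start = i;
                while (i < src.length() && src.charAt(i) >= '0' && src.charAt(i) <= '9') {
                    i++;
                }
                tokens.add("NUMBER:" + src.substring(start, i));
            } else if (c == '\n') {
                tokens.add("NEWLINE");
                i++;
            } else {
                // Single-character literal such as ')' or '='.
                tokens.add("'" + c + "'");
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("d/dt(x)=a\n"));
        try {
            tokenize("a=d/dt\n");
        } catch (IllegalStateException e) {
            System.out.println("failed: " + e.getMessage());
        }
    }
}
```

The first line tokenizes cleanly, while the second throws because the tokenizer has already chosen DDT by the time it discovers that the "(" is missing.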

The only solution I've found so far, is to move all the rules into the parser with a grammar like:

grammar diffeq_grammar;

program :   (statement? NEWLINE)*;

statement
    :   diffeq
    |   assignment;

diffeq  :   ddt id ')' '=' id;

assignment
    :   id '=' number
    |   id '=' id '/' id
    ;

ddt :   'd' '/' 'd' 't' '(';
id  :   CHAR+;
number  :   DIGIT+;
CHAR    :   'a'..'z';
DIGIT   :   '0'..'9';
NEWLINE :   '\r\n'|'\r'|'\n';

This would be a fine solution if I didn't already have thousands of lines of working code that depend on the first grammar. After spending 2 days researching this problem, I have come to the conclusion that a lexer really ought to be able to distinguish the two cases. At some point the Antlr lexer is deciding between two rules: DDT and ID. It chooses DDT because the lexer is greedy. But when matching DDT fails, I'd like the lexer to go back to using ID.

I'm okay with using predicates or other tricks as long as the grammar remains basically the same (i.e., the rules in the lexer, stay in the lexer. And most rules are left untouched.).

Ideally I can modify the lexer rule for DDT with any valid Antlr code... and be done.

My target language is Java.

Thanks!

UPDATE

Thank you guys for some great answers!! I accepted the answer that best fit my question. The actual solution I used is in my own answer (not the accepted answer), and there are more answers that could have worked. Readers, check out all the answers; some of them may suit your case better than mine.

溇涏 2025-01-02 00:44:41

I'm okay with using predicates or other tricks as long as the grammar remains basically the same (i.e., the rules in the lexer, stay in the lexer. And most rules are left untouched.).

In that case, force the lexer to look ahead in the char-stream to make sure there really is "d/dt(" using a gated semantic predicate (the {...}?=> in front of the DDT rule below).

A demo:

grammar diffeq_grammar;

@parser::members {
  public static void main(String[] args) throws Exception {
    String src = 
        "d/dt(x)=a\n" +
        "a=d/dt\n" +
        "d=3\n" +
        "dt=4\n";
    diffeq_grammarLexer lexer = new diffeq_grammarLexer(new ANTLRStringStream(src));
    diffeq_grammarParser parser = new diffeq_grammarParser(new CommonTokenStream(lexer));
    parser.program();
  }
}

@lexer::members {
  private boolean ahead(String text) {
    for(int i = 0; i < text.length(); i++) {
      if(input.LA(i + 1) != text.charAt(i)) {
        return false;
      }
    }
    return true;
  }
}

program
 : (statement? NEWLINE)* EOF
 ;

statement
 : diffeq     {System.out.println("diffeq     : " + $text);}
 | assignment {System.out.println("assignment : " + $text);}
 ;

diffeq
 : DDT ID ')' '=' ID
 ;

assignment
 : ID '=' NUMBER
 | ID '=' ID '/' ID
 ;

DDT     : {ahead("d/dt(")}?=> 'd/dt(';
ID      : 'a'..'z'+;
NUMBER  : '0'..'9'+;
NEWLINE : '\r\n' | '\r' | '\n';

If you now run the demo:

java -cp antlr-3.3.jar org.antlr.Tool diffeq_grammar.g
javac -cp antlr-3.3.jar *.java
java -cp .:antlr-3.3.jar diffeq_grammarParser

(when using Windows, replace the : with ; in the last command)

you will see the following output:

diffeq     : d/dt(x)=a
assignment : a=d/dt
assignment : d=3
assignment : dt=4
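To see the lookahead check on its own, here is a small sketch in plain Java that does not depend on the ANTLR runtime. The la helper stands in for input.LA(k), returning -1 past the end of the input, and tokenize is a hand-written toy rather than the generated lexer. The class name LookaheadLexerSketch and the token labels are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class LookaheadLexerSketch {
    /** Stand-in for input.LA(k): the k-th character ahead (1-based), or -1 past the end. */
    static int la(String src, int pos, int k) {
        int idx = pos + k - 1;
        return idx < src.length() ? src.charAt(idx) : -1;
    }

    /** Same loop as ahead(...) in the grammar, but over a String and an explicit position. */
    static boolean ahead(String src, int pos, String text) {
        for (int i = 0; i < text.length(); i++) {
            if (la(src, pos, i + 1) != text.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    /** Toy tokenizer: DDT is only considered when the whole "d/dt(" is really there. */
    static List<String> tokenize(String src) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (ahead(src, i, "d/dt(")) {
                tokens.add("DDT");
                i += "d/dt(".length();
            } else if (c >= 'a' && c <= 'z') {
                int start = i;
                while (i < src.length() && src.charAt(i) >= 'a' && src.charAt(i) <= 'z') {
                    i++;
                }
                tokens.add("ID:" + src.substring(start, i));
            } else if (c >= '0' && c <= '9') {
                int start = i;
                while (i < src.length() && src.charAt(i) >= '0' && src.charAt(i) <= '9') {
                    i++;
                }
                tokens.add("NUMBER:" + src.substring(start, i));
            } else if (c == '\n') {
                tokens.add("NEWLINE");
                i++;
            } else {
                // Single-character literal such as '/', ')' or '='.
                tokens.add("'" + c + "'");
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("d/dt(x)=a\n"));
        System.out.println(tokenize("a=d/dt\n"));
    }
}
```

Because the guard runs before the DDT branch is entered, "a=d/dt" falls through to ID, '/', ID instead of failing halfway through the literal.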

只是我以为 2025-01-02 00:44:41

Although this is not what you are trying to do, considering the large amount of working code that you have in the project, you should still consider separating your parser and lexer more thoroughly. It is best to let the parser and the lexer do what they do best, rather than "fusing" them together. The most obvious indication of something being wrong is the lack of symmetry between your ( and ) tokens: one is part of a composite token, while the other one is a stand-alone token.

If refactoring is at all an option, you could change the parser and lexer like this:

grammar diffeq_grammar;

program :   (statement? NEWLINE)* EOF; // <-- You forgot EOF

statement
    :   diffeq
    |   assignment;

diffeq  :   D OVER DT OPEN id CLOSE EQ id; // <-- here, id is a parser rule

assignment
    :   id EQ NUMBER
    |   id EQ id OVER id
    ;

id  : ID | D | DT; // <-- Nice trick, isn't it?

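D       : 'd';   // lowercase, to match the "d" in the sample input
DT      : 'dt';  // lowercase, to match the "dt" in the sample input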
OVER    : '/';
EQ      : '=';
OPEN    : '(';
CLOSE   : ')';
ID      : 'a'..'z'+;
NUMBER  : '0'..'9'+;
NEWLINE : '\r\n'|'\r'|'\n';

You may need to enable backtracking and memoization for this to work (but try compiling it without backtracking first).

不乱于心 2025-01-02 00:44:41

Here's the solution I finally used. I know it violates one of my requirements (to keep lexer rules in the lexer and parser rules in the parser), but as it turns out, moving DDT to ddt required no change in my code. Also, dasblinkenlight makes some good points about mismatched parentheses in his answer and comments.

grammar ddt_problem;

program :   (statement? NEWLINE)*;

statement
    :   diffeq
    |   assignment;

diffeq  :   ddt ID ')' '=' ID;

assignment
    :   ID '=' NUMBER
    |   ID '=' ID '/' ID
    ;

ddt :   ( d=ID ) { $d.getText().equals("d") }? '/' ( dt=ID ) { $dt.getText().equals("dt") }? '(';
ID  :   'a'..'z'+;
NUMBER  :   '0'..'9'+;
NEWLINE :   '\r\n'|'\r'|'\n';
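
The idea behind the ddt rule, letting the lexer emit plain ID tokens and checking their text afterwards, can be sketched in plain Java without ANTLR. The class DdtPredicateSketch and its classify method are hypothetical names for illustration; the token texts are passed in as simple strings rather than real Token objects.

```java
import java.util.Arrays;
import java.util.List;

public class DdtPredicateSketch {
    /** Mirrors the two predicates: first ID must be "d", second ID must be "dt". */
    static boolean startsWithDdt(List<String> tokenTexts) {
        return tokenTexts.size() >= 4
                && tokenTexts.get(0).equals("d")
                && tokenTexts.get(1).equals("/")
                && tokenTexts.get(2).equals("dt")
                && tokenTexts.get(3).equals("(");
    }

    /** Toy classification of one statement from its token texts. */
    static String classify(List<String> tokenTexts) {
        return startsWithDdt(tokenTexts) ? "diffeq" : "assignment";
    }

    public static void main(String[] args) {
        List<String> first = Arrays.asList("d", "/", "dt", "(", "x", ")", "=", "a");
        List<String> second = Arrays.asList("a", "=", "d", "/", "dt");
        System.out.println(classify(first));
        System.out.println(classify(second));
    }
}
```

Since "d" and "dt" stay ordinary identifiers at the token level, the same texts remain usable on the right-hand side of an assignment.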
