ANTLR: reusing a subset of tokens

Posted on 2025-01-13 11:38:12

In my ANTLR grammar I have a set of operations (OP):

OP: 'I1', 'I2', ..., 'I9', 'I10' (a set of tokens);

Which operations are valid depends on the token that precedes them:

CASE : valid operations are 'I1','I2','I3' (move1);
SWITCH : valid operations are 'I2','I4','I5' (move2);

The remaining operations are used by other instructions.

Of course, in my lexer I can't define two tokens like this:

OP_MOVE1      : 'I1' | 'I2' | 'I3';
OP_MOVE2      : 'I2' | 'I4' | 'I5';

because I would get:

OP_MOVE2 values unreachable. I2 is always overlapped by token OP_MOVE1

Moreover, imagine that the operations run not just from I1 to I10 but from I1 to I5000.

One possible solution might be:

LEXER.g4:

lexer grammar LexerComment;

CASE   : 'CASE' -> pushMode(CASE_MODE); 
SWITCH : 'SWITCH' -> pushMode(CASE_SWITCH); 

WS  : [ \t] -> skip ;   
EOL : [\r\n]+;  

// ------------ Everything INSIDE a CASE ------------ 
mode CASE_MODE;
CASE_MODE_MOVE1 : 'I1' | 'I2' | 'I3'; 

CASE_MODE_WS        : [ \t] -> channel(HIDDEN) ;   
CASE_MODE_EOL : EOL -> type(EOL),popMode;

// ------------ Everything INSIDE a SWITCH ------------ 
mode CASE_SWITCH;
CASE_SWITCH_MOVE2: 'I2' | 'I4' | 'I5';

CASE_SWITCH_WS       : [ \t] -> channel(HIDDEN) ;   
CASE_SWITCH_EOL : EOL -> type(EOL),popMode;

PARSER.g4:

parser grammar ParserComment;

options {
      tokenVocab = LexerComment;
  }

prog : (line? EOL)+;   
line : instruction; 
 
instruction: CASE CASE_MODE_MOVE1
            |SWITCH CASE_SWITCH_MOVE2;

Input file:

CASE I1
CASE I2
CASE I3
SWITCH I2
SWITCH I4
SWITCH I5

The grammar seems to work correctly, but I'm not satisfied with this solution: it requires a lot of code, one mode per case, and tokens shared between modes have to be repeated in each of them.

There is another problem. Suppose I also wanted to recognize, in addition to CASE and SWITCH, a line that begins with a MOVE1 or MOVE2 operation, for example:

 instruction: CASE CASE_MODE_MOVE1
             |SWITCH CASE_SWITCH_MOVE2
             | MOVE1 ;

I have not found a good solution for this:

I cannot reuse the previous MOVE1 (CASE_MODE_MOVE1);
I cannot define a new mode.

Is there a way to handle cases like this correctly?
Given a set of tokens, I would like a subset that can be used depending on the context.

Ideally I would also like to avoid having to define every single operation as a basic token:

fragment I1: 'I1';
fragment I2: 'I2';
etc
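
As a rough sketch of how per-operation rules could be avoided (not part of the original grammar): every operation is lexed as one generic OP token, and a semantic predicate in the parser checks whether its text belongs to the subset allowed in that context. The grammar name OpSubset, the rule names case_op and switch_op, and the CASE_OPS and SWITCH_OPS sets are hypothetical, and the embedded code assumes the Java target on Java 9 or later (for Set.of).

```antlr
grammar OpSubset;

@parser::members {
    // Hypothetical lookup tables for the allowed subsets (assumption:
    // Java target, Java 9+). They could equally be loaded from a file.
    private static final java.util.Set<String> CASE_OPS =
        java.util.Set.of("I1", "I2", "I3");
    private static final java.util.Set<String> SWITCH_OPS =
        java.util.Set.of("I2", "I4", "I5");
}

prog
 : instruction* EOF
 ;

instruction
 : CASE case_op EOL
 | SWITCH switch_op EOL
 | EOL
 ;

// The predicate looks at the text of the next token before OP is matched.
case_op
 : {CASE_OPS.contains(_input.LT(1).getText())}? OP
 ;

switch_op
 : {SWITCH_OPS.contains(_input.LT(1).getText())}? OP
 ;

CASE   : 'CASE';
SWITCH : 'SWITCH';
OP     : 'I' [0-9]+;          // one rule covers I1 .. I5000
EOL    : '\r'? '\n' | '\r';
SPACES : [ \t]+ -> skip;
```

With this layout the lexer stays at a single OP rule however many operations exist, and an operation outside the allowed subset should surface as a failed-predicate syntax error in the parser. The trade-off is that the grammar becomes tied to one target language.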

EDIT:
In my grammar the same token can have different meanings.

For example, with the following grammar the token I1 has two different meanings.

Parser:

parser grammar ParserComment;

options {
      tokenVocab = LexerComment;
  }

prog : (line? EOL)+;   
line : instruction; 
instruction: CONTEXT case_instruction; 

case_instruction
 : I1
 | I2
 | I3
 ;

Lexer:

lexer grammar LexerComment;

// I1 IS A CONTEXT
CONTEXT: I1 | CASE;

CASE   : 'CASE';
SWITCH : 'SWITCH';
//OPERATIONS
I1     : 'I1';
I2     : 'I2';
I3     : 'I3';
I4     : 'I4';
I5     : 'I5';

WS  : [ \t] -> skip ;
EOL : [\r\n]+;

Despite having the same name, I1 has two completely different meanings (CONTEXT and OP). I would like to recognize these two cases and avoid having a common I1:

prog : (line? EOL)+;
line : instruction;
instruction: context case_instruction;

context: I1 | CASE;

case_instruction
 : I1
 | I2
 | I3
 ;

This is why I tried to handle it with modes in the lexer.

尴尬癌患者 2025-01-20 11:38:12

IMO, you should not let the lexer decide when certain tokens should be created. Let the parser decide when a token is correct in a certain spot.

Something like this:

prog
 : instruction* EOF
 ;

instruction
 : CASE case_instruction EOL
 | SWITCH switch_instruction EOL
 | EOL
 ;

case_instruction
 : I1
 | I2
 | I3
 ;

switch_instruction
 : I2
 | I4
 | I5
 ;

CASE   : 'CASE';
SWITCH : 'SWITCH';
I1     : 'I1';
I2     : 'I2';
I3     : 'I3';
I4     : 'I4';
I5     : 'I5';
EOL    : '\r'? '\n' | '\r';
SPACES : [ \t]+ -> skip;
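
Building on the grammar above, the question's two remaining cases could probably be handled the same way, because parser rules can reference the same token in several places. A minimal sketch, assuming the grammar name Ops (hypothetical), a context rule adapted from the question's EDIT, and an extra alternative for lines that start directly with a MOVE1 operation:

```antlr
grammar Ops;

prog
 : instruction* EOF
 ;

instruction
 : context case_instruction EOL    // e.g. "CASE I2" or "I1 I3"
 | SWITCH switch_instruction EOL   // e.g. "SWITCH I4"
 | case_instruction EOL            // a line that starts with a MOVE1 operation
 | EOL
 ;

// I1 acts as a context here and as an operation in case_instruction;
// the lexer still produces one and the same I1 token in both places.
context
 : CASE
 | I1
 ;

case_instruction
 : I1
 | I2
 | I3
 ;

switch_instruction
 : I2
 | I4
 | I5
 ;

CASE   : 'CASE';
SWITCH : 'SWITCH';
I1     : 'I1';
I2     : 'I2';
I3     : 'I3';
I4     : 'I4';
I5     : 'I5';
EOL    : '\r'? '\n' | '\r';
SPACES : [ \t]+ -> skip;
```

The meaning of I1 then comes from the parse tree: a listener or visitor sees it either under context or under case_instruction, so no lexer modes or duplicated token rules are needed.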
