ANTLR 不匹配 unicode 转义字符

发布于 2024-12-08 13:48:17 字数 639 浏览 0 评论 0原文

我正在为类 C 语言编写一个解析器/解释器，我需要解释转义字符。其中之一是带有此模式“\uXXXX”的 unicode 转义序列，其中 X 是某个十六进制数字。

我的 ANTLR 规则如下所示：

public char returns [char c] 
    : '\\"' { $c = '"'; } 
    | '\\\\' { $c = '\\'; }
    | '\\/' { $c = '/'; }
    | '\\b' { $c = '\b'; }
    | '\\f' { $c = '\f'; }
    | '\\n' { $c = '\n'; }
    | '\\r' { $c = '\r'; }
    | '\\t' { $c = '\t'; }
    | '\\u' HEXDIGIT HEXDIGIT HEXDIGIT HEXDIGIT { $c = 'e'; }
    | ~('\\' | '"') { $c = '/'; }
    ;

fragment HEXDIGIT
    : ('0'..'9'|'a'..'f'|'A'..'F')

我向它提供这个字符串“\u1234”，我期望它是“e”，但我得到的是“/”，这是其他所有内容的后备规则。

是否有一些魔法符咒正在发生，碎片和规则或者我不知道的东西？

原文

I'm writing a parser/interpreter for a C-like language and I need to interpret escaped characters. One of them is the unicode-escaped sequence with this pattern "\uXXXX" where X is some hex number.

My ANTLR rules look like this:

public char returns [char c] 
    : '\\"' { $c = '"'; } 
    | '\\\\' { $c = '\\'; }
    | '\\/' { $c = '/'; }
    | '\\b' { $c = '\b'; }
    | '\\f' { $c = '\f'; }
    | '\\n' { $c = '\n'; }
    | '\\r' { $c = '\r'; }
    | '\\t' { $c = '\t'; }
    | '\\u' HEXDIGIT HEXDIGIT HEXDIGIT HEXDIGIT { $c = 'e'; }
    | ~('\\' | '"') { $c = '/'; }
    ;

fragment HEXDIGIT
    : ('0'..'9'|'a'..'f'|'A'..'F')

I'm feeding it this string "\u1234" for which I expect an 'e' but I'm getting a '/' instead which is the fallback rule for everything else.

Is there some magic juju going on with fragments and rules or something that I'm not aware of?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

十年九夏 2024-12-15 13:48:17

正如 Adam 所提到的， char 目前是一个解析器规则，但应该改为词法分析器规则，在这种情况下，你不能让它返回 char （词法分析器规则始终返回 Token 的实例！）。

您可以使用其 setText(...) 方法调整令牌的内部文本，如下所示（假设 Java 是目标语言）：

// lexer rules start with a capital!
Char
  :  '\\"'                                     { setText("\""); } 
  |  '\\\\'                                    { setText("\\"); } 
  |  '\\/'                                     { setText("/"); } 
  |  '\\b'                                     { setText("\b"); } 
  |  '\\f'                                     { setText("\f"); } 
  |  '\\n'                                     { setText("\n"); } 
  |  '\\r'                                     { setText("\r"); } 
  |  '\\t'                                     { setText("\t"); } 
  |  '\\u' HEXDIGIT HEXDIGIT HEXDIGIT HEXDIGIT 
     { 
       String hex = getText();
       int i = Integer.parseInt(hex.substring(2), 16);
       setText(hex + " base 10 = " + i);
     } 
  |  ~('\\' | '"')
  ;

fragment HEXDIGIT
  :  ('0'..'9'|'a'..'f'|'A'..'F')
  ;

As mentioned by Adam, char is a parser rule at the moment, but should be made a lexer rule instead, in which case you can't let it return a char (lexer rules always return an instance of a Token!).

You can adjust the inner-text of a token using its setText(...) method like this (assuming Java is the target language):

// lexer rules start with a capital!
Char
  :  '\\"'                                     { setText("\""); } 
  |  '\\\\'                                    { setText("\\"); } 
  |  '\\/'                                     { setText("/"); } 
  |  '\\b'                                     { setText("\b"); } 
  |  '\\f'                                     { setText("\f"); } 
  |  '\\n'                                     { setText("\n"); } 
  |  '\\r'                                     { setText("\r"); } 
  |  '\\t'                                     { setText("\t"); } 
  |  '\\u' HEXDIGIT HEXDIGIT HEXDIGIT HEXDIGIT 
     { 
       String hex = getText();
       int i = Integer.parseInt(hex.substring(2), 16);
       setText(hex + " base 10 = " + i);
     } 
  |  ~('\\' | '"')
  ;

fragment HEXDIGIT
  :  ('0'..'9'|'a'..'f'|'A'..'F')
  ;

回复收藏 0 原文

~没有更多了~