How to decode Unicode escapes in an ANTLR lexer

Posted on 2024-09-25 10:36:50

I've created an ANTLR grammar using ANTLRWorks and built a localization tool for internal use. I would like to convert Unicode escape sequences into the actual Java characters while parsing, but am unsure of the best way to do this. Here are the token definitions in my grammar. Is there some way to specify an action for the fragment UNICODE_ESC that would return the character instead of the six-character escape sequence?

ID  :   ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
    ;

INT :   '0'..'9'+
    ;

COMMENT
    :   '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
    |   '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
    ;

WS  :   ( ' '
        | '\t'
        | '\r'
        | '\n'
        ) {$channel=HIDDEN;}
    ;

STRING
    :  '"' ( ESC_SEQ | ~('\\'|'"') )* '"'
    ;

fragment
HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;

fragment
ESC_SEQ
    :   '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
    |   UNICODE_ESC
    |   OCTAL_ESC
    ;

fragment
OCTAL_ESC
    :   '\\' ('0'..'3') ('0'..'7') ('0'..'7')
    |   '\\' ('0'..'7') ('0'..'7')
    |   '\\' ('0'..'7')
    ;

fragment
UNICODE_ESC
    :   '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
    ;

无风消散 2024-10-02 10:36:50

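Michael wrote:
This is in Java, so representation shouldn't be an issue for Character or String.

Yeah, but in a Java source file, Unicode escapes look just the same, so I'm not sure what you mean.

Michael wrote:
I am just wondering how to do the replacement. If it makes it easier, say I want to replace all UNICODE_ESC fragments with the character '?' while parsing.

Okay, that can be done like this: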

Token : 'x' {setText("?");} ;

Here Token matches the literal x, and its text is then rewritten to ?.
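
For the actual goal in the question (getting the real character rather than '?'), one option is to apply the same setText idea to the whole STRING token and do the decoding in a small Java helper. The following is only a sketch under assumptions: the class name UnicodeEscapes and the method decode are hypothetical and not part of ANTLR, it handles only the four-hex-digit form matched by UNICODE_ESC, and it deliberately leaves all other escapes (octal, \n, and so on) untouched.

```java
public final class UnicodeEscapes {
    private UnicodeEscapes() {}

    /**
     * Hypothetical helper: replaces each backslash + 'u' + four hex digits
     * with the UTF-16 code unit those digits name. Everything else is copied as-is.
     */
    public static String decode(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c == '\\' && i + 1 < text.length()) {
                char next = text.charAt(i + 1);
                if (next == 'u' && i + 6 <= text.length() && isHex(text, i + 2, i + 6)) {
                    // Parse the four hex digits and emit the character they encode.
                    out.append((char) Integer.parseInt(text.substring(i + 2, i + 6), 16));
                    i += 6;
                    continue;
                }
                // Any other escape is copied through as a pair, so an escaped
                // backslash can never be mistaken for the start of a Unicode escape.
                out.append(c).append(next);
                i += 2;
                continue;
            }
            out.append(c);
            i++;
        }
        return out.toString();
    }

    /** Returns true if every character in [from, to) is an ASCII hex digit. */
    private static boolean isHex(String s, int from, int to) {
        for (int k = from; k < to; k++) {
            char ch = s.charAt(k);
            boolean hex = (ch >= '0' && ch <= '9')
                    || (ch >= 'a' && ch <= 'f')
                    || (ch >= 'A' && ch <= 'F');
            if (!hex) {
                return false;
            }
        }
        return true;
    }
}
```

Assuming the ANTLR 3 Java target, where lexer actions can call getText() and setText(), the helper could presumably be hooked into the STRING rule with an action along the lines of {setText(UnicodeEscapes.decode(getText()));}. Doing the rewrite on the enclosing token rather than on the fragment sidesteps the fact that fragment rules do not produce tokens of their own.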
