How to match Unicode characters in ANTLR

Posted 2024-08-18 07:53:12

I am trying to pick out all tokens in a text and need to match all ASCII and Unicode characters, so here is how I have laid them out:

fragment CHAR     :  ('A'..'Z') | ('a'..'z');
fragment DIGIT    :  ('0'..'9');
fragment UNICODE  :  '\u0000'..'\u00FF';

Now if I write my token rule as:

TOKEN  :  (CHAR|DIGIT|UNICODE)+;

I get:

"Decision can match input such as "'A'..'Z'" using multiple alternatives: 1, 3
As a result, alternative(s) 3 were disabled for that input"

"Decision can match input such as "'0'..'9'" using multiple alternatives: 2, 3
As a result, alternative(s) 3 were disabled for that input"

and nothing gets matched. The same happens if I write it as

TOKEN  :  (UNICODE)+;

Nothing gets matched.

Is there a way of doing this?

风流物 2024-08-25 07:53:12

One other thing to consider if you are planning to use Unicode: you should set the charVocabulary option to say that you want to allow any character in the Unicode range 0 through FFFE:

options
{
charVocabulary='\u0000'..'\uFFFE';
}

The default you'll usually see in the examples is

options
{
charVocabulary = '\3'..'\377';
}

To cover the point made above: generally, if you need both the ASCII character range 'A'..'Z' and the Unicode range, you'd make a Unicode lexer rule that starts above the ASCII range, like:
'\u0080'..'\ufffe'
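
Applied to the grammar in the question, that could look roughly like the sketch below. It assumes ANTLR 3 syntax (the question uses the `fragment` keyword); the grammar name `UnicodeTokens`, the parser rule `text`, and the `WS` rule are made up for illustration. The only substantive change is that UNICODE now starts at '\u0080', so it no longer overlaps CHAR or DIGIT and the "multiple alternatives" messages should not appear. The charVocabulary option is left out here; as far as I know it belongs to the ANTLR 2 lexer options block, so add it there if that is the version in use.

```antlr
grammar UnicodeTokens;   // hypothetical name, ANTLR 3 syntax assumed

// Hypothetical entry point so the combined grammar has a parser rule.
text   :  TOKEN* EOF ;

// The three alternatives are now disjoint, so each input character
// can be matched by exactly one of them.
TOKEN  :  (CHAR | DIGIT | UNICODE)+ ;

// Assumption: whitespace separates tokens and is hidden from the parser.
WS     :  (' ' | '\t' | '\r' | '\n')+ { $channel = HIDDEN; } ;

fragment CHAR     :  'A'..'Z' | 'a'..'z' ;
fragment DIGIT    :  '0'..'9' ;
fragment UNICODE  :  '\u0080'..'\uFFFE' ;   // starts above ASCII, no overlap
```

Punctuation and other ASCII characters outside these ranges are not covered by any rule in this sketch, so a real grammar would need rules for them as well.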

稚气少女 2024-08-25 07:53:12

Practically speaking, TOKEN: (UNICODE)+ is completely useless.

Since everything is a token character, if you try to use such a rule to match a Java program, say, it will simply match the whole program and return it to you as one big token.

You really do need to break your characters down into different groups if you want to split your input apart into meaningful fragments.

It might help to take a look at how the "pros" have done it. Here is a BNF grammar for Java, and here is the BNF for an identifier, which shows how they took the trouble to group the characters:

identifier 
  ::= "a..z,$,_" { "a..z,$,_,0..9,unicode character over 00C0" }
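
As a rough illustration of that grouping in ANTLR terms, the BNF above might translate into lexer rules like the following. This is a sketch in ANTLR 3 syntax, not the grammar the BNF came from: the names `GroupedTokens`, `text`, `ID_START`, `ID_PART`, `NUMBER`, and `WS` are assumptions, uppercase letters are included alongside a..z, and "unicode character over 00C0" is taken literally as the range '\u00C0'..'\uFFFE'.

```antlr
grammar GroupedTokens;   // hypothetical name, ANTLR 3 syntax assumed

// Hypothetical entry point: a stream of identifiers and numbers.
text        :  (IDENTIFIER | NUMBER)* EOF ;

// An identifier starts with a letter, '$' or '_' and continues with
// letters, digits, '$', '_' or characters from U+00C0 upward.
IDENTIFIER  :  ID_START ID_PART* ;

// Digits cannot start an identifier, so NUMBER and IDENTIFIER
// never compete for the same first character.
NUMBER      :  DIGIT+ ;

// Assumption: whitespace separates tokens and is hidden from the parser.
WS          :  (' ' | '\t' | '\r' | '\n')+ { $channel = HIDDEN; } ;

fragment ID_START  :  'a'..'z' | 'A'..'Z' | '$' | '_' ;
fragment ID_PART   :  ID_START | DIGIT | '\u00C0'..'\uFFFE' ;
fragment DIGIT     :  '0'..'9' ;
```

Because each token type begins with a different class of character, input such as `count42 7 größe` should split into three separate tokens rather than one large one. Operators and punctuation are not handled here and would need their own rules.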

About the author: 三生殊途
