Lexer and Unicode, e.g. German umlauts

Posted 2024-11-30 01:16:13

Two questions:
1. Why is the string abäcd not recognized (ANTLRWorks 1.4.2) with the grammar below? The result is only abcd, i.e. the German umlaut ä is missing.
2. How can I split Vowels into VowelsUpper and VowelsLower and use both rules in the rule Vowels?

grammar Vowels1a;

CharLower
  : 'a'..'z'
  ;

Vowels
  : 'ä' | 'ö' | 'ü' | 'Ä'| 'Ö' | 'Ü'
  ;

test
  : ( CharLower | Vowels )+
  ;

Comments (2)

所谓喜欢 2024-12-07 01:16:13

ANTLRStarter wrote:

1. Why is the string abäcd not recognized (ANTLRWorks 1.4.2) with the grammar below (the result is only abcd, that means the German mutated vowel ä is missing)?

I could not reproduce this. Both ANTLRWorks' interpreter and debugger (1.4.2) produce the following parse tree:

[image: parse tree]

And a small manual test also shows this:

Main.java

import org.antlr.runtime.*;

public class Main {
  public static void main(String[] args) throws Exception {
    Vowels1aLexer lexer = new Vowels1aLexer(new ANTLRStringStream("abäcd"));
    Vowels1aParser parser = new Vowels1aParser(new CommonTokenStream(lexer));
    parser.test();
  }
}

Vowels1a.g

grammar Vowels1a;

test
 : ( CharLower {System.out.println("CharLower :: " + $CharLower.text);}
   | Vowels    {System.out.println("Vowels    :: " + $Vowels.text);}
   )+
 ;

CharLower
 : 'a'..'z'
 ;

Vowels
 : 'ä' | 'ö' | 'ü' | 'Ä'| 'Ö' | 'Ü'
 ;

And to run the demo:

java -cp antlr-3.3.jar org.antlr.Tool Vowels1a.g 
javac -cp antlr-3.3.jar *.java
java -cp .:antlr-3.3.jar Main

which will print:

CharLower :: a
CharLower :: b
Vowels    :: ä
CharLower :: c
CharLower :: d

ANTLRStarter wrote:

2. How can I divide Vowels into VowelsUpper and VowelsLower and use both rules in rule Vowels?

Create two fragment rules (VowelsUpper and VowelsLower) and let Vowels match both these fragments:

Vowels
 : VowelsUpper
 | VowelsLower
 ;

fragment VowelsUpper
 : 'Ä'| 'Ö' | 'Ü'
 ;

fragment VowelsLower
 : 'ä' | 'ö' | 'ü'
 ;

Be aware that you cannot use fragment rules in your parser rules; they can only be referenced from other lexer rules!

捂风挽笑 2024-12-07 01:16:13

Regarding question 1:
That smells very much like an encoding problem. The byte sequence "61 62 E4 63 64" means that the file is encoded using ISO-8859-1 (or the similar Windows-1252 variant). ANTLRWorks seems to use UTF-8, and I see no obvious way to change that.

I assume you ran the debugger with that file as input. When I save the file as UTF-8, it works fine for me; with ISO-8859-1 the 'ä' is missing. I cannot reproduce the NoViableAlt error in ANTLRWorks 1.4.3; the 'ä' just seems to be missing from the input stream. Perhaps Java's UTF-8 decoder silently skips invalid sequences...
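To make the mismatch concrete, here is a small stdlib-only sketch that decodes the five bytes quoted above once as ISO-8859-1 and once as UTF-8. The class and method names (EncodingDemo, decode, hex) are made up for illustration. Note that Java's String constructor does not drop the malformed byte; it substitutes U+FFFD. That ANTLRWorks then discards this character because no lexer rule matches it is only an assumption.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // The bytes from the answer: 'a', 'b', 0xE4, 'c', 'd'.
    static final byte[] INPUT = {0x61, 0x62, (byte) 0xE4, 0x63, 0x64};

    // Decode raw bytes with an explicitly chosen charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    // Render each UTF-16 code unit as hex so the result does not
    // depend on the console's own encoding.
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(Integer.toHexString(s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // ISO-8859-1 maps byte 0xE4 directly to U+00E4 (ä).
        System.out.println(hex(decode(INPUT, StandardCharsets.ISO_8859_1))); // 61 62 e4 63 64
        // In UTF-8, 0xE4 starts a three-byte sequence; 0x63 is not a valid
        // continuation byte, so the decoder emits U+FFFD instead of ä.
        System.out.println(hex(decode(INPUT, StandardCharsets.UTF_8)));      // 61 62 fffd 63 64
    }
}
```

Neither CharLower nor Vowels matches U+FFFD, which would fit the observation that only abcd shows up in the result.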

If you build your own app, you can specify which encoding the input stream/file uses. In Python, ANTLRFileStream and ANTLRInputStream have a handy 'encoding' argument for exactly that purpose.
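For a Java application, a minimal sketch of the same idea is to decode the file yourself with an explicit charset and hand the resulting string to ANTLRStringStream, as Main.java above does with a literal. The class name ExplicitCharsetRead and the helper readAll are hypothetical, and the temporary file only stands in for a real input file. As far as I know, the ANTLR 3 Java runtime's ANTLRFileStream also accepts an encoding name as a second constructor argument, but treat that as an assumption and check it against your runtime version.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExplicitCharsetRead {
    // Read the whole file and decode it with the charset the file was saved in.
    static String readAll(Path file, Charset charset) throws IOException {
        return new String(Files.readAllBytes(file), charset);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a real input file saved as ISO-8859-1.
        Path file = Files.createTempFile("vowels", ".txt");
        try {
            Files.write(file, new byte[] {0x61, 0x62, (byte) 0xE4, 0x63, 0x64});
            String text = readAll(file, StandardCharsets.ISO_8859_1);
            System.out.println(text.equals("ab\u00E4cd")); // true
            // 'text' could now be passed to new ANTLRStringStream(text).
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

Decoding up front keeps the charset decision in one visible place instead of relying on the platform default.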
