Lexer and Unicode, e.g. German umlauts
Two questions:

1. Why is the string abäcd not recognized (ANTLRWorks 1.4.2) with the grammar below? The result is only abcd, that is, the German umlaut ä is missing.
2. How can I split Vowels into VowelsUpper and VowelsLower and use both rules in the rule Vowels?

    grammar Vowels1a;

    CharLower
      : 'a'..'z'
      ;

    Vowels
      : 'ä' | 'ö' | 'ü' | 'Ä' | 'Ö' | 'Ü'
      ;

    test
      : ( CharLower | Vowels )+
      ;
Answers (2)
I could not reproduce this. Both the interpreter and the debugger of ANTLRWorks 1.4.2 produce a parse tree for this input.

[Figure: parse tree produced by ANTLRWorks]

A small manual test, using a Main.java driver class together with the grammar file Vowels1a.g, also shows this.

For the second question: create two fragment rules (VowelsUpper and VowelsLower) and let Vowels match both of these fragments. Be aware that you cannot use fragment rules in your parser rules, only from other lexer rules!
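The grammar listing that belonged to this answer did not survive, so the following is only a sketch of what the fragment split might look like in ANTLR v3 syntax. The grammar name Vowels1b is made up for illustration; the rule names come from the question.

```antlr
// Hypothetical grammar name; everything else follows the question's grammar.
grammar Vowels1b;

// Parser rule: may reference the tokens CharLower and Vowels,
// but not the fragment rules below.
test
  : ( CharLower | Vowels )+
  ;

CharLower
  : 'a'..'z'
  ;

// The only vowel token the parser ever sees.
Vowels
  : VowelsUpper
  | VowelsLower
  ;

// Fragment rules never produce tokens on their own;
// they are helpers for other lexer rules.
fragment VowelsUpper
  : 'Ä' | 'Ö' | 'Ü'
  ;

fragment VowelsLower
  : 'ä' | 'ö' | 'ü'
  ;
```

With this layout the lexer still emits a single Vowels token type, which is why the parser rule test can stay unchanged.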
Regarding question 1:

That smells very much like an encoding problem. The byte sequence "61 62 E4 63 64" means that the file is encoded in ISO-8859-1 (or a similar Windows code page such as Windows-1252), where ä is the single byte E4. ANTLRWorks seems to use UTF-8, and I see no obvious way to change that.

I assume you ran the debugger with that file as input. When I save the file as UTF-8, it works fine for me; with ISO-8859-1 the 'ä' is missing. I cannot reproduce the NoViableAlt error in ANTLRWorks 1.4.3; the 'ä' just seems to be missing from the input stream. Perhaps Java's UTF-8 decoder silently skips invalid sequences...

If you build your own app, you can specify which encoding the input stream or file uses. In Python, ANTLRFileStream and ANTLRInputStream have a handy 'encoding' argument for that purpose.
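To make the encoding mismatch concrete, here is a small JDK-only sketch that decodes the five bytes quoted above in both ways. The class and method names (EncodingDemo, decodeLatin1, decodeUtf8) are made up for illustration, and it only shows what the plain String constructor does; how ANTLRWorks itself reads the file is not verified here.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of ANTLR or ANTLRWorks.
final class EncodingDemo {
    // The bytes quoted in the answer: a, b, a-umlaut, c, d as saved in ISO-8859-1.
    static final byte[] LATIN1_BYTES = {0x61, 0x62, (byte) 0xE4, 0x63, 0x64};

    // Decoding with the encoding the file was actually written in.
    static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Decoding the same bytes as UTF-8: a lone E4 byte is malformed input,
    // which the String constructor replaces with U+FFFD.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String latin1 = decodeLatin1(LATIN1_BYTES);
        String utf8 = decodeUtf8(LATIN1_BYTES);
        // Does each decoded string still contain the a-umlaut (U+00E4)?
        System.out.println("ISO-8859-1 keeps umlaut: " + (latin1.indexOf('\u00E4') >= 0));
        System.out.println("UTF-8 keeps umlaut:      " + (utf8.indexOf('\u00E4') >= 0));
    }
}
```

In UTF-8 the ä would be the two bytes C3 A4, so a file containing a bare E4 cannot decode to ä under that encoding. If you drive the generated lexer from your own Java code, the ANTLR 3 Java runtime should, as far as I know, also accept an encoding name (for example as a second constructor argument to ANTLRFileStream), but check that against the runtime version you use.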