Matching lexeme variants with Antlr3
I'm trying to match measurements in English input text, using Antlr 3.2 and Java 1.6. I've got lexical rules like the following:
fragment
MILLIMETRE
: 'millimetre' | 'millimetres'
| 'millimeter' | 'millimeters'
| 'mm'
;
MEASUREMENT
: MILLIMETRE | CENTIMETRE | ... ;
I'd like to be able to accept any combination of upper- and lowercase input and - more importantly - just return a single lexical token for all the variants of MILLIMETRE. But at the moment, my AST contains 'millimetre', 'millimeters', 'mm' etc. just as in the input text.
After reading http://www.antlr.org/wiki/pages/viewpage.action?pageId=1802308, I think I need to do something like the following:
tokens {
T_MILLIMETRE;
}
fragment
MILLIMETRE
: ('millimetre' | 'millimetres'
| 'millimeter' | 'millimeters'
| 'mm') { $type = T_MILLIMETRE; }
;
However, when I do this, I get the following compiler errors in the Java code generated by Antlr:
cannot find symbol
_type = T_MILLIMETRE;
I tried the following instead:
MEASUREMENT
: MILLIMETRE { $type = T_MILLIMETRE; }
| ...
but then MEASUREMENT is not matched anymore.
The more obvious solution with a rewrite rule:
MEASUREMENT
: MILLIMETRE -> ^(T_MILLIMETRE MILLIMETRE)
| ...
causes an NPE:
java.lang.NullPointerException at org.antlr.grammar.v2.DefineGrammarItemsWalker.alternative(DefineGrammarItemsWalker.java:1555).
Making MEASUREMENT into a parser rule gives me the dreaded "The following token definitions can never be matched because prior tokens match the same input" error.
By creating a parser rule
measurement : T_MILLIMETRE | ...
I get the warning "no lexer rule corresponding to token: T_MILLIMETRE". Antlr still runs, but it gives me the input text in the AST and not T_MILLIMETRE.
I'm obviously not yet seeing the world the way Antlr does. Can anyone give me any hints or advice please?
Steve
Comments (2)
Here's a way to do that:
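A minimal sketch of this kind of grammar, offered as an illustration rather than as the exact grammar from this answer: all spellings are folded into one non-fragment lexer rule built from per-letter fragments, so any mix of upper- and lowercase yields the same token type. The grammar name `Measurements` and the rules `parse`, `measurement`, `NUMBER` and `SPACE` are assumptions made for the example, and the sketch assumes the input contains only numbers, millimetre units and whitespace.

```
grammar Measurements;

options {
  output = AST;   // build an AST so the token types show up as tree nodes
}

// Hypothetical entry point: one or more measurements, then end of input.
parse
  : measurement+ EOF!
  ;

// The unit token becomes the root and the number its child: ^(MILLIMETRE NUMBER)
measurement
  : NUMBER MILLIMETRE^
  ;

// One token type for every spelling, in any mix of upper- and lowercase.
MILLIMETRE
  : M M
  | M I L L I M E T (R E | E R) S?
  ;

NUMBER
  : ('0'..'9')+
  ;

// Whitespace is matched and then discarded.
SPACE
  : (' ' | '\t' | '\r' | '\n')+ { skip(); }
  ;

// Per-letter helpers: fragment rules never become tokens on their own.
fragment M : 'm' | 'M' ;
fragment I : 'i' | 'I' ;
fragment L : 'l' | 'L' ;
fragment E : 'e' | 'E' ;
fragment T : 't' | 'T' ;
fragment R : 'r' | 'R' ;
fragment S : 's' | 'S' ;
```

With a layout like this, the parser only ever sees `MILLIMETRE` tokens; the original spelling is still available as the token's text if it is needed later.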
It can be tested with the following class:
which produces the following DOT file:
which corresponds to the tree:
(image created by http://graph.gafol.net/)
EDIT
Note that the following:
will always print true, regardless of whether the "contents" of the (millimeter) tokens are mm, millimetre, millimetres, ...
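The same point can be sketched as a parser rule with an embedded action. This is an illustration only: it assumes a lexer rule named `MILLIMETRE` that matches every variant, and the `NUMBER` rule and the label `unit` are hypothetical names.

```
measurement
  : NUMBER unit=MILLIMETRE
    {
      // $unit.text is whatever the input contained ("mm", "Millimetres", ...),
      // while $unit.type is the same token-type constant for every variant,
      // so the comparison below holds for each of them.
      System.out.println($unit.text + " -> " + ($unit.type == MILLIMETRE));
    }
  ;
```

The token type is what the parser and the tree compare against; the text is only carried along as a payload.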
Note that fragment rules only "live" inside the lexer and cease to exist in the parser. For example:
with input text:
will print:
In other words: the type MILLIMETRE does not exist; they're all of type MEASUREMENT.
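To make the distinction concrete, here is a hedged sketch of the two layouts, shown as separate fragments because they cannot coexist in one grammar. Only the `MILLIMETRE` alternatives from the question are used; the other units are elided there and are left out here as well.

```
// Layout from the question: MILLIMETRE is only a lexer helper, so the parser
// never sees it. Every match is handed over as a MEASUREMENT token.
fragment
MILLIMETRE
  : 'millimetre' | 'millimetres'
  | 'millimeter' | 'millimeters'
  | 'mm'
  ;

MEASUREMENT
  : MILLIMETRE
  ;
```

```
// Possible alternative (an assumption, not taken from the answer): drop
// 'fragment' and the MEASUREMENT lexer rule, and group the units in a
// parser rule instead. MILLIMETRE is then a real token type.
MILLIMETRE
  : 'millimetre' | 'millimetres'
  | 'millimeter' | 'millimeters'
  | 'mm'
  ;

measurement
  : MILLIMETRE   // further unit tokens would be added as alternatives
  ;
```

In the second layout the parser and the AST can tell the units apart by token type, which is not possible when everything arrives as `MEASUREMENT`.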