MismatchedTokenException in an HTML subset grammar
I am writing an ANTLR grammar to recognize HTML block-level elements within plain text. Here is a relevant snippet, limited to the div tag:
grammar Test;

blockElement
  : div
  ;

div
  : '<' D I V HTML_ATTRIBUTES? '>' (blockElement | TEXT)* '</' D I V '>'
  ;

D : ('d' | 'D') ;
I : ('i' | 'I') ;
V : ('v' | 'V') ;

HTML_ATTRIBUTES
  : WS (~( '<' | '\n' | '\r' | '"' | '>' ))+
  ;

TEXT
  : (. | '\r' | '\n')
  ;

fragment WS
  : (' ' | '\t')
  ;
The TEXT token is supposed to represent anything that is not a block-level element, such as plain text or inline tags (e.g. <b></b>). When I test it on nested block elements, like:
<div level_0><div level_1></div></div>
it parses them correctly. However, as soon as I add some random text, it throws a MismatchedTokenException(0!=0) right after consuming the first TEXT token, e.g. the capital T in:
<div level_0>This is some random text</div>
Any suggestions? Am I doing something conceptually wrong? I am using ANTLR v. 3.2 and doing the testing with ANTLRWorks v. 1.4.
Thank you
Comments (1)
I recommend not testing your grammar with ANTLRWorks: error messages in its console are easily missed, so it may not interpret your test input the way you expect. Test the grammar with a small custom class (called Main here) instead.
Now, the following rule is not correct:
TEXT
  : (. | '\r' | '\n')
  ;
The . already matches both \r and \n, so it should be:
TEXT
  : .
  ;
After changing that, you can generate the parser and lexer, compile all .java files and run the Main class.
Running it produces an error, because the rule I : ('i' | 'I') ; tokenizes the i from This as an I token, not as TEXT.
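To make the clash between I and TEXT concrete, here is a minimal Java sketch. It does not use ANTLR at all: it is a simplified model that assumes every character becomes exactly one token and that the first matching rule wins, with TEXT as the catch-all listed last. The class name FirstMatchLexerSketch and its methods are hypothetical, introduced only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of the lexer rules D, I, V and TEXT from the grammar.
// Assumption: one character per token, and the first rule that matches wins.
// This is NOT ANTLR's real prediction algorithm, only an illustration.
class FirstMatchLexerSketch {

    // Returns the name of the first rule that matches the character.
    static String ruleFor(char c) {
        switch (Character.toLowerCase(c)) {
            case 'd': return "D";
            case 'i': return "I";
            case 'v': return "V";
            default:  return "TEXT"; // catch-all rule, defined after D, I and V
        }
    }

    // Turns every character of the input into a "RULE(char)" label.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        for (char c : input.toCharArray()) {
            tokens.add(ruleFor(c) + "(" + c + ")");
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [TEXT(T), TEXT(h), I(i), TEXT(s)]
        System.out.println(tokenize("This"));
    }
}
```

In this model the parser's (blockElement | TEXT)* loop would accept TEXT(T) and TEXT(h), then meet an I token that neither alternative allows, which is consistent with a mismatched-token error inside the text.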
There are more problems with your current approach:
HTML_ATTRIBUTES does too much: you should instead have ATTRIBUTE, = and VALUE rules and then move the plural (the list of HTML attributes) into your parser;
the rule also excludes < and >, which is incorrect (attributes can contain them, although it is not recommended).
I'd start over if I were you. If you want, I'm willing to propose a start: just say so.
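As a rough illustration of splitting attributes into ATTRIBUTE, = and VALUE pieces, here is a small Java sketch. It is not an ANTLR grammar and not the answerer's proposal: the class AttributeSplitSketch, its regular expression, and the restriction to double-quoted values are all assumptions made for this example. It only shows that, once values are delimited by quotes, < and > inside them cause no trouble.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: splits an attribute string into name, '=' and value.
// Assumption: every attribute has the form name="value" with double quotes.
class AttributeSplitSketch {

    // Group 1 is the attribute name, group 2 the text between the quotes.
    private static final Pattern ATTR =
        Pattern.compile("([A-Za-z_][A-Za-z0-9_-]*)\\s*=\\s*\"([^\"]*)\"");

    static List<String> split(String attributes) {
        List<String> tokens = new ArrayList<>();
        Matcher m = ATTR.matcher(attributes);
        while (m.find()) {
            tokens.add("ATTRIBUTE(" + m.group(1) + ")");
            tokens.add("=");
            tokens.add("VALUE(" + m.group(2) + ")");
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [ATTRIBUTE(id), =, VALUE(main), ATTRIBUTE(title), =, VALUE(a < b > c)]
        System.out.println(split(" id=\"main\" title=\"a < b > c\""));
    }
}
```

The list of tokens per tag corresponds to the "plural" that the answer suggests handling in the parser, for example with a repeated attribute rule, instead of in one large lexer rule.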