Parsing unstructured text with ANTLR
As an example, let's say I want to parse mostly unstructured text with a single markup element, the double star **. This is my ANTLR grammar:
text : (plain | tag)+ ;
plain : ~(TAG) ;
tag : TAG tag_inner TAG ;
tag_inner : ~(TAG) ;
TAG : '**' ;
TEXT : ('a'..'z' | ' ' | '.')+ ;
This grammar works just fine if the text I'm parsing is syntactically correct, that is, if for every opening ** there is a closing **. If there is an odd number of **s, ANTLR complains and errors out.
How would one fix this, so that ANTLR looks ahead for a closing double star and, if there is none, treats that lone double star as plain text? I'm pretty sure ANTLR can do this and that syntactic/semantic predicates are the answer, but after an hour spent reading the docs, I still can't work it out.
This will get messy when you expand your grammar! :)
But, sure, it is possible using predicates. Here's a demo:
T.g
Main.java
Run the demo
which will produce some DOT-output that corresponds to the following AST:
(image created using graphviz-dev.appspot.com)
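For readers who just want to see the lookahead decision itself, here is a minimal sketch in plain Java, without ANTLR. It is not the T.g / Main.java demo referred to above. It only illustrates the check a predicate would have to perform: treat ** as an opening tag only if another ** follows somewhere later, otherwise emit it as plain text. The class name LoneStarDemo, the method parse and the PLAIN(...)/TAG(...) output labels are made up for illustration. Unlike the grammar in the question, this sketch accepts any text between the two markers, not just a single TEXT token.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of the ANTLR demo.
public class LoneStarDemo {
    static final String TAG = "**";

    // Splits the input into PLAIN(...) and TAG(...) pieces.
    // A "**" only opens a tag if a closing "**" exists further ahead.
    static List<String> parse(String input) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            int open = input.indexOf(TAG, pos);
            if (open < 0) {
                // No more markers: the rest is plain text.
                out.add("PLAIN(" + input.substring(pos) + ")");
                break;
            }
            if (open > pos) {
                // Plain text before the marker.
                out.add("PLAIN(" + input.substring(pos, open) + ")");
            }
            // Lookahead: is there a closing marker after the opening one?
            int close = input.indexOf(TAG, open + TAG.length());
            if (close >= 0) {
                out.add("TAG(" + input.substring(open + TAG.length(), close) + ")");
                pos = close + TAG.length();
            } else {
                // Lone marker: fall back to treating it as plain text.
                out.add("PLAIN(" + TAG + ")");
                pos = open + TAG.length();
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(parse("a **b** c ** d"));
    }
}
```

Running main prints [PLAIN(a ), TAG(b), PLAIN( c ), PLAIN(**), PLAIN( d)]: the first pair is recognised as a tag, and the unmatched third ** falls through as plain text. In an ANTLR grammar the same decision would sit in a predicate in front of the tag alternative.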