Parsing unstructured text with ANTLR

Posted on 12-09 09:56

As an example, let's say I want to parse mostly unstructured text with a single markup element, the double star **. This is my ANTLR grammar:

text : (plain | tag)+ ;
plain : ~(TAG) ;

tag : TAG tag_inner TAG ;
tag_inner : ~(TAG) ;

TAG : '**' ;
TEXT : ('a'..'z' | ' ' | '.')+ ;

This grammar works just fine if the text I'm parsing is syntactically correct, that is, if for every opening ** there is a closing **. If there is an odd number of **s, ANTLR complains and errors out.

How would one fix this, so that ANTLR looks ahead for a closing double star and, if there is none, treats that lone double star as plain text? I'm pretty sure ANTLR can do this and that syntactic/semantic predicates are the answer, but after an hour spent reading the docs, I still can't work it out.

浅语花开 2024-12-16 09:56:12

This will get messy when you expand your grammar! :)

But, sure, it is possible using predicates. Here's a demo:

T.g

grammar T;

options {
  output=AST;
}

tokens {
  ROOT;
  PROPER_TAG;
}

parse
  :  text+ EOF -> ^(ROOT text+)
  ;

text
  :  (tag)=> tag // syntactic predicate here! (the `(...)=>`)
  |  plain
  |  TAG
  ;

plain
  :  ~TAG 
  ;

tag
  :  TAG plain TAG -> ^(PROPER_TAG plain)
  ;

TAG  : '**' ;
TEXT : ('a'..'z' | ' ' | '.')+ ;

Main.java

import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;

public class Main {
  public static void main(String[] args) throws Exception {
    // lex and parse the demo input; note the unmatched ** near the end
    TLexer lexer = new TLexer(new ANTLRStringStream("this **is** just **a simple** demo **."));
    TParser parser = new TParser(new CommonTokenStream(lexer));
    // parse is the start rule of the grammar; getTree() returns the root of the AST
    CommonTree tree = (CommonTree)parser.parse().getTree();
    // render the AST as Graphviz DOT and print it
    DOTTreeGenerator gen = new DOTTreeGenerator();
    StringTemplate st = gen.toDOT(tree);
    System.out.println(st);
  }
}

Run the demo (on Windows, use ; instead of : as the classpath separator):

java -cp antlr-3.3.jar org.antlr.Tool T.g
javac -cp antlr-3.3.jar *.java
java -cp .:antlr-3.3.jar Main

which will produce DOT output that corresponds to the following AST:

[image: AST of the demo input]

(image created using graphviz-dev.appspot.com)
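
To make it clearer what the syntactic predicate is doing, here is a rough hand-written sketch of the same decision in plain Java, without ANTLR. It is not the code ANTLR generates. The class name PredicateSketch, its methods and the bracketed output format are made up for illustration, and unlike the TEXT rule in the grammar, the tokenizer here treats every character that is not part of a ** as text. The idea is the same, though: at each **, look ahead for "one non-TAG token followed by **", and only then commit to a proper tag; otherwise the ** is kept as an ordinary token.

```java
import java.util.ArrayList;
import java.util.List;

public class PredicateSketch {
    static final String TAG = "**";

    // Split the input into TAG ("**") tokens and runs of everything else.
    // Assumption: any non-TAG character counts as text (the grammar's TEXT rule is stricter).
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            if (input.startsWith(TAG, i)) {
                if (text.length() > 0) {
                    tokens.add(text.toString());
                    text.setLength(0);
                }
                tokens.add(TAG);
                i += TAG.length();
            } else {
                text.append(input.charAt(i));
                i++;
            }
        }
        if (text.length() > 0) {
            tokens.add(text.toString());
        }
        return tokens;
    }

    // Plays the role of the (tag)=> predicate: does "TAG plain TAG" start at index i?
    static boolean looksLikeTag(List<String> t, int i) {
        return i + 2 < t.size()
            && t.get(i).equals(TAG)
            && !t.get(i + 1).equals(TAG)
            && t.get(i + 2).equals(TAG);
    }

    // Build a LISP-style rendering of the tree, with token text in square brackets.
    static String parse(String input) {
        List<String> t = tokenize(input);
        StringBuilder out = new StringBuilder("(ROOT");
        int i = 0;
        while (i < t.size()) {
            if (looksLikeTag(t, i)) {
                // lookahead succeeded: consume TAG plain TAG as one proper tag
                out.append(" (PROPER_TAG [").append(t.get(i + 1)).append("])");
                i += 3;
            } else {
                // plain text, or a lone ** that has no closing partner
                out.append(" [").append(t.get(i)).append("]");
                i++;
            }
        }
        return out.append(")").toString();
    }

    public static void main(String[] args) {
        System.out.println(parse("this **is** just **a simple** demo **."));
    }
}
```

For the demo input this prints (ROOT [this ] (PROPER_TAG [is]) [ just ] (PROPER_TAG [a simple]) [ demo ] [**] [.]), so the last ** ends up as an ordinary child of ROOT instead of causing an error. A fixed three-token lookahead is enough here only because tag allows exactly one plain token between the stars; once the grammar grows, the lookahead would have to scan further, which is where the predicate (and the mess) comes in.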
