解析模板语言

发布于 2024-12-04 10:17:20 字数 833 浏览 1 评论 0原文

我正在尝试解析模板语言,但无法正确解析标签之间出现的任意 html。到目前为止我所拥有的如下,有什么建议吗?有效输入的一个例子是

{foo}{#bar}blah blah blah{zed}{/bar}{>foo2}{#bar2}This Should Be Parsed as a Buffer.{/bar2}

语法是:

grammar g;

options {
  language=Java;
  output=AST;
  ASTLabelType=CommonTree;
}

/* LEXER RULES */
tokens {

}

LD  :    '{';
RD  :    '}';
LOOP    :    '#';  
END_LOOP:   '/';
PARTIAL :   '>';
fragment DIGIT  : '0'..'9';
fragment LETTER : ('a'..'z' | 'A'..'Z');
IDENT : (LETTER | '_') (LETTER | '_' | DIGIT)*;
BUFFER options {greedy=false;} : ~(LD | RD)+ ;

/* PARSER RULES */
start   : body EOF
;

body    : (tag | loop | partial | BUFFER)*
;

tag     : LD! IDENT^ RD!
;

loop    : LD! LOOP^ IDENT RD!
  body
  LD! END_LOOP! IDENT RD!
;

 partial : LD! PARTIAL^ IDENT RD!
;

buffer  : BUFFER 
;

I'm trying to parse a templating language and I'm having trouble correctly parsing the arbitrary html that can appear between tags. So far what I have is below, any suggestions? An example of a valid input would be

{foo}{#bar}blah blah blah{zed}{/bar}{>foo2}{#bar2}This Should Be Parsed as a Buffer.{/bar2}

And the grammar is:

grammar g;

options {
  language=Java;
  output=AST;
  ASTLabelType=CommonTree;
}

/* LEXER RULES */
tokens {

}

LD  :    '{';
RD  :    '}';
LOOP    :    '#';  
END_LOOP:   '/';
PARTIAL :   '>';
fragment DIGIT  : '0'..'9';
fragment LETTER : ('a'..'z' | 'A'..'Z');
IDENT : (LETTER | '_') (LETTER | '_' | DIGIT)*;
BUFFER options {greedy=false;} : ~(LD | RD)+ ;

/* PARSER RULES */
start   : body EOF
;

body    : (tag | loop | partial | BUFFER)*
;

tag     : LD! IDENT^ RD!
;

loop    : LD! LOOP^ IDENT RD!
  body
  LD! END_LOOP! IDENT RD!
;

 partial : LD! PARTIAL^ IDENT RD!
;

buffer  : BUFFER 
;

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

用心笑 2024-12-11 10:17:20

您的词法分析器独立于解析器进行标记。如果您的解析器尝试匹配 BUFFER 标记,则词法分析器不会考虑此信息。在您的情况下,输入如下:“blah blah blah”,词法分析器创建 3 个 IDENT 令牌,而不是单个 BUFFER 令牌。

您需要“告诉”词法分析器的是,当您位于标签内时(即遇到 LD 标签),应创建一个 IDENT 令牌,并且当如果您位于标签之外(即遇到 RD 标签),则应创建 BUFFER 令牌而不是 IDENT 令牌。

为了实现这一点,您需要:

  1. 在词法分析器内创建一个 boolean 标志,用于跟踪您位于标签内或标签外的事实。这可以在语法的 @lexer::members { ... } 部分内完成;
  2. 在词法分析器创建 LD- 或 RD-token 后,翻转 (1) 中的 boolean 标志。这可以在词法分析器规则的 @after{ ... } 部分中完成;
  3. 在词法分析器内创建 BUFFER 标记之前,请检查当前是否位于标记之外。这可以通过使用语义谓词 在词法分析器规则的开头。

一个简短的演示:

grammar g;

options { 
  output=AST;
  ASTLabelType=CommonTree; 
}

@lexer::members {
  private boolean insideTag = false;
}

start   
  :  body EOF -> body
  ;

body
  :  (tag | loop | partial | BUFFER)*
  ;

tag
  :  LD IDENT RD -> IDENT
  ;

loop    
  :  LD LOOP IDENT RD body LD END_LOOP IDENT RD -> ^(LOOP body IDENT IDENT)
  ;

partial 
  :  LD PARTIAL IDENT RD -> ^(PARTIAL IDENT)
  ;

LD @after{insideTag=true;}  : '{';
RD @after{insideTag=false;} : '}';

LOOP     : '#';  
END_LOOP : '/';
PARTIAL  : '>';
SPACE    : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};
IDENT    :  (LETTER | '_') (LETTER | '_' | DIGIT)*;
BUFFER   : {!insideTag}?=> ~(LD | RD)+;

fragment DIGIT  : '0'..'9';
fragment LETTER : ('a'..'z' | 'A'..'Z');

(请注意,您可能想丢弃标记之间的空格,因此我添加了 SPACE 规则并丢弃了这些空格)

以下类进行测试:

import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;

public class Main {
  public static void main(String[] args) throws Exception {
    String src = "{foo}{#bar}blah blah blah{zed}{/bar}{>foo2}{#bar2}" + 
                 "This Should Be Parsed as a Buffer.{/bar2}";
    gLexer lexer = new gLexer(new ANTLRStringStream(src));
    gParser parser = new gParser(new CommonTokenStream(lexer));
    CommonTree tree = (CommonTree)parser.start().getTree();
    DOTTreeGenerator gen = new DOTTreeGenerator();
    StringTemplate st = gen.toDOT(tree);
    System.out.println(st);
  }
}

使用 运行主类:

*nix/MacOS

java -cp antlr-3.3.jar org.antlr.Tool g.g 
javac -cp antlr-3.3.jar *.java
java -cp .:antlr-3.3.jar Main

Windows

java -cp antlr-3.3.jar org.antlr.Tool g.g 
javac -cp antlr-3.3.jar *.java
java -cp .;antlr-3.3.jar Main

您将看到一些 DOT 源代码被打印到控制台,它对应于以下 AST:

在此输入图像描述

(使用 graphviz-dev.appspot.com 创建的图像)

Your lexer tokenizes independently from your parser. If your parser tries to match a BUFFER token, the lexer does not take this info into account. In your case with input like: "blah blah blah", the lexer creates 3 IDENT tokens, not a single BUFFER token.

What you need to "tell" your lexer is that when you're inside a tag (i.e. you encountered a LD tag), a IDENT token should be created, and when you're outside a tag (i.e. you encountered a RD tag), a BUFFER token should be created instead of an IDENT token.

In order to implement this, you need to:

  1. create a boolean flag inside the lexer that keeps track of the fact that you're in- or outside a tag. This can be done inside the @lexer::members { ... } section of your grammar;
  2. after the lexer either creates a LD- or RD-token, flip the boolean flag from (1). This can be done in the @after{ ... } section of the lexer rules;
  3. before creating a BUFFER token inside the lexer, check if you're outside a tag at the moment. This can be done by using a semantic predicate at the start of your lexer rule.

A short demo:

grammar g;

options { 
  output=AST;
  ASTLabelType=CommonTree; 
}

@lexer::members {
  private boolean insideTag = false;
}

start   
  :  body EOF -> body
  ;

body
  :  (tag | loop | partial | BUFFER)*
  ;

tag
  :  LD IDENT RD -> IDENT
  ;

loop    
  :  LD LOOP IDENT RD body LD END_LOOP IDENT RD -> ^(LOOP body IDENT IDENT)
  ;

partial 
  :  LD PARTIAL IDENT RD -> ^(PARTIAL IDENT)
  ;

LD @after{insideTag=true;}  : '{';
RD @after{insideTag=false;} : '}';

LOOP     : '#';  
END_LOOP : '/';
PARTIAL  : '>';
SPACE    : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};
IDENT    :  (LETTER | '_') (LETTER | '_' | DIGIT)*;
BUFFER   : {!insideTag}?=> ~(LD | RD)+;

fragment DIGIT  : '0'..'9';
fragment LETTER : ('a'..'z' | 'A'..'Z');

(note that you probably want to discard spaces between tag, so I added a SPACE rule and discarded these spaces)

Test it with the following class:

import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;

public class Main {
  public static void main(String[] args) throws Exception {
    String src = "{foo}{#bar}blah blah blah{zed}{/bar}{>foo2}{#bar2}" + 
                 "This Should Be Parsed as a Buffer.{/bar2}";
    gLexer lexer = new gLexer(new ANTLRStringStream(src));
    gParser parser = new gParser(new CommonTokenStream(lexer));
    CommonTree tree = (CommonTree)parser.start().getTree();
    DOTTreeGenerator gen = new DOTTreeGenerator();
    StringTemplate st = gen.toDOT(tree);
    System.out.println(st);
  }
}

and after running the main class:

*nix/MacOS

java -cp antlr-3.3.jar org.antlr.Tool g.g 
javac -cp antlr-3.3.jar *.java
java -cp .:antlr-3.3.jar Main

Windows

java -cp antlr-3.3.jar org.antlr.Tool g.g 
javac -cp antlr-3.3.jar *.java
java -cp .;antlr-3.3.jar Main

You'll see some DOT-source being printed to the console, which corresponds to the following AST:

enter image description here

(image created using graphviz-dev.appspot.com)

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文