Parsing a templating language
I'm trying to parse a templating language, and I'm having trouble correctly parsing the arbitrary HTML that can appear between tags. What I have so far is below. Any suggestions? An example of a valid input would be:
{foo}{#bar}blah blah blah{zed}{/bar}{>foo2}{#bar2}This Should Be Parsed as a Buffer.{/bar2}
And the grammar is:
grammar g;
options {
  language=Java;
  output=AST;
  ASTLabelType=CommonTree;
}
/* LEXER RULES */
tokens {
}
LD       : '{';
RD       : '}';
LOOP     : '#';
END_LOOP : '/';
PARTIAL  : '>';
fragment DIGIT  : '0'..'9';
fragment LETTER : ('a'..'z' | 'A'..'Z');
IDENT    : (LETTER | '_') (LETTER | '_' | DIGIT)*;
BUFFER options {greedy=false;} : ~(LD | RD)+ ;
/* PARSER RULES */
start   : body EOF
        ;
body    : (tag | loop | partial | BUFFER)*
        ;
tag     : LD! IDENT^ RD!
        ;
loop    : LD! LOOP^ IDENT RD!
          body
          LD! END_LOOP! IDENT RD!
        ;
partial : LD! PARTIAL^ IDENT RD!
        ;
buffer  : BUFFER
        ;
Comments (1)
Your lexer tokenizes independently from your parser. If your parser tries to match a BUFFER token, the lexer does not take this into account. In your case, with input like "blah blah blah", the lexer creates 3 IDENT tokens, not a single BUFFER token.
What you need to "tell" your lexer is that when you're inside a tag (i.e. you have encountered an LD token), an IDENT token should be created, and when you're outside a tag (i.e. you have encountered an RD token), a BUFFER token should be created instead of an IDENT token.
In order to implement this, you need to:
1. create a boolean flag inside the lexer that keeps track of whether you're inside or outside a tag. This can be done inside the @lexer::members { ... } section of your grammar;
2. after the lexer creates either an LD or an RD token, flip the boolean flag from (1). This can be done in the @after{ ... } section of the lexer rules;
3. before creating a BUFFER token inside the lexer, check whether you're currently outside a tag. This can be done by using a semantic predicate at the start of your lexer rule.
A short demo:
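The three steps can also be illustrated outside ANTLR with a small hand-written tokenizer in plain Java. This is only a sketch of the idea, not the lexer ANTLR would generate: the class TagAwareLexer, its nested Token and Type types and the tokenize method are hypothetical names, and skipping spaces inside a tag is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hand-written stand-in for the lexer; every name in this class is hypothetical. */
final class TagAwareLexer {

    /** Token kinds mirroring the lexer rules of the grammar in the question. */
    enum Type { LD, RD, LOOP, END_LOOP, PARTIAL, IDENT, BUFFER }

    /** A token kind plus the text it matched. */
    static final class Token {
        final Type type;
        final String text;
        Token(Type type, String text) { this.type = type; this.text = text; }
        @Override public String toString() { return type + "(" + text + ")"; }
    }

    // Step 1: a flag that records whether we are between '{' and '}'.
    private boolean insideTag = false;

    List<Token> tokenize(String input) {
        insideTag = false; // start outside any tag
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '{') {
                tokens.add(new Token(Type.LD, "{"));
                insideTag = true;  // Step 2: flip the flag after an LD token
                i++;
            } else if (c == '}') {
                tokens.add(new Token(Type.RD, "}"));
                insideTag = false; // Step 2: flip the flag after an RD token
                i++;
            } else if (!insideTag) {
                // Step 3: outside a tag, everything up to the next brace is one BUFFER.
                int start = i;
                while (i < input.length() && input.charAt(i) != '{' && input.charAt(i) != '}') {
                    i++;
                }
                tokens.add(new Token(Type.BUFFER, input.substring(start, i)));
            } else if (c == ' ') {
                i++; // assumption: spaces inside a tag are discarded
            } else if (c == '#') {
                tokens.add(new Token(Type.LOOP, "#"));
                i++;
            } else if (c == '/') {
                tokens.add(new Token(Type.END_LOOP, "/"));
                i++;
            } else if (c == '>') {
                tokens.add(new Token(Type.PARTIAL, ">"));
                i++;
            } else if (isIdentStart(c)) {
                // IDENT : (LETTER | '_') (LETTER | '_' | DIGIT)*
                int start = i;
                i++;
                while (i < input.length() && isIdentPart(input.charAt(i))) {
                    i++;
                }
                tokens.add(new Token(Type.IDENT, input.substring(start, i)));
            } else {
                throw new IllegalArgumentException(
                        "Unexpected character '" + c + "' inside a tag at index " + i);
            }
        }
        return tokens;
    }

    private static boolean isIdentStart(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
    }

    private static boolean isIdentPart(char c) {
        return isIdentStart(c) || (c >= '0' && c <= '9');
    }

    public static void main(String[] args) {
        String input = "{foo}{#bar}blah blah blah{zed}{/bar}";
        for (Token t : new TagAwareLexer().tokenize(input)) {
            System.out.println(t);
        }
    }
}
```

With the flag in place, the text "blah blah blah" between {#bar} and {zed} comes out as a single BUFFER token, while the names inside the braces are still tokenized as IDENT. In the ANTLR grammar the same check would live in a semantic predicate on the BUFFER rule rather than in an if/else chain.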
(Note that you probably want to discard spaces between tags, so I added a SPACE rule and discarded these spaces.)
Test it with the following class:
and after running the main class:
*nix/MacOS
Windows
You'll see some DOT-source being printed to the console, which corresponds to the following AST:
(image created using graphviz-dev.appspot.com)