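Working example of wikitext-to-HTML in ANTLR 3
I'm trying to flesh out a wikitext-to-HTML translator in ANTLR 3, but I keep getting stuck.
Do you know of a working example that I can inspect? I tried the MediaWiki ANTLR grammar and the Wiki Creole grammar, but I can't get either of them to generate the lexer and parser in ANTLR 3.
Here are the links to the two grammars I've tried using:
- http://www.mediawiki.org/wiki/Markup_spec/ANTLR
- http://www.wikicreole.org/wiki/EBNFGrammarForCreole1.0
I can't get either of these to generate my Java lexer and parser (I'm using ANTLR 3 as an Eclipse plugin). The MediaWiki grammar takes a very long time to build and then at some point throws an OutOfMemory exception. The other one has errors in it that I don't know how to debug.
EDIT: Okay, I've got a very basic grammar: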
grammar wikitext;
options {
//output = AST;
//ASTLabelType = CommonTree;
output = template;
language = Java;
}
document: line (NL line?)*;
line: horizontal_line | list | heading | paragraph;
/* horizontal line */
horizontal_line: HRLINE;
/* lists */
list: unordered_list | ordered_list;
unordered_list: '*'+ content;
ordered_list: '#'+ content;
/* Headings */
heading: heading1 | heading2 | heading3 | heading4 | heading5 | heading6;
heading1: H1 plain H1;
heading2: H2 plain H2;
heading3: H3 plain H3;
heading4: H4 plain H4;
heading5: H5 plain H5;
heading6: H6 plain H6;
/* Paragraph */
paragraph: content;
content: (formatted | link)+;
/* links */
link: external_link | internal_link;
external_link: '[' external_link_uri ('|' external_link_title)? ']';
internal_link: '[[' internal_link_ref ('|' internal_link_title)? ']]' ;
external_link_uri: CHARACTER+;
external_link_title: plain;
internal_link_ref: plain;
internal_link_title: plain;
/* bold & italic */
formatted: bold_italic | bold | italic | plain;
bold_italic: BOLD_ITALIC plain BOLD_ITALIC;
bold: BOLD plain BOLD;
italic: ITALIC plain ITALIC;
/* Plain text */
plain: (CHARACTER | SPACE)+;
/**
* LEXER RULES
* --------------------------------------------------------------------------
*/
HRLINE: '---' '-'+;
H1: '=';
H2: '==';
H3: '===';
H4: '====';
H5: '=====';
H6: '======';
BOLD_ITALIC: '\'\'\'\'\'';
BOLD: '\'\'\'';
ITALIC: '\'\'';
NL: '\r'?'\n';
CHARACTER : '!' | '"' | '#' | '$' | '%' | '&'
| '*' | '+' | ',' | '-' | '.' | '/'
| ':' | ';' | '?' | '@' | '\\' | '^' | '_' | '`' | '~'
| '0'..'9' | 'A'..'Z' |'a'..'z'
| '\u0080'..'\u7fff'
| '(' | ')'
| '\'' | '<' | '>' | '=' | '[' | ']' | '|'
;
SPACE: ' ' | '\t';
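It's not clear to me, though, how one would go about outputting HTML. I've been looking into StringTemplate, but I don't understand how to structure my templates. Specifically, which template goes where in the grammar? Can you help me with a short example?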
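Okay, after your edit, I have a couple of recommendations.
As I said in the comments, writing a grammar for such a language is nearly impossible, at least if you try to do it in one go. The only way I see this working is with multiple parsers, where the first parsing stage parses the wiki source very coarsely. For example, a table would be tokenized as a single token:
TABLE : '{|' .* '|}'
You would then create another parser that parses this table properly. Doing it all in one parser will result in quite a few ambiguities in your parser rules, IMO.
About emitting HTML code: the "proper" way to do this is indeed with StringTemplate, but given that you're rather new to ANTLR itself, I'd keep things simple. You could create a StringBuilder attribute in your parser class that collects all the HTML code as you parse your source file. You can embed code in ANTLR rules by wrapping it in { and }. Here's a quick demo: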
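A minimal sketch of that StringBuilder approach, assuming a cut-down token set (two heading levels, bold, italic and plain text) rather than the full grammar from the question. The grammar name WikiDemo, the html field and the getHtml() accessor are hypothetical names chosen for illustration, and the sketch has not been run against a particular ANTLR 3 release:

```
grammar WikiDemo;

@members {
  // Collects the generated HTML while the parser walks the input.
  private final StringBuilder html = new StringBuilder();

  // Hypothetical accessor: call it after document() has returned.
  public String getHtml() {
    return html.toString();
  }
}

document
  : line (NL line?)* EOF
  ;

line
  : heading
  | paragraph
  ;

// Each alternative appends its HTML as soon as the alternative has matched.
heading
  : H2 plain H2 { html.append("<h2>").append($plain.text.trim()).append("</h2>\n"); }
  | H1 plain H1 { html.append("<h1>").append($plain.text.trim()).append("</h1>\n"); }
  ;

// The first action runs before any inline content, the second after all of it.
paragraph
  : { html.append("<p>"); } inline+ { html.append("</p>\n"); }
  ;

inline
  : BOLD plain BOLD     { html.append("<b>").append($plain.text).append("</b>"); }
  | ITALIC plain ITALIC { html.append("<i>").append($plain.text).append("</i>"); }
  | CHARACTER           { html.append($CHARACTER.text); }
  | SPACE               { html.append($SPACE.text); }
  ;

plain
  : (CHARACTER | SPACE)+
  ;

H2        : '==';
H1        : '=';
BOLD      : '\'\'\'';
ITALIC    : '\'\'';
NL        : '\r'? '\n';
SPACE     : ' ' | '\t';
CHARACTER : ~('=' | '\'' | '\r' | '\n' | ' ' | '\t');
```

With the ANTLR 3 runtime on the classpath, the generated classes would be used by wrapping the source in an ANTLRStringStream, feeding that to WikiDemoLexer, wrapping the lexer in a CommonTokenStream, constructing WikiDemoParser from it, calling document() and then reading getHtml(). The sketch deliberately ignores several things a real translator needs: a lone apostrophe or an '=' inside a paragraph is not handled, and the text is appended without HTML escaping.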
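From that grammar, you generate a parser and lexer:
and then create a little class to test your parser:
and then compile all your source files:
and finally, run your main class,
which will print the following to the console: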
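But, again, if you are free to choose a different language to parse, I'd do that and forget about parsing this horrible wiki markup.
Anyway, whatever you do: best of luck!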