A working example of wikitext-to-HTML in ANTLR 3

Posted 2024-10-14 11:08:05

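I'm trying to flesh out a wikitext-to-HTML translator in ANTLR 3, but I keep getting stuck.

Do you know of a working example that I can inspect? I tried the MediaWiki ANTLR grammar and the Wiki Creole grammar, but I can't get them to generate the lexer and parser in ANTLR 3.

Here are the links to the two grammars I've tried using:

I can't get either of these two to generate my Java lexer and parser. (I'm using ANTLR 3 as an Eclipse plugin.) MediaWiki takes a very long time to build and then at some point throws an OutOfMemory exception. The other one has errors in it which I don't know how to debug.

EDIT: Okay, I've got a very basic grammar: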

grammar wikitext;

options {
  //output = AST;
  //ASTLabelType = CommonTree;
  output = template;
  language = Java;
}

document: line (NL line?)*;

line: horizontal_line | list | heading | paragraph;

/* horizontal line */
horizontal_line: HRLINE;

/* lists */
list: unordered_list | ordered_list;

unordered_list: '*'+ content;
ordered_list: '#'+ content;

/* Headings */
heading: heading1 | heading2 | heading3 | heading4 | heading5 | heading6;
heading1: H1 plain H1;
heading2: H2 plain H2;
heading3: H3 plain H3;
heading4: H4 plain H4;
heading5: H5 plain H5;
heading6: H6 plain H6;

/* Paragraph */
paragraph: content;

content: (formatted | link)+;

/* links */
link: external_link | internal_link;

external_link: '[' external_link_uri ('|' external_link_title)? ']';
internal_link: '[[' internal_link_ref ('|' internal_link_title)? ']]' ;

external_link_uri: CHARACTER+;
external_link_title: plain;
internal_link_ref: plain;
internal_link_title: plain;

/* bold & italic */
formatted: bold_italic | bold | italic | plain;

bold_italic: BOLD_ITALIC plain BOLD_ITALIC;
bold: BOLD plain BOLD;
italic: ITALIC plain ITALIC;

/* Plain text */
plain: (CHARACTER | SPACE)+;


/**
 * LEXER RULES
 * --------------------------------------------------------------------------
 */

HRLINE: '---' '-'+;

H1: '=';
H2: '==';
H3: '===';
H4: '====';
H5: '=====';
H6: '======';

BOLD_ITALIC: '\'\'\'\'\'';
BOLD: '\'\'\'';
ITALIC: '\'\'';

NL: '\r'?'\n';

CHARACTER       :       '!' | '"' | '#' | '$' | '%' | '&'
                |       '*' | '+' | ',' | '-' | '.' | '/'
                |       ':' | ';' | '?' | '@' | '\\' | '^' | '_' | '`' | '~'
                |       '0'..'9' | 'A'..'Z' |'a'..'z' 
                |       '\u0080'..'\u7fff'
                |       '(' | ')'
                |       '\'' | '<' | '>' | '=' | '[' | ']' | '|' 
                ;

SPACE: ' ' | '\t';

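It's not clear to me, though, how one would go about outputting HTML. I've been looking into StringTemplate, but I don't understand how to structure my templates, specifically which template goes where in the grammar. Can you help me with a short example?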



Comments (1)

陪我终i 2024-10-21 11:08:05

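Okay, after your EDIT, I have a couple of recommendations.

As I said in the comments, writing a grammar for such a language is nearly impossible, at least in one go. The only way I see this working is with multiple parsers, where the first parsing stage parses the wiki source very coarsely. For example, a table would be tokenized as TABLE : '{|' .* '|}', and you would then create another parser that parses this table properly. Doing it all in one parser will, IMO, result in quite a few ambiguities in your parser rules.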
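The two-stage idea can be sketched in plain Java, without ANTLR, just to show the shape of the pipeline. This is only an illustration under assumptions: the class name `CoarsePass`, the nested `Block` type, and the use of `java.util.regex` for the coarse stage are hypothetical stand-ins for a generated lexer, and nested tables are not handled. The pattern plays roughly the same role as the TABLE rule above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CoarsePass {

    // A coarse block: either a whole table or a run of other text.
    public static final class Block {
        public final String kind;
        public final String text;

        Block(String kind, String text) {
            this.kind = kind;
            this.text = text;
        }
    }

    // Reluctant match from "{|" to the first following "|}", across lines.
    private static final Pattern TABLE =
            Pattern.compile("\\{\\|.*?\\|\\}", Pattern.DOTALL);

    // Stage 1: cut the source into TABLE and TEXT blocks without
    // looking inside the tables at all.
    public static List<Block> split(String source) {
        List<Block> blocks = new ArrayList<>();
        Matcher m = TABLE.matcher(source);
        int last = 0;
        while (m.find()) {
            if (m.start() > last) {
                blocks.add(new Block("TEXT", source.substring(last, m.start())));
            }
            blocks.add(new Block("TABLE", m.group()));
            last = m.end();
        }
        if (last < source.length()) {
            blocks.add(new Block("TEXT", source.substring(last)));
        }
        return blocks;
    }

    public static void main(String[] args) {
        String source = "intro\n{|\n| cell\n|}\noutro";
        for (Block b : split(source)) {
            // Stage 2 would hand each TABLE block to a dedicated table parser.
            System.out.println(b.kind + ": " + b.text.replace("\n", "\\n"));
        }
    }
}
```

Each TABLE block comes out as one opaque chunk, so the grammar for the rest of the page never has to deal with table syntax.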

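As for emitting HTML, the "proper" way to do this is indeed with StringTemplate, but since you're rather new to ANTLR itself, I'd keep things simple. You could create a StringBuilder attribute in your parser class that collects all the HTML as you parse your source file. You can embed code in ANTLR rules by wrapping it in { and }.

Here's a quick demo: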

grammar T;

@parser::members {

  // an attribute that is only available in your 
  // parser (so only in parser rules!)
  protected StringBuilder htmlBuilder = new StringBuilder();
}

// Parser rules
parse
  :  atom+ EOF
  ;

atom
  :  header
  |  Any    {htmlBuilder.append($Any.text);} // append the text from 'Any' token
  ;

header
  :  H3 h3Content H3 {htmlBuilder.append("<h3>" + $h3Content.text + "</h3>");}
  |  H2 h2Content H2 {htmlBuilder.append("<h2>" + $h2Content.text + "</h2>");}
  |  H1 h1Content H1 {htmlBuilder.append("<h1>" + $h1Content.text + "</h1>");}
  ;

h3Content : ~H3*; // match any token except H3, zero or more times
h2Content : ~H2*; //        "               H2          "
h1Content : ~H1*; //        "               H1          "

// Lexer rules    
H3 : '===';
H2 : '==';
H1 : '=';

// Fall-through rule: if none of the above
// lexer rules matched, this one will.
Any
  :  .
  ;

From that grammar, you generate a parser and lexer:

java -cp antlr-3.2.jar org.antlr.Tool T.g

Then create a little class to test your parser:

import org.antlr.runtime.*;

public class Main {
    public static void main(String[] args) throws Exception {

        // the source to be parsed
        String source = 
                "= header 1 =             \n"+
                "                         \n"+
                "some text here           \n"+
                "                         \n"+
                "=== header level 3 ===   \n"+
                "                         \n"+
                "and some more text         ";

        ANTLRStringStream in = new ANTLRStringStream(source);
        TLexer lexer = new TLexer(in);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        TParser parser = new TParser(tokens);

        // invoke the start-rule in your parser
        parser.parse();

        // print the contents of your parser's StringBuilder
        System.out.println(parser.htmlBuilder);
    }
}

Then compile all your source files:

javac -cp antlr-3.2.jar *.java

Finally, run your main class:

// *nix & MacOS
java -cp .:antlr-3.2.jar Main

// Windows
java -cp .;antlr-3.2.jar Main

This prints the following to the console:

<h1> header 1 </h1>             

some text here           

<h3> header level 3 </h3>   

and some more text  
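Note that the demo appends token text verbatim, so a `<` or `&` in the wiki source would reach the output unescaped. A minimal sketch of a helper the embedded actions could call instead, for example `htmlBuilder.append(HtmlEscape.escape($Any.text))`, is shown here. The class and method names are hypothetical and not part of ANTLR, and only the four most common metacharacters are covered.

```java
public class HtmlEscape {

    // Replace the HTML metacharacters & < > " with their entities.
    public static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;");  break;
                case '<': out.append("&lt;");   break;
                case '>': out.append("&gt;");   break;
                case '"': out.append("&quot;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("a < b & c")); // prints: a &lt; b &amp; c
    }
}
```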

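But, again, if you are free to choose a different language to parse, I'd do that and forget about parsing this horrible Wiki-thing.

Anyway, whatever you do: best of luck!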

