MismatchedTokenException in an HTML subset grammar

Posted on 2024-09-06 20:14:00

I am writing an ANTLR grammar to recognize HTML block-level elements within plain text. Here is a relevant snippet, limited to the div tag:

grammar Test;

blockElement
  : div
  ;

div
  : '<' D I V HTML_ATTRIBUTES? '>' (blockElement | TEXT)* '</' D I V '>'
  ;

D : ('d' | 'D') ;
I : ('i' | 'I') ;
V : ('v' | 'V') ;

HTML_ATTRIBUTES
  : WS (~( '<' | '\n' | '\r' | '"' | '>' ))+
  ;

TEXT
  : (. | '\r' | '\n')
  ;

fragment WS
  : (' ' | '\t')
  ;

The TEXT token is supposed to represent anything that is not a block-level element, such as plain text or inline tags (e.g. <b></b>). When I test it on nested block elements, like:

<div level_0><div level_1></div></div>

it parses them correctly. However, as soon as I add some random text, it throws a MismatchedTokenException(0!=0) right after consuming the first TEXT token, e.g. the capital T in:

<div level_0>This is some random text</div>

Any suggestions? Am I doing something conceptually wrong? I am using ANTLR 3.2 and testing with ANTLRWorks 1.4.

Thank you


Comments (1)

巡山小妖精 2024-09-13 20:14:00

I recommend not testing your grammar with ANTLRWorks: error messages are easily missed in its console, so it might not interpret your test input the way you expect. Test with a small custom class instead, like this:

import org.antlr.runtime.*;

public class Main {
    public static void main(String[] args) throws Exception {
        ANTLRStringStream in = new ANTLRStringStream("<div level_0>This is some random text</div>");
        TestLexer lexer = new TestLexer(in);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        TestParser parser = new TestParser(tokens);
        parser.blockElement(); // invoke the grammar's entry rule
    }
}

Now, the following rule is not correct:

TEXT
  :  (. | '\r' | '\n')
  ;

The . already matches both \r and \n, so it should be:

TEXT
  :  .
  ;

After changing that, you can generate the parser and lexer, compile all .java files and run the Main class:

java -cp antlr-3.2.jar org.antlr.Tool Test.g
javac -cp antlr-3.2.jar *.java
java -cp .:antlr-3.2.jar Main

which will produce the following error:

line 1:15 mismatched input 'i' expecting '</'

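because the i in This is tokenized by the rule I : ('i' | 'I') ; and not as TEXT: both rules match that character, and I is defined first, so the parser receives an I token where it can only accept TEXT, a nested block element, or '</'.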
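
To make the tokenization problem concrete, here is a minimal Java sketch that mimics how the single-character rules D, I, V and TEXT compete for each character of the text content. It is only a simulation under the assumption that the rule listed first wins when several rules match the same character; the class TokenPriorityDemo and its methods are hypothetical names and do not use the ANTLR runtime or the generated TestLexer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the generated lexer: it only imitates rule
// priority for single characters and is not part of ANTLR.
final class TokenPriorityDemo {

    // Rules in grammar order: D, I, V first, then TEXT as the catch-all.
    static String tokenTypeOf(char c) {
        switch (Character.toLowerCase(c)) {
            case 'd': return "D";
            case 'i': return "I";
            case 'v': return "V";
            default:  return "TEXT";
        }
    }

    // Classifies every character of the input independently.
    static List<String> tokenize(String input) {
        List<String> types = new ArrayList<>();
        for (char c : input.toCharArray()) {
            types.add(tokenTypeOf(c));
        }
        return types;
    }

    public static void main(String[] args) {
        // The third token is I, which (blockElement | TEXT)* cannot accept.
        System.out.println(tokenize("This"));
    }
}
```

Under these assumptions, any d, i or v inside the text content (in either case) produces a token the div rule does not allow at that position, which is why the nested-div test passes while the plain-text test fails.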

There are more problems with your current approach:

  • HTML_ATTRIBUTES does too much: you should instead have separate ATTRIBUTE, = and VALUE rules, and move the plural (the list of HTML attributes) into your parser;
  • as written, your attributes cannot contain < and >, which is incorrect (they can contain them, although it is not recommended).

I'd start over if I were you. If you want, I'm willing to propose a starting point: just say so.
