MismatchedTokenException in an HTML subset grammar

Posted on 2024-09-06 20:14:00

I am writing an ANTLR grammar to recognize HTML block-level elements within plain text. Here is a relevant snippet, limited to the div tag:

grammar Test;

blockElement
  : div
  ;

div
  : '<' D I V HTML_ATTRIBUTES? '>' (blockElement | TEXT)* '</' D I V '>'
  ;

D : ('d' | 'D') ;
I : ('i' | 'I') ;
V : ('v' | 'V') ;

HTML_ATTRIBUTES
  : WS (~( '<' | '\n' | '\r' | '"' | '>' ))+
  ;

TEXT
  : (. | '\r' | '\n')
  ;

fragment WS
  : (' ' | '\t')
  ;

The TEXT token is supposed to represent anything that is not a block-level element, such as plain text or inline tags (e.g. <b></b>). When I test it on nested block elements, like:

<div level_0><div level_1></div></div>

it parses them correctly. However, as soon as I add some random text, it throws a MismatchedTokenException(0!=0) right after consuming the first TEXT token, e.g. the capital T in:

<div level_0>This is some random text</div>

Any suggestions? Am I doing something conceptually wrong? I am using ANTLR v. 3.2 and doing the testing with ANTLRWorks v. 1.4.

Thank you

巡山小妖精 2024-09-13 20:14:00

I recommend not testing your grammar with ANTLRWorks: error messages are easily missed in its console, so it may not interpret your test input the way you expect. Use a small custom class instead, like this:

import org.antlr.runtime.*;

public class Main {
    public static void main(String[] args) throws Exception {
        // Feed the failing input straight to the generated lexer and parser.
        ANTLRStringStream in = new ANTLRStringStream("<div level_0>This is some random text</div>");
        TestLexer lexer = new TestLexer(in);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        TestParser parser = new TestParser(tokens);
        // Invoke the start rule; blockElement is the top-level rule of the grammar in the question.
        parser.blockElement();
    }
}

Now, the following rule is not correct:

TEXT
  :  (. | '\r' | '\n')
  ;

The . already matches both \r and \n, so it should be:

TEXT
  :  .
  ;

After changing that, you can generate the lexer and parser, compile all .java files, and run the Main class:

java -cp antlr-3.2.jar org.antlr.Tool Test.g
javac -cp antlr-3.2.jar *.java
java -cp .:antlr-3.2.jar Main

which will produce the following error:

line 1:15 mismatched input 'i' expecting '</'

because the i in This is tokenized by the lexer rule I : ('i' | 'I') ; rather than as TEXT.
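
To make that concrete: the lexer tokenizes the input without knowing which token the parser expects, and when several lexer rules match the same single character, the rule defined first in the grammar wins. The following is a minimal Java sketch of that behavior, not ANTLR-generated code. The class LexerPriorityDemo and its methods are hypothetical names, and it models only the single-character rules D, I, V and TEXT from the question.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the generated TestLexer; this is not ANTLR code.
class LexerPriorityDemo {

    // Rules are tried in grammar order: D, I, V first, TEXT last.
    static String ruleFor(char c) {
        switch (Character.toLowerCase(c)) {
            case 'd': return "D";
            case 'i': return "I";
            case 'v': return "V";
            default:  return "TEXT"; // TEXT : . (any other single character)
        }
    }

    // Assigns one token type per character, the way the single-character rules would.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        for (char c : input.toCharArray()) {
            tokens.add(ruleFor(c));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The third character of "This" becomes an I token, not TEXT.
        System.out.println(tokenize("This")); // prints [TEXT, TEXT, I, TEXT]
    }
}
```

The parser rule (blockElement | TEXT)* cannot continue on an I token, which is why it reports that it expected '</' at that point.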

There are more problems with your current approach:

HTML_ATTRIBUTES does too much: you should have separate ATTRIBUTE, = and VALUE rules, and move the plural (the list of HTML attributes) into your parser;
as written, your attributes cannot contain < or >, which is incorrect (they can contain them, although it is not recommended).
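
The idea behind both points can be sketched outside ANTLR. The Java example below is only an illustration under stated assumptions: the class AttributeTokensDemo, the token labels NAME, EQUALS and VALUE, and the regular expression are all hypothetical, and it handles double-quoted values only. It shows the attribute text being split into small tokens, with < and > allowed inside a quoted value, so that a parser could then assemble the tokens into a list of attributes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; the token labels are assumptions, not ANTLR output.
class AttributeTokensDemo {

    // Optional whitespace, then exactly one of: attribute name, '=', or a double-quoted value.
    private static final Pattern TOKEN = Pattern.compile(
        "\\s*(?:(?<name>[A-Za-z_][A-Za-z0-9_-]*)|(?<eq>=)|\"(?<value>[^\"]*)\")");

    static List<String> tokenize(String attrs) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(attrs);
        int pos = 0;
        while (pos < attrs.length()) {
            m.region(pos, attrs.length());
            if (!m.lookingAt()) {
                break; // stop at the first character that starts no token
            }
            if (m.group("name") != null) {
                out.add("NAME(" + m.group("name") + ")");
            } else if (m.group("eq") != null) {
                out.add("EQUALS");
            } else {
                out.add("VALUE(" + m.group("value") + ")");
            }
            pos = m.end();
        }
        return out;
    }

    public static void main(String[] args) {
        // The quoted value may contain '<' and '>' without ending the tag.
        System.out.println(tokenize(" id=\"a\" title=\"x < y > z\""));
    }
}
```

In an ANTLR grammar the same split would live in separate lexer rules, with a parser rule repeating the name, =, value sequence.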

I'd start over if I were you. If you want, I'm willing to propose a start: just say so.
