为什么会发生这个错误? - “以下选项永远无法匹配”
我对制作编译器、解析器和解析器生成器很感兴趣,但我对它们了解不多。
在阅读了这个问题的答案后,我尝试做出“非常”的回答简单的 LaTeX 解析器。
这是代码:
grammar Latex;
latex : ITEM*;
ITEM : CMD|LAWTEXT;
CMD : CHEAD ARGS;
CHEAD : '\\' LETTER(LETTER|DIGIT)*;
LETTER : 'A'..'Z'|'a'..'z';
DIGIT : '0'|'1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9';
ARGS : '{' ITEM* '}';
LAWTEXT : (LETTER|DIGIT|WHITESPACE|PUNC)*;
WHITESPACE
: ' '|'\t'|'\n'|'\r';
PUNC : '!'|'^';
(PUNC 中只有两个字符,用于测试目的)
这是错误消息:
[18:39:09] warning(200): C:\Users\***\Documents\Latex.g:9:12: Decision can match input such as "{'\t'..'\n', '\r', ' '..'!', '0'..'9', 'A'..'Z', '\\', '^', 'a'..'z', '}'}" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
[18:39:09] error(201): C:\Users\***\Documents\Latex.g:9:12: The following alternatives can never be matched: 2
[18:39:09] error(211): C:\Users\***\Documents\Latex.g:1:8: [fatal] rule Tokens has non-LL(*) decision due to recursive rule invocations reachable from alts 1,2. Resolve by left-factoring or using syntactic predicates or using backtrack=true option.
我发现发生此错误是因为存在歧义,代码可以用两种以上的方式解释,但我没有知道这种歧义是如何产生的。
这是图表和可以解释某些事物的两种方式(也许)。
...但是 \
和 }
可以如何感到困惑吗?
I'm interested in making compiler, parser, and parser generator, but I don't know much about them.
After reading an answer of this question, I tried to make a 'very' simple LaTeX parser.
This is the code:
grammar Latex;
latex : ITEM*;
ITEM : CMD|LAWTEXT;
CMD : CHEAD ARGS;
CHEAD : '\\' LETTER(LETTER|DIGIT)*;
LETTER : 'A'..'Z'|'a'..'z';
DIGIT : '0'|'1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9';
ARGS : '{' ITEM* '}';
LAWTEXT : (LETTER|DIGIT|WHITESPACE|PUNC)*;
WHITESPACE
: ' '|'\t'|'\n'|'\r';
PUNC : '!'|'^';
(There's only two char in PUNC, for test purpose)
And this is the error message:
[18:39:09] warning(200): C:\Users\***\Documents\Latex.g:9:12: Decision can match input such as "{'\t'..'\n', '\r', ' '..'!', '0'..'9', 'A'..'Z', '\\', '^', 'a'..'z', '}'}" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
[18:39:09] error(201): C:\Users\***\Documents\Latex.g:9:12: The following alternatives can never be matched: 2
[18:39:09] error(211): C:\Users\***\Documents\Latex.g:1:8: [fatal] rule Tokens has non-LL(*) decision due to recursive rule invocations reachable from alts 1,2. Resolve by left-factoring or using syntactic predicates or using backtrack=true option.
I found that this error happens because there's an ambiguity, a code can be interpreted in more than two ways, but I have no idea how this ambiguity is generated.
And this is the diagram and two ways something can be interpreted (maybe).
... but how \
and }
can be confused?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
ANTLR 根据您编写的语法为您创建词法分析器和解析器。 ANTLR 本身就是解析器生成器,因此您不需要编写解析器生成器本身(幸运的是!)。编译器是一个应用程序,它获取解析器生成的树并将输入转换为其他形式:这是您需要自己做的事情。所以,强调一下:ANTLR 仅帮助您为您的语言创建一个解析器,其余的取决于您。
现在,问题。
您的语法几乎只包含词法分析器规则。 Lexer 规则以大写字母开头,用于标记您的输入源。因此,像这样的规则:
可能会导致词法分析器自行创建一个
LETTER
标记。如果您总是希望将小写或大写 ascii 字母变成LAWTEXT
标记,那么您需要将LETTER
设置为如下片段规则:正如您所看到的,我使用
+
而不是*
结束LAWTEXT
规则:您不想创建不包含任何内容(空字符串)的标记。此外,
args
、item
和cmd
不适合词法分析器规则:它们应该是解析器规则。这是生成词法分析器和解析器而没有任何错误的语法:
编辑
正如我已经提到的:词法分析器规则以大写字母开头,而解析器规则以小写字母开头。词法分析器有时称为分词器或扫描器,负责分割输入源。输入源一开始只是一个字符流。然后,词法分析器将这些字符分组在一起。因此,给定以下词法分析器规则:
可以采用如下输入源:
词法分析器将其视为:
并将创建以下标记:
Identifier "foo"
Assign "="
(请注意,没有从空格创建标记:我跳过了这些!)
词法分析器从输入源创建标记后,会将这些标记传递给解析器。赋值解析器规则可能如下所示:
重要的是要记住,输入源首先由词法分析器进行标记,只有在该过程之后,解析器规则才会发挥作用。
ANTLR creates a lexer and parser for you based on the grammar you write. ANTLR itself is the parser generator, so you don't write the parser generator itself (luckily!). A compiler is an application that takes the tree your parser generates and translates the input into some other form: this is something you need to do yourself. So, to emphasize: ANTLR only helps you create a parser for your language, the rest is up to you.
Now, the problem(s).
Your grammar contains almost only lexer rules. Lexer rules start with a capital letter and are used to tokenize your input source. So rules like these:
might cause the lexer to create a
LETTER
token on its own. If you always want a lower- or uppercase ascii letter to become aLAWTEXT
token, then you need to makeLETTER
a fragment rule like this:And as you can see, I ended the
LAWTEXT
rule with a+
instead of a*
: you don't want to create tokens that contain nothing (empty string).Also,
args
,item
andcmd
are no suitable candidates for lexer rule: they should be parser rules instead.Here's a grammar that generates a lexer and parser without any errors:
EDIT
As I already mentioned: lexer rules start with a capital letter while parser rules start with a lower case letter. A lexer, which is sometimes called a tokenizer or scanner, is responsible for chopping up the input source. The input source starts of as being just a stream of characters. These characters are then grouped together by the lexer. So, given the following lexer rules:
could take input source like:
which the lexer sees as:
and will create the following tokens:
Identifier "foo"
Assign "="
Number "12.34"
(note that there are no tokens being created from the white spaces: I skipped these!)
After the lexer has created tokens from your input source, the parser is passed these tokens. An assignment parser rule could then look like:
It is important to keep in mind that the input source is first tokenized by the lexer and only after that process, the parser rules come into play.