为什么会发生这个错误? - “以下选项永远无法匹配”

发布于 2024-11-09 07:32:16 字数 1468 浏览 8 评论 0原文

我对制作编译器、解析器和解析器生成器很感兴趣,但我对它们了解不多。

在阅读了这个问题的答案后,我尝试做出“非常”的回答简单的 LaTeX 解析器。

这是代码:

grammar Latex;

latex   :   ITEM*;
ITEM    :   CMD|LAWTEXT;
CMD :   CHEAD ARGS;
CHEAD   :   '\\' LETTER(LETTER|DIGIT)*;
LETTER  :   'A'..'Z'|'a'..'z';
DIGIT   :   '0'|'1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9';
ARGS    :   '{' ITEM* '}';
LAWTEXT :   (LETTER|DIGIT|WHITESPACE|PUNC)*;
WHITESPACE
    :   ' '|'\t'|'\n'|'\r';
PUNC    :   '!'|'^';

(PUNC 中只有两个字符,用于测试目的)

这是错误消息:

[18:39:09] warning(200): C:\Users\***\Documents\Latex.g:9:12: Decision can match input such as "{'\t'..'\n', '\r', ' '..'!', '0'..'9', 'A'..'Z', '\\', '^', 'a'..'z', '}'}" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
[18:39:09] error(201): C:\Users\***\Documents\Latex.g:9:12: The following alternatives can never be matched: 2

[18:39:09] error(211): C:\Users\***\Documents\Latex.g:1:8: [fatal] rule Tokens has non-LL(*) decision due to recursive rule invocations reachable from alts 1,2.  Resolve by left-factoring or using syntactic predicates or using backtrack=true option.

我发现发生此错误是因为存在歧义,代码可以用两种以上的方式解释,但我没有知道这种歧义是如何产生的。

这是图表和可以解释某些事物的两种方式(也许)。

图表。

...但是 \} 可以如何感到困惑吗?

I'm interested in making compiler, parser, and parser generator, but I don't know much about them.

After reading an answer of this question, I tried to make a 'very' simple LaTeX parser.

This is the code:

grammar Latex;

latex   :   ITEM*;
ITEM    :   CMD|LAWTEXT;
CMD :   CHEAD ARGS;
CHEAD   :   '\\' LETTER(LETTER|DIGIT)*;
LETTER  :   'A'..'Z'|'a'..'z';
DIGIT   :   '0'|'1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9';
ARGS    :   '{' ITEM* '}';
LAWTEXT :   (LETTER|DIGIT|WHITESPACE|PUNC)*;
WHITESPACE
    :   ' '|'\t'|'\n'|'\r';
PUNC    :   '!'|'^';

(There's only two char in PUNC, for test purpose)

And this is the error message:

[18:39:09] warning(200): C:\Users\***\Documents\Latex.g:9:12: Decision can match input such as "{'\t'..'\n', '\r', ' '..'!', '0'..'9', 'A'..'Z', '\\', '^', 'a'..'z', '}'}" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
[18:39:09] error(201): C:\Users\***\Documents\Latex.g:9:12: The following alternatives can never be matched: 2

[18:39:09] error(211): C:\Users\***\Documents\Latex.g:1:8: [fatal] rule Tokens has non-LL(*) decision due to recursive rule invocations reachable from alts 1,2.  Resolve by left-factoring or using syntactic predicates or using backtrack=true option.

I found that this error happens because there's an ambiguity, a code can be interpreted in more than two ways, but I have no idea how this ambiguity is generated.

And this is the diagram and two ways something can be interpreted (maybe).

The diagram.

... but how \ and } can be confused?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

不知所踪 2024-11-16 07:32:16

JiminP 写道:

我对制作编译器、解析器和解析器生成器感兴趣,但我对它们了解不多。

ANTLR 根据您编写的语法为您创建词法分析器和解析器。 ANTLR 本身就是解析器生成器,因此您不需要编写解析器生成器本身(幸运的是!)。编译器是一个应用程序,它获取解析器生成的树并将输入转换为其他形式:这是您需要自己做的事情。所以,强调一下:ANTLR 帮助您为您的语言创建一个解析器,其余的取决于您。

现在,问题。

您的语法几乎只包含词法分析器规则。 Lexer 规则以大写字母开头,用于标记您的输入源。因此,像这样的规则:

LETTER  :   'A'..'Z'|'a'..'z';
...
LAWTEXT :   (LETTER|DIGIT|WHITESPACE|PUNC)*;

可能会导致词法分析器自行创建一个 LETTER 标记。如果您总是希望将小写或大写 ascii 字母变成 LAWTEXT 标记,那么您需要将 LETTER 设置为如下片段规则:

fragment LETTER  :   'A'..'Z'|'a'..'z';
...
LAWTEXT :   (LETTER|DIGIT|WHITESPACE|PUNC)+;

正如您所看到的,我使用 + 而不是 * 结束 LAWTEXT 规则:您不想创建不包含任何内容(空字符串)的标记。

此外,argsitemcmd 不适合词法分析器规则:它们应该是解析器规则。

这是生成词法分析器和解析器而没有任何错误的语法:

grammar Latex;

latex
  :  item* EOF
  ;

item 
  :  cmd
  |  LAWTEXT
  ;

cmd
  :  CHEAD args
  ;

args
  :  '{' item* '}'
  ;

CHEAD 
  :  '\\' LETTER (LETTER | DIGIT)*
  ;  

LAWTEXT
  :  (LETTER | DIGIT | WHITESPACE | PUNC)+
  ;

fragment  
WHITESPACE 
  :  ' ' | '\t' | '\n' | '\r'
  ;

fragment  
PUNC       
  : '!' | '^'
  ;

fragment
LETTER
  :  'A'..'Z' | 'a'..'z'
  ;

fragment
DIGIT
  :  '0'..'9'
  ;

编辑

正如我已经提到的:词法分析器规则以大写字母开头,而解析器规则以小写字母开头。词法分析器有时称为分词器或扫描器,负责分割输入源。输入源一开始只是一个字符流。然后,词法分析器将这些字符分组在一起。因此,给定以下词法分析器规则:

Identifier
  :  (Letter | '_') (Letter | '_' | Digit)*
  ;

Assign
  :  '='
  ;

Number
  :  Digit+ ('.' Digit+)?
  ;

fragment Digit
  :  '0'..'9'
  ;

fragment Letter
  :  'a'..'z' | 'A'..'Z'
  ;

Spaces
  :  (' ' | '\t' | '\r' | '\n') {skip();}
  ;

可以采用如下输入源:

foo = 12.34

词法分析器将其视为:

'f', 'o', 'o', ' ', '=', ' ', '1', '2', '.', '3', '4', EOF

并将创建以下标记:

  • Identifier "foo"
  • Assign "="
  • < code>Number "12.34"

(请注意,没有从空格创建标记:我跳过了这些!)

词法分析器从输入源创建标记后,会将这些标记传递给解析器。赋值解析器规则可能如下所示:

assignment
  :  Identifier Assign Number
  ;

重要的是要记住,输入源首先由词法分析器进行标记,只有在该过程之后,解析器规则才会发挥作用。

JiminP wrote:

I'm interested in making compiler, parser, and parser generator, but I don't know much about them.

ANTLR creates a lexer and parser for you based on the grammar you write. ANTLR itself is the parser generator, so you don't write the parser generator itself (luckily!). A compiler is an application that takes the tree your parser generates and translates the input into some other form: this is something you need to do yourself. So, to emphasize: ANTLR only helps you create a parser for your language, the rest is up to you.

Now, the problem(s).

Your grammar contains almost only lexer rules. Lexer rules start with a capital letter and are used to tokenize your input source. So rules like these:

LETTER  :   'A'..'Z'|'a'..'z';
...
LAWTEXT :   (LETTER|DIGIT|WHITESPACE|PUNC)*;

might cause the lexer to create a LETTER token on its own. If you always want a lower- or uppercase ascii letter to become a LAWTEXT token, then you need to make LETTER a fragment rule like this:

fragment LETTER  :   'A'..'Z'|'a'..'z';
...
LAWTEXT :   (LETTER|DIGIT|WHITESPACE|PUNC)+;

And as you can see, I ended the LAWTEXT rule with a + instead of a *: you don't want to create tokens that contain nothing (empty string).

Also, args, item and cmd are no suitable candidates for lexer rule: they should be parser rules instead.

Here's a grammar that generates a lexer and parser without any errors:

grammar Latex;

latex
  :  item* EOF
  ;

item 
  :  cmd
  |  LAWTEXT
  ;

cmd
  :  CHEAD args
  ;

args
  :  '{' item* '}'
  ;

CHEAD 
  :  '\\' LETTER (LETTER | DIGIT)*
  ;  

LAWTEXT
  :  (LETTER | DIGIT | WHITESPACE | PUNC)+
  ;

fragment  
WHITESPACE 
  :  ' ' | '\t' | '\n' | '\r'
  ;

fragment  
PUNC       
  : '!' | '^'
  ;

fragment
LETTER
  :  'A'..'Z' | 'a'..'z'
  ;

fragment
DIGIT
  :  '0'..'9'
  ;

EDIT

As I already mentioned: lexer rules start with a capital letter while parser rules start with a lower case letter. A lexer, which is sometimes called a tokenizer or scanner, is responsible for chopping up the input source. The input source starts of as being just a stream of characters. These characters are then grouped together by the lexer. So, given the following lexer rules:

Identifier
  :  (Letter | '_') (Letter | '_' | Digit)*
  ;

Assign
  :  '='
  ;

Number
  :  Digit+ ('.' Digit+)?
  ;

fragment Digit
  :  '0'..'9'
  ;

fragment Letter
  :  'a'..'z' | 'A'..'Z'
  ;

Spaces
  :  (' ' | '\t' | '\r' | '\n') {skip();}
  ;

could take input source like:

foo = 12.34

which the lexer sees as:

'f', 'o', 'o', ' ', '=', ' ', '1', '2', '.', '3', '4', EOF

and will create the following tokens:

  • Identifier "foo"
  • Assign "="
  • Number "12.34"

(note that there are no tokens being created from the white spaces: I skipped these!)

After the lexer has created tokens from your input source, the parser is passed these tokens. An assignment parser rule could then look like:

assignment
  :  Identifier Assign Number
  ;

It is important to keep in mind that the input source is first tokenized by the lexer and only after that process, the parser rules come into play.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文