扫描仪(使用 ANTLR 对关键字进行词法分析)

发布于 2024-12-02 18:59:12 字数 760 浏览 1 评论 0原文

我一直致力于为我的程序编写一个扫描器,大多数在线教程都包含解析器和扫描器。如果不同时编写解析器,似乎不可能编写词法分析器。我只是想生成令牌,而不是解释它们。我想识别 INT 标记、浮点标记和一些标记,例如“开始”和“结束”,

我对如何匹配关键字感到困惑。我尝试了以下方法但没有成功:

KEYWORD : KEY1 | KEY2;

KEY1 : {input.LT(1).getText().equals("BEGIN")}? LETTER+ ;
KEY2 : {input.LT(1).getText().equals("END")}? LETTER+ ;

FLOATLITERAL_INTLITERAL
  : DIGIT+ 
  ( 
    { input.LA(2) != '.' }? => '.' DIGIT* { $type = FLOATLITERAL; }
    | { $type = INTLITERAL; }
  )
  | '.'  DIGIT+ {$type = FLOATLITERAL}
;

fragment LETTER : ('a'..'z' | 'A'..'Z');
fragment DIGIT  : ('0'..'9');

IDENTIFIER 
 : LETTER 
   | LETTER DIGIT (LETTER|DIGIT)+ 
   | LETTER LETTER (LETTER|DIGIT)*
 ;

WS  //Whitespace
  : (' ' | '\t' | '\n' | '\r' | '\f')+  {$channel = HIDDEN;}
;  

I have been working on writing a scanner for my program and most of the tutorials online include a parser along with the scanner. It doesn't seem possible to write a lexer without writing a parser at the same time. I am only trying to generate tokens, not interpret them. I want to recognize INT tokens, float tokens, and some tokens like "begin" and "end"

I am confused about how to match keywords. I unsuccessfully tried the following:

KEYWORD : KEY1 | KEY2;

KEY1 : {input.LT(1).getText().equals("BEGIN")}? LETTER+ ;
KEY2 : {input.LT(1).getText().equals("END")}? LETTER+ ;

FLOATLITERAL_INTLITERAL
  : DIGIT+ 
  ( 
    { input.LA(2) != '.' }? => '.' DIGIT* { $type = FLOATLITERAL; }
    | { $type = INTLITERAL; }
  )
  | '.'  DIGIT+ {$type = FLOATLITERAL}
;

fragment LETTER : ('a'..'z' | 'A'..'Z');
fragment DIGIT  : ('0'..'9');

IDENTIFIER 
 : LETTER 
   | LETTER DIGIT (LETTER|DIGIT)+ 
   | LETTER LETTER (LETTER|DIGIT)*
 ;

WS  //Whitespace
  : (' ' | '\t' | '\n' | '\r' | '\f')+  {$channel = HIDDEN;}
;  

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

萌辣 2024-12-09 18:59:12

如果您只需要词法分析器,请以以下内容开始语法:

lexer grammar FooLexer; // creates: FooLexer.java

LT(int): Token 只能在解析器规则内使用(在 TokenStream)。在词法分析器规则中,您只能使用 LA(int): intIntStream。但没有必要手动查看所有内容。只需执行以下操作:

lexer grammar FooLexer;

BEGIN
  :  'BEGIN'
  ;

END
  :  'END'
  ;

FLOAT
  :  DIGIT+ '.' DIGIT+
  ;

INT
  :  DIGIT+
  ;

IDENTIFIER 
  :  LETTER (LETTER | DIGIT)*
  ;

WS
  :  (' ' | '\t' | '\n' | '\r' | '\f')+  {$channel = HIDDEN;}
  ; 

fragment LETTER : ('a'..'z' | 'A'..'Z');
fragment DIGIT  : ('0'..'9');

我认为不需要创建一个名为 KEYWORD 的标记来匹配所有关键字:您需要区分 BEGINEND 令牌,对吗?但如果您确实想要这样做,只需执行以下操作:

KEYWORD
  :  'BEGIN'
  |  'END'
  ;

并删除 BEGINEND 规则。只需确保在 IDENTIFIER 之前定义 KEYWORD 即可。

编辑

使用以下类测试词法分析器:

import org.antlr.runtime.*;

public class Main {
  public static void main(String[] args) throws Exception {
    String src = "BEGIN END 3.14159 42 FOO";
    FooLexer lexer = new FooLexer(new ANTLRStringStream(src));
    while(true) {
      Token token = lexer.nextToken();
      if(token.getType() == FooLexer.EOF) {
        break;
      }
      System.out.println(token.getType() + " :: " + token.getText());
    }
  }
}

如果生成词法分析器,请编译 .java 源文件并运行 Main 类,如下所示:

java -cp antlr-3.3.jar org.antlr.Tool FooLexer.g
javac -cp antlr-3.3.jar *.java
java -cp .:antlr-3.3.jar Main

以下输出将打印到控制台:

4 :: BEGIN
11 ::  
5 :: END
11 ::  
7 :: 3.14159
11 ::  
8 :: 42
11 ::  
10 :: FOO

If you only want a lexer, start your grammar with:

lexer grammar FooLexer; // creates: FooLexer.java

LT(int): Token can only be used inside parser rules (on a TokenStream). Inside lexer rules, you can only use LA(int): int that gets the next int (character) from the IntStream. But there is no need for all the manual look ahead. Just do something like this:

lexer grammar FooLexer;

BEGIN
  :  'BEGIN'
  ;

END
  :  'END'
  ;

FLOAT
  :  DIGIT+ '.' DIGIT+
  ;

INT
  :  DIGIT+
  ;

IDENTIFIER 
  :  LETTER (LETTER | DIGIT)*
  ;

WS
  :  (' ' | '\t' | '\n' | '\r' | '\f')+  {$channel = HIDDEN;}
  ; 

fragment LETTER : ('a'..'z' | 'A'..'Z');
fragment DIGIT  : ('0'..'9');

I don't see the need to create a token called KEYWORD that matches all keywords: you'll want to make a distinction between a BEGIN and END token, right? But if you really want this, simply do:

KEYWORD
  :  'BEGIN'
  |  'END'
  ;

and remove the BEGIN and END rules. Just make sure KEYWORD is defined before IDENTIFIER.

EDIT

Test the lexer with the following class:

import org.antlr.runtime.*;

public class Main {
  public static void main(String[] args) throws Exception {
    String src = "BEGIN END 3.14159 42 FOO";
    FooLexer lexer = new FooLexer(new ANTLRStringStream(src));
    while(true) {
      Token token = lexer.nextToken();
      if(token.getType() == FooLexer.EOF) {
        break;
      }
      System.out.println(token.getType() + " :: " + token.getText());
    }
  }
}

If you generate a lexer, compile the .java source files and run the Main class like this:

java -cp antlr-3.3.jar org.antlr.Tool FooLexer.g
javac -cp antlr-3.3.jar *.java
java -cp .:antlr-3.3.jar Main

the following output will be printed to the console:

4 :: BEGIN
11 ::  
5 :: END
11 ::  
7 :: 3.14159
11 ::  
8 :: 42
11 ::  
10 :: FOO
栀子花开つ 2024-12-09 18:59:12

[来自一个制作自定义词法分析器工具并仍在尝试学习 ANTLR 的人]

无聊的广泛答案:

你是对的。许多书籍和课程混合了这两种工具。有时“生成/检测令牌”和“解释令牌”可能会混合。

有时,开发人员尝试制作扫描仪,但仍然将扫描和扫描混合在一起。在头脑中进行解析;-)

通常,在检测标记时,您还必须执行一个操作(“解释”),就像将消息或找到的标记打印到字符串一样简单。
示例: "{ cout << "嘿,我发现了一个整数常量" << "\n" }"

还有几种情况可能会让初学者在该主题中难以浏览。

一种情况是多个文本可能用于不同的标记。

示例:

“-”作为二元减法运算符,“-”作为负前缀运算符。
或者,将 5 既视为整数又视为浮点数。在扫描器中,“-”可以被视为相同的标记,而在解析器中,您可以将其视为不同的标记。

为了解决这个问题,我最喜欢的方法是在扫描/词法分析器过程中使用“通用令牌”,然后在解析/语法过程中将它们转换为“自定义令牌”。

快速回答:

正如前面的答案中提到的,从制作语法开始,事实上,我建议在白板或笔记本中尝试,然后在您最喜欢的(ANTLRL,其他)扫描工具中尝试。

考虑那些特殊情况,其中可能存在一些令牌重叠。

祝你好运。

[From a guy who make a custom lexer tool, and still trying to learn ANTLR]

Boring extensive answer:

You are right. Many books & courses mix both tools. And sometimes "generating/detecting tokens" and "interpreting tokens" may mix.

Sometimes, a developer is trying to do a scanner, and still, mixes scanning & parsing in its mind ;-)

Usually, when detecting tokens, you also have to do an action ("interpretation"), as simple, as printing a message or the found token to string.
Example: "{ cout << "Hey, I found a integer constant" << "\n" }"

There are also several cases that may make scanning difficult for a begginner in the topic.

One case is that several text may be used for different tokens.

Example:

"-" as the substraction binary operator, and "-" as the negative prefix operator.
Or, treating 5 both as an integer and a float. In scanners, "-" can be seen as the same token, while in parsers, you may treat it as different tokens.

In order to fix this, my favorite approach its to use "generic tokens", in the scanning/lexer process, and later, convert them as "custom tokens" in the parsing/syntax process.

Quick answer:

As mentioned in previous answers, start with making a grammar, in fact, I suggest try it in a whiteboard or notebook, and later in your favorite (ANTLRL, other) scanning tool.

Consider those special cases, where there could be some token overlappings.

Good Luck.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文