Scanner (lexing keywords with ANTLR)
I have been working on writing a scanner for my program, and most of the tutorials online include a parser along with the scanner. It doesn't seem possible to write a lexer without writing a parser at the same time. I am only trying to generate tokens, not interpret them. I want to recognize INT tokens, float tokens, and some keyword tokens like "begin" and "end".
I am confused about how to match keywords. I unsuccessfully tried the following:
KEYWORD : KEY1 | KEY2;
KEY1 : {input.LT(1).getText().equals("BEGIN")}? LETTER+ ;
KEY2 : {input.LT(1).getText().equals("END")}? LETTER+ ;

FLOATLITERAL_INTLITERAL
    : DIGIT+
      (
        { input.LA(2) != '.' }? => '.' DIGIT* { $type = FLOATLITERAL; }
      | { $type = INTLITERAL; }
      )
    | '.' DIGIT+ {$type = FLOATLITERAL}
    ;

fragment LETTER : ('a'..'z' | 'A'..'Z');
fragment DIGIT : ('0'..'9');

IDENTIFIER
    : LETTER
    | LETTER DIGIT (LETTER|DIGIT)+
    | LETTER LETTER (LETTER|DIGIT)*
    ;

WS //Whitespace
    : (' ' | '\t' | '\n' | '\r' | '\f')+ {$channel = HIDDEN;}
    ;
If you only want a lexer, start your grammar with a lexer grammar declaration rather than a combined grammar declaration.

LT(int): Token can only be used inside parser rules (on a TokenStream). Inside lexer rules, you can only use LA(int): int, which gets the next int (character) from the IntStream. But there is no need for all the manual lookahead. Just write a plain lexer rule for each keyword (a BEGIN rule and an END rule).

I don't see the need to create a token called KEYWORD that matches all keywords: you'll want to make a distinction between a BEGIN and an END token, right? But if you really want this, simply define a single KEYWORD rule that matches both keyword literals and remove the BEGIN and END rules. Just make sure KEYWORD is defined before IDENTIFIER.

EDIT

Test the lexer with a small Main class: generate the lexer, compile the .java source files, and run the Main class. The output is printed to the console.
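As a rough illustration of the advice in this answer, a lexer-only grammar might look like the sketch below. It assumes ANTLR 3 syntax (as used in the question). The grammar name FooLexer, the token names FLOATLITERAL and INTLITERAL (borrowed from the question), and the exact rule bodies are assumptions for illustration, not the answerer's original listing, and the grammar has not been run through the tool here.

```
lexer grammar FooLexer;   // hypothetical name; "lexer grammar" means no parser is generated

// One rule per keyword, listed before IDENTIFIER so that the input
// "BEGIN" or "END" becomes a keyword token rather than an identifier.
BEGIN : 'BEGIN';
END   : 'END';

// Alternative (assumption): a single rule instead of BEGIN and END.
// It must also appear before IDENTIFIER.
// KEYWORD : 'BEGIN' | 'END';

FLOATLITERAL
    : DIGIT+ '.' DIGIT*
    | '.' DIGIT+
    ;

INTLITERAL : DIGIT+ ;

// A letter followed by any mix of letters and digits.
IDENTIFIER : LETTER (LETTER | DIGIT)* ;

// Whitespace is sent to the hidden channel (ANTLR 3 action syntax).
WS  : (' ' | '\t' | '\n' | '\r' | '\f')+ { $channel = HIDDEN; } ;

fragment LETTER : 'a'..'z' | 'A'..'Z' ;
fragment DIGIT  : '0'..'9' ;
```

No predicates or calls to LT or LA are needed in this sketch: the lexer decides between a keyword and an identifier from the characters alone, and rule order settles the case where both would match the same text.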
[From someone who built a custom lexer tool and is still trying to learn ANTLR]
Boring extensive answer:
You are right. Many books and courses mix both tools, and sometimes "generating/detecting tokens" and "interpreting tokens" get mixed together.
Sometimes a developer is trying to write a scanner and still mixes scanning and parsing in their mind ;-)
Usually, when detecting tokens, you also have to perform an action ("interpretation"), even one as simple as printing a message or the found token as a string.
Example: "{ cout << "Hey, I found an integer constant" << "\n" }"
There are also several cases that may make scanning difficult for a beginner in the topic.
One case is that the same text may be used for different tokens.
Example:
"-" as the binary subtraction operator, and "-" as the negative prefix operator.
Or treating 5 as both an integer and a float. In scanners, "-" can be seen as the same token, while in parsers you may treat it as different tokens.
To handle this, my favorite approach is to use "generic tokens" in the scanning/lexer process and later convert them to "custom tokens" in the parsing/syntax process.
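The "generic token" idea can be sketched in an ANTLR 3 combined grammar as follows. This is only an illustration under assumptions: the grammar name MinusDemo and the rule names expr, unary, MINUS, and NUMBER are made up for the example and do not come from the answer.

```
grammar MinusDemo;   // hypothetical combined grammar (lexer + parser)

// Parser rules: the same MINUS token is read as subtraction or as
// negation depending on where it appears.
expr
    : unary (MINUS unary)*   // MINUS between operands: binary subtraction
    ;

unary
    : MINUS unary            // MINUS in front of an operand: prefix negation
    | NUMBER
    ;

// Lexer rules: every '-' becomes the same generic token.
MINUS  : '-' ;
NUMBER : ('0'..'9')+ ;
WS     : (' ' | '\t' | '\r' | '\n')+ { $channel = HIDDEN; } ;
```

The lexer never has to decide what a "-" means. For an input such as 3 - -4 it produces NUMBER MINUS MINUS NUMBER, and the parser rules assign the first MINUS to subtraction and the second to negation.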
Quick answer:
As mentioned in previous answers, start by writing a grammar. In fact, I suggest trying it on a whiteboard or in a notebook first, and later in your favorite scanning tool (ANTLR or another).
Consider the special cases where tokens could overlap.
Good luck.