ANTLR: What is the simplest way to implement a Python-like indentation-dependent grammar?
I am trying to implement a Python-like, indentation-dependent grammar.
Source example:
ABC QWE
  CDE EFG
  EFG CDE
    ABC
  QWE ZXC
As I see it, what I need is to implement two tokens, INDENT and DEDENT, so that I could write something like:
grammar mygrammar;
text: (ID | block)+;
block: INDENT (ID|block)+ DEDENT;
INDENT: ????;
DEDENT: ????;
Is there any simple way to implement this using ANTLR?
(I'd prefer, if possible, to use the standard ANTLR lexer.)
Comments (4)
I don't know what the easiest way to handle it is, but the following is a relatively easy way. Whenever you match a line break in your lexer, optionally match one or more spaces. If there are spaces after the line break, compare the length of these spaces with the current indent size. If it is more than the current indent size, emit an Indent token; if it is less than the current indent size, emit a Dedent token; and if it is the same, don't do anything.
You'll also want to emit a number of Dedent tokens at the end of the file so that every Indent has a matching Dedent token.
For this to work properly, you must add a leading and trailing line break to your input source file!
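The idea described above can be sketched outside ANTLR in a few lines of Python. This is only an illustration under stated assumptions, not the demo grammar from this answer: the function name `tokenize`, the `(type, text)` tuple shape, and the choice to treat every whitespace-separated word as an `ID` are all made up for the example. It also keeps a stack of indent widths rather than a single "current indent size", so that one line can close several blocks at once.

```python
from typing import Iterator, List, Tuple

LineToken = Tuple[str, str]  # (token type, text); hypothetical shape for this sketch


def tokenize(source: str) -> Iterator[LineToken]:
    """Yield ID, INDENT and DEDENT tokens for a space-indented source."""
    indents: List[int] = [0]  # stack of currently open indentation widths
    for line in source.splitlines():
        stripped = line.lstrip(" ")  # assumption: only spaces count as indentation
        if not stripped:
            continue  # blank lines do not change the indentation level
        width = len(line) - len(stripped)
        if width > indents[-1]:
            # deeper than the enclosing block: open a new one
            indents.append(width)
            yield ("INDENT", "")
        while width < indents[-1]:
            # shallower: close blocks until we are back at a known level
            indents.pop()
            yield ("DEDENT", "")
        if width != indents[-1]:
            raise SyntaxError(f"inconsistent dedent to column {width}")
        for word in stripped.split():
            yield ("ID", word)
    # at end of input, close every block that is still open
    while len(indents) > 1:
        indents.pop()
        yield ("DEDENT", "")


if __name__ == "__main__":
    sample = "aaa\n  bbb\n    ccc\naaa\n"
    print([kind for kind, _ in tokenize(sample)])
    # ['ID', 'INDENT', 'ID', 'INDENT', 'ID', 'DEDENT', 'DEDENT', 'ID']
```

Because the sketch works line by line, it does not need the leading and trailing line breaks that the lexer-rule approach relies on; that difference is a property of the sketch, not of ANTLR.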
ANTLR3
A quick demo:
You can test the parser with the class:
If you now put the following in a file called in.txt (note the leading and trailing line breaks!):
then you'll see output that corresponds to the following AST:
Note that my demo wouldn't produce enough dedents in succession, like dedenting from ccc to aaa (2 dedent tokens are needed).
You would need to adjust the code inside else if(n < previousIndents) { ... } to possibly emit more than 1 dedent token based on the difference between n and previousIndents. Off the top of my head, that could look like this:
ANTLR4
For ANTLR4, do something like this:
Taken from: https://github.com/antlr/grammars-v4/blob/master/python3/Python3.g4
There is an open-source library antlr-denter for ANTLR v4 that helps parse indents and dedents for you. Check out its README for how to use it.
Since it is a library, rather than code snippets to copy-and-paste into your grammar, its indentation-handling can be updated separately from the rest of your grammar.
There is a relatively simple way to do this in ANTLR, which I wrote as an experiment: DentLexer.g4. This solution is different from the others mentioned on this page that were written by Kiers and Shavit. It integrates with the runtime solely via an override of the Lexer's nextToken() method. It does its work by examining tokens: (1) a NEWLINE token triggers the start of a "keep track of indentation" phase; (2) whitespace and comments, both set to channel HIDDEN, are counted and ignored, respectively, during that phase; and (3) any non-HIDDEN token ends the phase. Thus controlling the indentation logic is a simple matter of setting a token's channel.
Both of the solutions mentioned on this page require a NEWLINE token to also grab all the subsequent whitespace, but in doing so can't handle multi-line comments interrupting that whitespace. Dent, instead, keeps NEWLINE and whitespace tokens separate and can handle multi-line comments.
Your grammar would be set up something like below. Note that the NEWLINE and WS lexer rules have actions that control the pendingDent state and keep track of indentation level with the indentCount variable.
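To make the three phases concrete, here is a rough Python sketch of the same token-filtering idea. It is not DentLexer.g4 and does not use the ANTLR runtime: the `Token` tuple, the channel constants, and the `add_dents` function are hypothetical stand-ins, and the names `pending_dent` and `indent_count` merely mirror the variables mentioned above. It assumes whitespace arrives as hidden `WS` tokens and comments as hidden `COMMENT` tokens.

```python
from typing import Iterable, Iterator, List, NamedTuple

DEFAULT, HIDDEN = 0, 1  # illustrative channel numbers for this sketch


class Token(NamedTuple):
    type: str
    text: str = ""
    channel: int = DEFAULT


def add_dents(tokens: Iterable[Token]) -> Iterator[Token]:
    """Insert INDENT/DEDENT tokens into a stream of already-lexed tokens."""
    indents: List[int] = [0]  # stack of open indentation widths
    pending_dent = True       # True while a line's indentation is being measured
    indent_count = 0          # whitespace width seen so far on the current line
    for tok in tokens:
        if tok.type == "NEWLINE":
            # (1) a newline starts the "keep track of indentation" phase
            pending_dent, indent_count = True, 0
        elif tok.channel == HIDDEN:
            # (2) hidden whitespace is counted; hidden comments are ignored
            if pending_dent and tok.type == "WS":
                indent_count += len(tok.text)
        elif pending_dent:
            # (3) the first visible token ends the phase
            pending_dent = False
            if indent_count > indents[-1]:
                indents.append(indent_count)
                yield Token("INDENT")
            while indent_count < indents[-1]:
                indents.pop()
                yield Token("DEDENT")
            # note: a dedent to a width that was never opened is not reported here
        yield tok
    # close any blocks still open at end of input
    while len(indents) > 1:
        indents.pop()
        yield Token("DEDENT")
```

Feeding it `ID NEWLINE WS COMMENT ID NEWLINE ID`, with the `WS` and `COMMENT` tokens on the hidden channel, yields an `INDENT` before the second `ID` and a `DEDENT` before the third, even though a comment sits between the whitespace and the visible token. How whitespace that follows a multi-line comment should be counted is a design decision this sketch glosses over by simply summing all `WS` tokens on the line.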
Have you looked at the Python ANTLR grammar?
Edit: Added pseudo Python code for creating INDENT/DEDENT tokens