Python 风格缩进的 PEG
您如何在以下任何解析器生成器中编写解析表达式语法(PEG.js, 柑橘, Treetop) 可以处理 Python/Haskell/CoffeScript 风格的缩进:
not 的示例- 现有的编程语言:
square x =
x * x
cube x =
x * square x
fib n =
if n <= 1
0
else
fib(n - 2) + fib(n - 1) # some cheating allowed here with brackets
更新: 不要尝试为上面的示例编写解释器。我只对缩进问题感兴趣。另一个例子可能是解析以下内容:
foo
bar = 1
baz = 2
tap
zap = 3
# should yield (ruby style hashmap):
# {:foo => { :bar => 1, :baz => 2}, :tap => { :zap => 3 } }
How would you write a Parsing Expression Grammar in any of the following Parser Generators (PEG.js, Citrus, Treetop) which can handle Python/Haskell/CoffeScript style indentation:
Examples of a not-yet-existing programming language:
square x =
x * x
cube x =
x * square x
fib n =
if n <= 1
0
else
fib(n - 2) + fib(n - 1) # some cheating allowed here with brackets
Update:
Don't try to write an interpreter for the examples above. I'm only interested in the indentation problem. Another example might be parsing the following:
foo
bar = 1
baz = 2
tap
zap = 3
# should yield (ruby style hashmap):
# {:foo => { :bar => 1, :baz => 2}, :tap => { :zap => 3 } }
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(5)
纯 PEG 无法解析缩进。
但 peg.js 可以。
我做了一个快速而肮脏的实验(受到 Ira Baxter 关于作弊的评论的启发)并编写了一个简单的标记器。
如需更完整的解决方案(完整的解析器),请参阅此问题:使用 PEG.js 解析缩进级别
深度
是一堆缩进。 indent() 返回一个缩进标记数组,start() 解包该数组以使解析器的行为有点像流。peg.js 为文本生成:
这些结果:
此标记生成器甚至可以捕获错误的缩进。
Pure PEG cannot parse indentation.
But peg.js can.
I did a quick-and-dirty experiment (being inspired by Ira Baxter's comment about cheating) and wrote a simple tokenizer.
For a more complete solution (a complete parser) please see this question: Parse indentation level with PEG.js
depths
is a stack of indentations. indent() gives back an array of indentation tokens and start() unwraps the array to make the parser behave somewhat like a stream.peg.js produces for the text:
these results:
This tokenizer even catches bad indents.
我认为像这样的缩进敏感语言是上下文敏感的。我相信 PEG 只能处理上下文无关的语言。
请注意,虽然 nalply 的答案肯定是正确的,即 PEG.js 可以通过外部状态(即可怕的全局变量)来做到这一点,但它可能是一条危险的道路(比全局变量的常见问题更糟糕)。某些规则最初可以匹配(然后运行其操作),但父规则可能会失败,从而导致操作运行无效。如果在此类操作中外部状态发生更改,您可能会得到无效状态。这是非常可怕的,可能会导致颤抖、呕吐和死亡。一些问题和解决方案位于此处的评论中: https://github.com/dmajda/pegjs /问题/45
I think an indentation-sensitive language like that is context-sensitive. I believe PEG can only do context-free langauges.
Note that, while nalply's answer is certainly correct that PEG.js can do it via external state (ie the dreaded global variables), it can be a dangerous path to walk down (worse than the usual problems with global variables). Some rules can initially match (and then run their actions) but parent rules can fail thus causing the action run to be invalid. If external state is changed in such an action, you can end up with invalid state. This is super awful, and could lead to tremors, vomiting, and death. Some issues and solutions to this are in the comments here: https://github.com/dmajda/pegjs/issues/45
因此,我们在这里使用缩进真正做的是创建类似 C 风格块的东西,这些块通常有自己的词法范围。如果我正在为这样的语言编写编译器,我想我会尝试让词法分析器跟踪缩进。每次缩进增加时,它都可以插入一个“{”标记。同样,每次减少时,它都可以插入一个“}”标记。然后编写带有显式花括号的表达式语法来表示词法范围变得更加直接。
So what we are really doing here with indentation is creating something like a C-style blocks which often have their own lexical scope. If I were writing a compiler for a language like that I think I would try and have the lexer keep track of the indentation. Every time the indentation increases it could insert a '{' token. Likewise every time it decreases it could inset an '}' token. Then writing an expression grammar with explicit curly braces to represent lexical scope becomes more straight forward.
您可以在 Treetop 中使用语义谓词来执行此操作。在这种情况下,您需要一个语义谓词来检测由于出现具有相同或更少缩进的另一行而关闭空白缩进块。谓词必须计算从起始行开始的缩进,如果当前行的缩进已完成相同或更短的长度,则返回 true(块已关闭)。因为结束条件是依赖于上下文的,所以不能记住它。
这是我要添加到 Treetop 文档中的示例代码。请注意,我已经重写了 Treetop 的 SyntaxNode 检查方法,以便更轻松地可视化结果。
这是一个小测试驱动程序,您可以轻松尝试:
You can do this in Treetop by using semantic predicates. In this case you need a semantic predicate that detects closing a white-space indented block due to the occurrence of another line that has the same or lesser indentation. The predicate must count the indentation from the opening line, and return true (block closed) if the current line's indentation has finished at the same or shorter length. Because the closing condition is context-dependent, it must not be memoized.
Here's the example code I'm about to add to Treetop's documentation. Note that I've overridden Treetop's SyntaxNode inspect method to make it easier to visualise the result.
Here's a little test driver program so you can try it easily:
我知道这是一个旧线程,但我只是想在答案中添加一些 PEGjs 代码。这段代码将解析一段文本并将其“嵌套”到一种“AST-ish”结构中。它只深入一层,看起来很丑,而且它并没有真正使用返回值来创建正确的结构,而是在内存中保留语法树,并在最后返回该树。这很可能变得笨拙并导致一些性能问题,但至少它做了它应该做的事情。
注意:确保使用制表符而不是空格!
I know this is an old thread, but I just wanted to add some PEGjs code to the answers. This code will parse a piece of text and "nest" it into a sort of "AST-ish" structure. It only goes one deep and it looks ugly, furthermore it does not really use the return values to create the right structure but keeps an in-memory tree of your syntax and it will return that at the end. This might well become unwieldy and cause some performance issues, but at least it does what it's supposed to.
Note: Make sure you have tabs instead of spaces!