解析“越位” (基于缩进的)语言
越位语言是指
...该语言中声明(块)的范围由其缩进表示。
此类语言的示例包括 Python、Boo、Nemerle、YAML 等。
所以我的问题是:我如何实际解析这些?如何解决制表符与空格问题(两个制表符或 8 个空格是否等效)?解析器生成器在这里有什么帮助吗?还是我必须自己手动编写词法分析器/解析器?
An off-side language is the one where
...the scope of declarations (a block) in that language is expressed by their indentation.
Examples of such languages are Python, Boo, Nemerle, YAML and several more.
So my question is this: how do I actually parse these? How do I resolve tabs vs spaces problem (are two tabs or 8 spaces equivalent or not)? Are parser generators of any help here or do I have to hand-code lexer/parser myself?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(7)
Python 有一个词法分析器,可以生成Indent 和Dedent 标记,它们相当于大括号(“{”、“}”)。 Stack Overflow 上甚至还有一个示例,其中这种词法分析器的简单实现。
对于制表符与空格,Python 只有一个编码约定:使用 4 个空格每个缩进级别。不过制表符是合法的语法。
Python has a lexer that generates Indent and Dedent tokens, that are equivalent to curly braces ("{", "}"). There is even an example on Stack Overflow, with a simple implementation of such a lexer.
For tabs vs. spaces, Python only has a coding convention: Use 4 spaces per indentation level. Tabs are legal syntax though.
解决制表符与空格问题的最简单方法是禁止空格和制表符的组合(例如,F# 就是这样做的)。任何现代编辑器都允许将制表符转换为一定数量的空格。
至于你是否需要放弃解析器生成器,可能不需要,但你必须在某个地方破解越位标识。这可能需要您发挥一点创造力。根据浏览 F# 源代码,他们似乎使用词法分析后步骤来创建表示越位语言元素的附加标记。
The easiest way to resolve the tabs versus spaces problem is to disallow combinations of spaces and tabs (this is what's done in F#, for instance). Any modern editor will allow tabs to be converted to some number of spaces.
As for whether you need to abandon parser generators, probably not, but you will have to hack the offsides identification in there somewhere. This may require a bit of creativity on your part. Based on browsing the F# source, it looks like they use a post-lexing step to create additional tokens representing offside language elements.
如果两个制表符等于八个空格,这取决于编辑器的设置。
正如发起者所表达的,越位规则提到了连续两行代码的相对位置,而不是空格的绝对数量。 这里是一篇很好的读物,可以帮助您更好地理解(和一些引用):
It depends on how the editor's settings if two tabs will equal eight spaces.
The off-side rule, as expressed by the originator, mentions relative positioning of two successive lines of code and not the absolute number of whitespaces. Here's a nice read to help you better understand (and some quote):
就其价值而言,Haskell 也是基于缩进的,并且可以选择 { foo;酒吧;等 } 当空白不方便时。 我编写了一个简单的基于缩进的解析器,其中 Parsec,它的目的是读起来像Lisp,但缩进表示运算符应用程序。括号只能在一行中使用。
这里
aaa
应用于bb
。结果是一个三元函数。它应用于参数cc
,e
应用于一个参数,以及ddd
。了解应用程序如何基于列对齐,而不是 X 空格。解析器也可能会简单得多。
For what it's worth, Haskell is also indentation-based and optionally { foo; bar; etc } for when whitespace is inconvenient. I wrote a simple indentation-based parser with Parsec, it's indended to read like Lisp but indentation indicates operator application. Parentheses can be only used on one line.
Here
aaa
is applied tobb
. What results is a ternary function. It is applied to argumentscc
,e
applied to one argument, andddd
. See how the application is based on column alignment, and not X spaces.The parser could probably be a lot simpler, too.
关于制表符和空格,您有多个选项:要么禁止混合制表符和空格,假设制表符与空格的固定比例,要么允许程序员根据每个项目或每个源文件来决定(某种“#pragma tab”) (4)”样式指令允许制表符和/或更改它们代表的空格数)。
像 ANTLR 3 这样的解析器生成器可以轻松应对这个问题;我自己一直在尝试一个示例,编译为其 C# 目标。 DirkGently 的答案中的链接解释了Python算法,该算法将直接写入代码。我的方法只是为空格和换行符定义单独的标记,并覆盖词法分析器使用的“发出标记”函数以动态插入额外的缩进/缩进标记。事实证明,这比我见过的覆盖“获取最后一个令牌”功能的其他方法更容易实现,但两者都效果很好。
You've got several options w.r.t. tabs and spaces: either disallow mixing tabs and spaces, assume a fixed ratio of tabs to spaces, or permit the programmer to decide on a per project or per source file basis (some sort of "#pragma tab(4)" style directive to permit tabs and/or change the number of spaces they represent).
Parser generator such as ANTLR 3 can easily cope with this; I've been toying with an example myself, compiling to its C# target. The link in DirkGently's answer explains the Python algorithm, which translates directly into code. My approach was simply to define separate tokens for whitespace and newlines, and override the "emit token" function used by the lexer to insert additional indent/dedent tokens on the fly. This turned out to be simpler to implement than other approaches I've seen around which override the "get last token" function, but either works pretty well.
我正在开发一个 解析器基于缩进的语言。到目前为止,它涉及大量手动代码。我想尝试做 Python 提到的缩进/缩进的事情,但这有点难。
我将其分为 3 个阶段:
标记化阶段大约有 100 行,加上正则表达式的定义。它会输出一系列有用的标记。
“指导”阶段(或“折叠”阶段)获取令牌并基本上吐出推送和弹出,以创建有关如何制作树数据结构的指令。它本质上是将列表折叠成一棵树。
最后的“树化”阶段采用树指令并实际构建树。事实证明,思考令牌列表如何变成一棵树是相当困难的,需要进行精神体操。我花了整个周末试图让它工作,但仍然有很多方法可以让输出树与推送和弹出正确对齐。
它应该作为如何构建基于缩进的解析器的真实示例,尽管它的代码并不是那么好。
I am working on a parser for an indentation based language. It so far has involved a lot of manual code. I want to try to do the dedent/indent thing that Python was mentioned doing, but it's kind of hard TBH.
I divided it into 3 phases:
The tokenization phase is about 100 lines, plus the definitions of the regexps. It spits out a sequence of helpful tokens.
The "directing" phase (or "folding" phase) takes the tokens and spits out push and pop basically, to create instructions on how to make a tree data structure. It essentially folds the list into a tree.
The final "treeifying" phase takes the tree instructions and actually builds the tree. It turns out to be pretty hard, mentally doing gymnastics to think about how the list of tokens becomes a tree. I spent all weekend trying to get it to work, but still have a ways to go to get the output tree to be properly aligned with push and pop.
It should serve as a real-world example of how to build an indentation-based parser, though it's not that great of code.
我自己得到了一种解决方案,我像分析嵌套树的块一样分析代码。对于括号部分,我只是用了普通的方法。
这是该解析器: https://github.com/jiyinyiyong/cirru-parser
I got one solution by myself, which I analyse the code just like blocks for the nesting tree. For the part of brackets, I just used normal method.
Here's that parser: https://github.com/jiyinyiyong/cirru-parser