Writing a compiler in its own language
Intuitively, it would seem that a compiler for language Foo cannot itself be written in Foo. More specifically, the first compiler for language Foo cannot be written in Foo, but any subsequent compiler could be written in Foo.

But is this actually true? I have some very vague recollection of reading about a language whose first compiler was written in "itself". Is this possible, and if so how?
This is called "bootstrapping". You must first build a compiler (or interpreter) for your language in some other language (usually Java or C). Once that is done, you can write a new version of the compiler in language Foo. You use the first bootstrap compiler to compile the compiler, and then use this compiled compiler to compile everything else (including future versions of itself).
Most languages are indeed created in this fashion, partially because language designers like to use the language they are creating, and also because a non-trivial compiler often serves as a useful benchmark for how "complete" the language may be.
An example of this is Scala. Its first compiler was created in Pizza, an experimental language by Martin Odersky. As of version 2.0, the compiler was completely rewritten in Scala. From that point on, the old Pizza compiler could be discarded, because the new Scala compiler could be used to compile itself for future iterations.
I recall listening to a Software Engineering Radio podcast wherein Dick Gabriel spoke about bootstrapping the original LISP interpreter by writing a bare-bones version in LISP on paper and hand assembling it into machine code. From then on, the rest of the LISP features were both written in and interpreted with LISP.
When you write your first compiler for C, you write it in some other language. Now you have a compiler for C in, say, assembler. Eventually, you will come to the place where you have to parse strings, specifically escape sequences. You will write code to convert \n to the character with the decimal code 10 (and \r to 13, etc.). After that compiler is ready, you will start to reimplement it in C. This process is called "bootstrapping".
The string parsing code will become:
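The snippet the answer refers to is not shown on this page; a minimal sketch of what that first C version might look like (the function name `escape` is illustrative, following Ken Thompson's "Reflections on Trusting Trust"):

```c
/* Escape-sequence handling in the first C version of the compiler.
 * The compiler translating this code was written in assembler, so
 * the character codes must be spelled out as plain numbers. */
int escape(int c)
{
    if (c == 'n')
        return 10;   /* newline */
    if (c == 'r')
        return 13;   /* carriage return */
    return c;        /* not a known escape: pass the character through */
}
```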
When this compiles, you have a binary which understands '\n'. This means you can change the source code:
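The rewritten source is also missing here; a sketch of the self-referential version (same illustrative `escape` function as above):

```c
/* The binary produced from the previous version already understands
 * '\n', so the source may now define the escapes in terms of
 * themselves; the knowledge that '\n' is 10 lives only in the binary. */
int escape(int c)
{
    if (c == 'n')
        return '\n';
    if (c == 'r')
        return '\r';
    return c;
}
```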
So where is the information that '\n' is the code for 13? It's in the binary! It's like DNA: Compiling C source code with this binary will inherit this information. If the compiler compiles itself, it will pass this knowledge on to its offspring. From this point on, there is no way to see from the source alone what the compiler will do.
If you want to hide a virus in the source of some program, you can do it like this: Get the source of a compiler, find the function which compiles functions and replace it with this one:
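The replacement function is not shown in the text; a pseudocode sketch of the idea (all names here are hypothetical), following Thompson's paper:

```
compileFunction(f):
    if source of f matches A:        # A = this very function, virus included
        compile A instead            # the virus re-inserts itself
    else if source of f matches B:   # B = e.g. login() from login.c
        compile the backdoored login # also accepts password "joshua" for root
    else:
        compile f normally
```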
The interesting parts are A and B. A is the source code for compileFunction, including the virus, probably encrypted in some way so it's not obvious from searching the resulting binary. This makes sure that compiling the compiler with itself will preserve the virus-injection code. B is the same for the function we want to replace with our virus. For example, it could be the function "login" in the source file "login.c", which probably comes from the Linux kernel. We could replace it with a version that accepts the password "joshua" for the root account in addition to the normal password.
If you compile that and spread it as a binary, there will be no way to find the virus by looking at the source.
The original source of the idea: https://web.archive.org/web/20070714062657/http://www.acm.org/classics/sep95/
Adding a curiosity to the previous answers.
Here's a quote from the Linux From Scratch manual, at the step where one starts building the GCC compiler from its source. (Linux From Scratch is a way to install Linux that is radically different from installing a distribution, in that you have to compile really every single binary of the target system.)
That use of the 'bootstrap' target is motivated by the fact that the compiler one uses to build the target system's toolchain may not be the very same version as the target compiler. Proceeding in that way, one is sure to obtain, in the target system, a compiler that can compile itself.
You cannot write a compiler in itself because you have nothing to compile your starting source code with. There are two approaches to solving this.
The least favored is the following. You write a minimal compiler in assembler (yuck) for a minimal set of the language and then use that compiler to implement extra features of the language. Building your way up until you have a compiler with all the language features for itself. A painful process that is usually only done when you have no other choice.
The preferred approach is to use a cross compiler. You change the back end of an existing compiler on a different machine to create output that runs on the target machine. Then you have a nice full compiler up and working on the target machine. Most popular for this is the C language, as there are plenty of existing compilers that have pluggable back ends that can be swapped out.
A little-known fact is that the GNU C++ compiler has an implementation that uses only the C subset. The reason is that it is usually easy to find a C compiler for a new target machine, which then allows you to build the full GNU C++ compiler from it. You have now bootstrapped yourself to having a C++ compiler on the target machine.
Generally, you need to have a working (if primitive) cut of the compiler first; then you can start thinking about making it self-hosting. This is actually considered an important milestone in some languages.
From what I remember from "mono", it is likely they will need to add a few things to reflection to get it working: the mono team keeps pointing out that some things simply aren't possible with Reflection.Emit; of course, the MS team might prove them wrong. This has a few real advantages: it is a fairly good unit test, for starters! And you only have one language to worry about (i.e. it is possible a C# expert might not know much C++; but now they can fix the C# compiler). But I wonder if there isn't an amount of professional pride at work here: they simply want it to be self-hosting.
Not quite a compiler, but I've recently been working on a system that is self hosting; the code generator is used to generate the code generator... so if the schema changes I simply run it on itself : new version. If there is a bug, I just go back to an earlier version and try again. Very convenient, and very easy to maintain.
Update 1
I've just watched this video of Anders at PDC, and (about an hour in) he does give some much more valid reasons - all about the compiler as a service. Just for the record.
I wrote SLIC (System of Languages for Implementing Compilers) in itself, then hand-compiled it into assembly. There is a lot to SLIC, as it was a single compiler for five sub-languages: SYNTAX, GENERATOR, ISO, PSEUDO, and MACHOP.
SLIC was inspired by CWIC (Compiler for Writing and Implementing Compilers). Unlike most compiler development packages, SLIC and CWIC addressed code generation with specialized, domain-specific languages. SLIC extends CWIC's code generation, adding the ISO, PSEUDO and MACHOP sub-languages to separate target-machine specifics out of the tree-crawling generator language.
LISP 2 trees and lists
The LISP 2 based generator language's dynamic memory management system is a key component. Lists are expressed in the language enclosed in square brackets, their components separated by commas, i.e. a three-element list [a,b,c].
Trees are represented by lists whose first entry is a node object: an ADD tree with branches x and y is the list [ADD,x,y]. Trees are commonly displayed with the node separate, preceding the branches: ADD[x,y].
Unparsing with LISP 2 based generator functions
A generator function is a named set of (unparse) => action pairs.
Unparse expressions are tests that match tree patterns and/or object types, breaking them apart and assigning those parts to local variables to be processed by the procedural action. They are somewhat like an overloaded function taking different argument types, except that the (unparse) => ... tests are attempted in the order coded; the first successful unparse executes its corresponding action. The unparse expressions are disassembling tests: ADD[x,y] matches a two-branch ADD tree, assigning its branches to local variables x and y. The action may be a simple expression or a .BEGIN ... .END bounded code block. (I would use c style { ... } blocks today.) Tree-matching, [], unparse rules may call generators, passing the returned result(s) to the action:
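The generator example itself is missing from this page; based on the description that follows, it would have looked roughly like this (a sketch in CWIC/SLIC-style notation, reconstructed rather than quoted):

```
expr_gen( ADD[expr_gen(x),expr_gen(y)] ) => x+y;
        ( SUB[expr_gen(x),expr_gen(y)] ) => x-y;
```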
Specifically, the above expr_gen unparse matches a two-branch ADD tree. Within the test pattern, a single-argument generator placed in a tree branch will be called with that branch; its argument list, though, is local variables assigned the returned objects. The unparse above specifies disassembly of a two-branch ADD tree, recursively passing each branch to expr_gen. The left branch's return is placed into the local variable x; likewise the right branch is passed to expr_gen with y as the return object. The above could be part of a numeric expression evaluator. There were shortcut features called vectors: in the above, instead of the node string, a vector of nodes could be used with a vector of corresponding actions:
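The vectored example is also missing; a sketch of the form described next, where the #-vector notation is my reconstruction and is not attested in the text:

```
expr_gen( #node[expr_gen(x),expr_gen(y)] ) => #action;
          #node   = ADD, SUB, MPY, DIV;
          #action = x+y, x-y, x*y, x/y;
        ( NUMBER(n) ) => n;
        ( SYMBOL(s) ) => val:(s);
```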
The above, a more complete expression evaluator, assigns the return from expr_gen's left branch to x and the right branch to y, returning the corresponding action performed on x and y. The last unparse => action pairs match numeric and symbol objects.
Symbol and symbol attributes
Symbols may have named attributes. val:(x) accesses the val attribute of the symbol object contained in x. A generalized symbol table stack is part of SLIC. The SYMBOL table may be pushed and popped, providing local symbols for functions. Newly created symbols are cataloged in the top symbol table. Symbol lookup searches the symbol table stack from the top table first, backward down the stack.
Generating machine independent code
SLIC's generator language produces PSEUDO instruction objects, appending them to a section's code list. A .FLUSH causes its PSEUDO code list to be run, removing each PSEUDO instruction from the list and calling it. After execution, a PSEUDO object's memory is released. The procedural bodies of PSEUDOs and GENERATOR actions are basically the same language, except for their output. PSEUDOs are meant to act as assembly macros, providing machine-independent code sequencing. They provide a separation of the specific target machine out of the tree-crawling generator language. PSEUDOs call MACHOP functions to output machine code. MACHOPs are used to define assembly pseudo-ops (like dc, define constant, etc.) and machine instructions, or a family of similarly formatted instructions using vectored entry. They simply transform their parameters into a sequence of bit fields making up the instruction. MACHOP calls are meant to look like assembly, and they provide print formatting of the fields for when assembly is shown in the compile listing. In the example code I am using c style commenting, which could easily have been added but was not in the original languages. MACHOPs produce code into a bit-addressable memory. The SLIC linker handles the output of the compiler. A MACHOP for the DEC-10 user-mode instructions using vectored entry:
The .MORG 36, O(18): $/36; aligns the location to a 36-bit boundary, printing the location (the $/36 word address of 18 bits) in octal. The 9-bit opcd, 4-bit register, indirect bit and 4-bit index register are combined and printed as if a single 18-bit field. The 18-bit address/36 or immediate value is output and printed in octal. A MOVEI example prints out, with r1 = 1 and r2 = 2:
With the compiler assembly option you get the generated assembly code in the compile listing.
Link it together
The SLIC linker is supplied as a library that handles the linking and symbol resolution. Target-specific output load-file formatting, though, must be written for target machines and linked with the linker library.
The generator language is capable of writing trees to a file and reading them allowing a multipass compiler to be implemented.
Short summary of code generation and origins
I have gone over the code generation first to ensure it is understood that SLIC was a true compiler compiler. SLIC was inspired by CWIC (Compiler for Writing and Implementing Compilers), developed at Systems Development Corporation in the late 1960s. CWIC had only the SYNTAX and GENERATOR languages, producing numeric byte code out of the GENERATOR language. Byte code was placed, or planted (the term used in CWIC's documentation), into memory buffers associated with named sections and written out by a .FLUSH statement. An ACM paper on CWIC is available from the ACM archives.
Successfully implementing a major programming language
In the late 1970s, SLIC was used to write a COBOL cross compiler, completed in about 3 months, mostly by a single programmer. I worked a bit with the programmer as needed. Another programmer wrote the runtime library and MACHOPs for the target TI-990 minicomputer. That COBOL compiler compiled substantially more lines per second than the DEC-10 native COBOL compiler written in assembly.
More to a compiler than is usually talked about
A big part of writing a compiler from scratch is the runtime library. You need a symbol table. You need input and output. Dynamic memory management, etc. It can easily be more work writing the runtime library for a compiler than writing the compiler itself. But with SLIC, that runtime library is common to all compilers developed in SLIC. Note that there are two runtime libraries: one for the language's (COBOL, for example) target machine, the other the compiler compiler's runtime library.
I think I have established that these were not parser generators. So now, with a little understanding of the back end, I can explain the parser programming language.
Parser programming language
The parser is written using formulas in the form of simple equations.
The language element at the lowest level is the character. Tokens are formed from subsets of the characters of the language. Character classes are used to name and define those character subsets. The character-class defining operator is the colon (:) character. Characters that are members of the class are coded on the right side of the definition. Printable characters are enclosed in single-prime ' strings. Non-printing and special characters may be represented by their numeric ordinal. Class members are separated by the alternative | operator. A class formula ends with a semicolon. Character classes may include previously defined classes:
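The class formulas themselves are missing from this page; a sketch of what they typically looked like (the bit-mask comments are illustrative, derived from the allocation scheme described next):

```
bin: '1'|'0';                        // 0b00000010
oct: bin|'2'|'3'|'4'|'5'|'6'|'7';    // 0b00000110
dgt: oct|'8'|'9';                    // 0b00001110
hex: dgt|'A'|'B'|'C'|'D'|'E'|'F'
        |'a'|'b'|'c'|'d'|'e'|'f';    // 0b00011110
```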
The skip_class 0b00000001 is predefined, but it may be overridden by defining a skip_class.
In summary: a character class is a list of alternatives that can only be a character constant, a character's ordinal, or a previously defined character class. As I implemented character classes, each class formula is assigned a class bit mask (shown in comments above). Any class formula having any character literal or ordinal causes a class bit to be allocated. A mask is made by ORing the included classes' class masks together with the allocated bit (if any). A class table is created from the character classes; an entry indexed by a character's ordinal contains bits indicating the character's class memberships. Class testing is done inline. An IA-86 code example, with the character's ordinal in eax, illustrates class testing:
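The code example is missing here; a sketch of the inline test, where the table name class_map, the label, and the 0Eh mask value for dgt are illustrative:

```
        ; eax holds the character's ordinal; class_map is the class table
        test    byte ptr class_map[eax], 0Eh   ; AND dgt's class mask with
                                               ; the entry, flags only
        jnz     in_class                       ; non-zero: member of dgt
```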
This is followed by a conditional jump: jnz branches when the character is a member of the class, or jz when it is not.
IA-86 instruction code examples are used because I think IA-86 instructions are more widely known today. The class name, evaluating to its class mask, is non-destructively ANDed with the class-table entry indexed by the character's ordinal (in eax). A non-zero result indicates class membership. (EAX is zeroed except for al, the low 8 bits of EAX, which contains the character.)
Tokens were a bit different in these old compilers. Keywords were not treated as tokens; they were simply matched by quoted string constants in the parser language. Quoted strings are not normally kept. Modifiers may be used: a + keeps the matched string (i.e. +'-' matches a - character, keeping the character when successful), and the , operation (i.e. ,'E') inserts the string into the token. White space is handled by the token formulas, which skip leading skip_class characters until a first match is made. Note that an explicit skip_class character match will stop the skipping, allowing a token to start with a skip_class character. The string token formula skips leading skip_class characters, matching a single-quote quoted character or a double-quoted string. Of interest is the matching of a " character within a " quoted string:
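The string formula itself is missing; reassembled from the fragments discussed in the following paragraphs, it would read approximately like this (the .. token-formula operator and the grouping are my assumptions):

```
string .. (''' .ANY '''                       // a single-quote quoted character
          | '"' $(-"""" .ANY | """""","""") '"' // a double-quote quoted string
          ) MAKSTR[];                         // build a string object
```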
The first alternative matches any single-quote quoted character. The right alternative matches a double-quote quoted string that may include double-quote characters, using two " characters together to represent a single " character. This formula defines the strings used in its own definition. The right alternative '"' $(-"""" .ANY | """""","""") '"' matches a double-quote quoted string. We can use a single '-quoted character to match a double-quote " character. However, within the double-quoted string, if we wish to use a " character we must use two " characters to get one. For example, the inner left alternative matches any character except a quote:
A negative peek-ahead, -"""", is used; when it is successful (not matching a " character), .ANY then matches the character (which cannot be a " character, because -"""" eliminated that possibility). When -"""" matches a " character and fails, the right alternative is taken; the right alternative:
tries to match two " characters, replacing them with a single ", using ,"""" to insert the single " character. With both inner alternatives failing, the closing string-quote character is matched and MAKSTR[] is called to create a string object. The $ (sequence: loop while successful) operator is used in matching a sequence. Token formulas skip leading skip-class characters (white space); once a first match is made, skip_class skipping is disabled. We can call functions programmed in other languages using []. MAKSTR[], MAKBIN[], MAKOCT[], MAKHEX[], MAKFLOAT[], and MAKINT[] are supplied library functions that convert a matched token string to a typed object. The number formula below illustrates a fairly complex token recognition:
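The number formula is missing from this page; a simplified sketch consistent with the description that follows (the exact operator spellings are my reconstruction):

```
number .. dgt $dgt
          ( '.' $dgt                          // fraction digits
            ((+'E'|'e','E')                   // exponent lead-in, E kept upper case
             (+'+'|+'-'|--) dgt $dgt | --)    // optional signed exponent
            MAKFLOAT[]                        // floating-point object
          | -- MAKINT[]);                     // no fraction: integer object
```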
The above number token formula recognizes integer and floating-point numbers. The -- alternatives always succeed. Numeric objects may be used in calculations. The token object is pushed onto the parse stack on success of the formula. The exponent lead-in (+'E'|'e','E') is interesting: we wish always to have an uppercase E for MAKFLOAT[], but we allow a lowercase 'e', replacing it using ,'E'.
You may have noticed the consistency of character-class and token formulas. The parsing formulas continue that, adding backtracking alternatives and tree-construction operators. Backtracking and non-backtracking alternative operators may not be mixed within an expression level: you may not write (a | b \ c), mixing the non-backtracking | with the backtracking \ alternative. (a\b\c), (a|b|c) and ((a|b)\c) are valid. A \ backtracking alternative saves the parse state before attempting its left alternative and, on failure, restores the parse state before attempting the right alternative. In a sequence of alternatives, the first successful alternative satisfies the group; further alternatives are not attempted. Factoring and grouping provide for a continuously advancing parse. The backtrack alternative creates a saved state of the parse before it attempts its left alternative. Backtracking is required when the parse may make a partial match and then fail:
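The formula being discussed is missing here; from the description that follows, it must have been essentially:

```
x = (a b | c d) \ e;
```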
In the above, if a returns failure, the alternative c d is attempted. If c then returns failure, the backtrack alternative will be attempted. If a succeeds and b fails, the parse will be backtracked and e attempted. Likewise, with a failing, c successful and d failing, the parse is backtracked and the alternative e taken. Backtracking is not limited to within a formula. If any parsing formula makes a partial match at any time and then fails, the parse is reset to the top backtrack and its alternative taken. A compile failure can occur if code has been output since the backtrack was created. A backtrack is set before starting the compile; returning failure, or backtracking to it, is a compiler failure. Backtracks are stacked. We may use the negative - and positive ? peek/look-ahead operators to test without advancing the parse. A string test is a peek-ahead, needing only the input state saved and reset. A look-ahead is a parsing expression that makes a partial match before failing; look-ahead is implemented using backtracking.
The parser language is neither an LL nor an LR parser, but a programming language for writing a recursive-descent parser in which you program the tree construction:
A commonly used parsing example is an arithmetic expression:
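The grammar itself is missing from this page; a sketch consistent with the next sentence, using CWIC-style :node and !2 tree-building operators (the exact spellings are my assumption):

```
Exp    = Term $(('+':ADD | '-':SUB) Term!2);     // loop: left-handed tree
Term   = Factor $(('*':MPY | '/':DIV) Factor!2); // loop: left-handed tree
Factor = ('(' Exp ')' | number | id)
         ('^' Factor:POW!2 | --);                // right recursion: right-handed tree
```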
Exp and Term, using loops, create left-handed trees. Factor, using right recursion, creates a right-handed tree:
Here is a bit of the cc compiler, an updated version of SLIC with c style comments. Function types (grammar, token, character class, generator, PSEUDO, or MACHOP) are determined by the initial syntax following their id.
With these top-down parsers you start with a program defining formula:
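The program formula is missing here; a sketch of the shape such a top-level formula with error recovery took (ERRORX and the exact recovery idiom are reconstructions, not quoted from the original):

```
program = $((declaration             // a program is a series of declarations
            | .EOF .STOP)            // stop cleanly at end of file
           \ ERRORX["syntax error"]  // on failure: report, then skip
             $(-';' (.ANY | .EOF .STOP)) ';');  // to the next ; and resume
```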
// Note how id is factored off and later combined when creating the tree.
Of note is how the parser language handles commenting and error recovery.
I think I have answered the question, having written a big part of SLIC's successor, the cc language, in itself here. There is no compiler for it as yet, but I can hand-compile it into assembly code, naked asm, or C or C++ functions.
Here's a dump (difficult topic to search on, actually):
Smalltalk
C
This is also the idea behind PyPy and Rubinius.
(I think this might also apply to Forth, but I don't know anything about Forth.)
GNAT, the GNU Ada compiler, requires an Ada compiler to be fully built. This can be a pain when porting it to a platform where there is no GNAT binary readily available.
Actually, most compilers are written in the language they compile, for the reasons stated above.
The first bootstrap compiler is usually written in C, C++ or Assembly.
The Mono project C# compiler has been "self-hosted" for a long time now, what it means is that it has been written in C# itself.
What I know is that the compiler was started as pure C code, but once the "basic" features of ECMA were implemented they started to rewrite the compiler in C#.
I'm not aware of the advantages of writing the compiler in the same language, but I'm sure it has to do at least with the features that the language itself can offer (C, for example, does not support object oriented programming).
You can find more information here.
Yes, you can write a compiler for a language in that language. No, you don't need a first compiler for that language to bootstrap.
What you need to bootstrap is an implementation of the language. That can be either a compiler or an interpreter.
Historically, languages were usually thought of as either interpreted languages or compiled languages. Interpreters were only written for the former and compilers were only written for the latter. So usually if a compiler was going to be written for a language, the first compiler would be written in some other language to bootstrap it, then, optionally, the compiler would be re-written for the subject language. But writing an interpreter in another language instead is an option.
This isn't just theoretical. I happen to currently be doing this myself. I am working on a compiler for a language, Salmon, that I developed myself. I first created a Salmon compiler in C and now I'm writing the compiler in Salmon, so I can get the Salmon compiler working without ever having a compiler for Salmon written in any other language.
Notice that technically you can write a compiler in a language that does not yet exist. In order to do this you create an interpreter for a subset of the original language, which is slow and generally useless, as it interprets each statement of the language before executing anything.

If you read it, it does look completely like the intended language, but its execution goes through a process that converts it into an executable in more than one step.

This compiler is typically horribly slow, as it uses some generic mathematical procedure applicable to almost any existing language, but the advantage is that the next time around you do nothing except use the produced compiler on the existing code. This time, of course, without interpreting it.
Maybe you can write a BNF describing BNF.