当前位置：文江博客话题详情

用自己的语言编写编译器

发布于 2024-07-06 17:02:42 字数 212 浏览 14 评论 0原文

直观上，似乎语言 Foo 的编译器本身不能用 Foo 编写。更具体地说，语言 Foo 的第一个编译器不能用 Foo 编写，但任何后续编译器都可以为 Foo 编写。

但这真的是真的吗？我有一些非常模糊的记忆，读到过一种语言，它的第一个编译器是用“它自己”编写的。这可能吗？如果可能的话，如何实现？

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

内心激荡 2024-07-13 17:02:42

这称为“引导”。您必须首先使用其他语言（通常是 Java 或 C）为您的语言构建编译器（或解释器）。完成后，您可以用 Foo 语言编写新版本的编译器。您使用第一个引导编译器来编译该编译器，然后使用此编译后的编译器来编译其他所有内容（包括其自身的未来版本）。

大多数语言确实是以这种方式创建的，部分原因是语言设计者喜欢使用他们正在创建的语言，而且还因为一个不平凡的编译器通常可以作为衡量语言“完整”程度的有用基准。

Scala 就是一个例子。它的第一个编译器是用 Pizza 创建的，这是 Martin Odersky 的一种实验语言。从 2.0 版本开始，编译器完全用 Scala 重写。从那时起，旧的 Pizza 编译器可以被完全丢弃，因为新的 Scala 编译器可以用于编译自身以供将来的迭代使用。

回复收藏 0 原文

疯了 2024-07-13 17:02:42

我记得听过软件工程广播播客< /a> 其中，Dick Gabriel 谈到通过在纸上用 LISP 编写一个基本版本并手工将其组装成机器代码来引导原始的 LISP 解释器。从那时起，LISP 的其余功能都用 LISP 编写和解释。

回复收藏 0 原文

筱果果 2024-07-13 17:02:42

当您为 C 编写第一个编译器时，您是用其他语言编写的。现在，您有了一个 C 语言编译器，例如汇编器。最终，您将到达必须解析字符串的地方，特别是转义序列。您将编写代码将 \n 转换为十进制代码 10 的字符（将 \r 转换为 13 等）。

编译器准备就绪后，您将开始用 C 语言重新实现它。这个过程称为“bootstrapping< /a>”。

字符串解析代码将变为：

...
if (c == 92) { // backslash
    c = getc();
    if (c == 110) { // n
        return 10;
    } else if (c == 92) { // another backslash
        return 92;
    } else {
        ...
    }
}
...

当编译时，您将拥有一个可以理解“\n”的二进制文件。这意味着你可以更改源代码：

...
if (c == '\\') {
    c = getc();
    if (c == 'n') {
        return '\n';
    } else if (c == '\\') {
        return '\\';
    } else {
        ...
    }
}
...

那么“\n”是13的代码的信息在哪里呢？它在二进制文件中！就像 DNA：用这个二进制文件编译 C 源代码将继承这个信息。如果编译器自行编译，它将把这些知识传递给它的后代。从此时起，无法仅从源代码看出编译器将做什么。

如果你想在某个程序的源代码中隐藏病毒，你可以这样做：获取编译器的源代码，找到编译函数的函数并将其替换为这个：

void compileFunction(char * name, char * filename, char * code) {
    if (strcmp("compileFunction", name) == 0 && strcmp("compile.c", filename) == 0) {
        code = A;
    } else if (strcmp("xxx", name) == 0 && strcmp("yyy.c", filename) == 0) {
        code = B;
    }

    ... code to compile the function body from the string in "code" ...
}

有趣的部分是 A 和 B。是 compileFunction 的源代码，包括病毒，可能以某种方式加密，因此通过搜索生成的二进制文件并不明显。这可以确保编译到编译器本身将保留病毒注入代码。

B 与我们要用病毒替换的功能相同。例如，它可能是源文件“login.c”中的函数“login”，该文件可能来自 Linux 内核。我们可以将其替换为除了普通密码之外还接受 root 帐户密码“joshua”的版本。

如果您将其编译并以二进制形式传播，则无法通过查看源代码来找到病毒。

这个想法的原始来源： https:// /web.archive.org/web/20070714062657/http://www.acm.org/classics/sep95/

When you write your first compiler for C, you write it in some other language. Now, you have a compiler for C in, say, assembler. Eventually, you will come to the place where you have to parse strings, specifically escape sequences. You will write code to convert \n to the character with the decimal code 10 (and \r to 13, etc).

After that compiler is ready, you will start to reimplement it in C. This process is called "bootstrapping".

The string parsing code will become:

...
if (c == 92) { // backslash
    c = getc();
    if (c == 110) { // n
        return 10;
    } else if (c == 92) { // another backslash
        return 92;
    } else {
        ...
    }
}
...

When this compiles, you have a binary which understands '\n'. This means you can change the source code:

...
if (c == '\\') {
    c = getc();
    if (c == 'n') {
        return '\n';
    } else if (c == '\\') {
        return '\\';
    } else {
        ...
    }
}
...

So where is the information that '\n' is the code for 13? It's in the binary! It's like DNA: Compiling C source code with this binary will inherit this information. If the compiler compiles itself, it will pass this knowledge on to its offspring. From this point on, there is no way to see from the source alone what the compiler will do.

If you want to hide a virus in the source of some program, you can do it like this: Get the source of a compiler, find the function which compiles functions and replace it with this one:

void compileFunction(char * name, char * filename, char * code) {
    if (strcmp("compileFunction", name) == 0 && strcmp("compile.c", filename) == 0) {
        code = A;
    } else if (strcmp("xxx", name) == 0 && strcmp("yyy.c", filename) == 0) {
        code = B;
    }

    ... code to compile the function body from the string in "code" ...
}

The interesting parts are A and B. A is the source code for compileFunction including the virus, probably encrypted in some way so it's not obvious from searching the resulting binary. This makes sure that compiling to compiler with itself will preserve the virus injection code.

B is the same for the function we want to replace with our virus. For example, it could be the function "login" in the source file "login.c" which is probably from the Linux kernel. We could replace it with a version that will accept the password "joshua" for the root account in addition to the normal password.

If you compile that and spread it as a binary, there will be no way to find the virus by looking at the source.

The original source of the idea: https://web.archive.org/web/20070714062657/http://www.acm.org/classics/sep95/

回复收藏 0 原文

怂人 2024-07-13 17:02:42

对之前的答案增加好奇心。

这是Linux From Scratch手册的引用，在从源代码开始构建 GCC 编译器的步骤中。（Linux From Scratch 是一种安装 Linux 的方法，它与安装发行版完全不同，因为您必须编译目标系统的每一个二进制文件。）

制作引导程序 
  
“bootstrap”目标不仅编译 GCC，而且多次编译它。它使用首先编译的程序
轮流进行第二次编译，然后再进行第三次编译。然后比较第二个和第三个
编译以确保它可以完美地自我复制。这也意味着它已正确编译。

使用“引导”目标的动机是，用于构建目标系统工具链的编译器可能不具有与目标编译器完全相同的版本。按照这种方式进行，肯定会在目标系统中获得一个可以自行编译的编译器。

Adding a curiosity to the previous answers.

Here's a quote from the Linux From Scratch manual, at the step where one starts building the GCC compiler from its source. (Linux From Scratch is a way to install Linux that is radically different from installing a distribution, in that you have to compile really every single binary of the target system.)

make bootstrap
The 'bootstrap' target does not just compile GCC, but compiles it several times. It uses the programs compiled in a first
round to compile itself a second time, and then again a third time. It then compares these second and third
compiles to make sure it can reproduce itself flawlessly. This also implies that it was compiled correctly.

That use of the 'bootstrap' target is motivated by the fact that the compiler one uses to build the target system's toolchain may not have the very same version of the target compiler. Proceeding in that way one is sure to obtain, in the target system, a compiler that can compile itself.

回复收藏 0 原文

清风疏影 2024-07-13 17:02:42

您无法自行编写编译器，因为您没有任何东西可以用来编译起始源代码。有两种方法可以解决这个问题。

最不受欢迎的是以下内容。您用汇编程序编写一个最小的编译器（恶心），用于最小的语言集，然后使用该编译器来实现该语言的额外功能。不断积累，直到拥有一个具有所有语言功能的编译器。这是一个痛苦的过程，通常只有在您别无选择时才会进行。

首选方法是使用交叉编译器。您可以更改另一台计算机上现有编译器的后端，以创建在目标计算机上运行的输出。然后你就有了一个很好的完整编译器并可以在目标机器上运行。最流行的是 C 语言，因为有很多现有的编译器具有可以交换的可插入后端。

一个鲜为人知的事实是，GNU C++ 编译器有一个仅使用 C 子集的实现。原因是通常很容易找到适用于新目标机器的 C 编译器，然后您可以从中构建完整的 GNU C++ 编译器。现在，您已经在目标机器上安装了 C++ 编译器。

回复收藏 0 原文

复古式 2024-07-13 17:02:42

一般来说，您需要首先让编译器工作（如果是原始的），然后您可以开始考虑使其成为自托管。这实际上被认为是某些语言的一个重要里程碑。

根据我对“mono”的记忆，他们可能需要向反射添加一些东西才能使其正常工作：mono 团队不断指出，有些事情根本无法通过 Reflection.Emit 实现。代码>; 当然，微软团队可能会证明他们是错的。

这有一些真正的优点：对于初学者来说，这是一个相当好的单元测试！而且您只需担心一种语言（即 C# 专家可能不太了解 C++；但现在您可以修复 C# 编译器）。但我想知道这里的工作是否没有足够的职业自豪感：他们只是希望它是自我托管的。

不完全是一个编译器，但我最近一直在开发一个自托管的系统；代码生成器用于生成代码生成器...因此，如果架构发生更改，我只需在其自身上运行它：新版本。如果有错误，我只是返回到早期版本并重试。非常方便，而且非常容易维护。

更新 1

我刚刚观看了 PDC 的 Anders 的这段视频，并且（关于小时）他确实给出了一些更有效的理由——所有这些都是关于编译器即服务的。只是为了记录。

回复收藏 0 原文

笔芯 2024-07-13 17:02:42

我自己编写了 SLIC（用于实现编译器的语言系统）。然后手工将其编译成汇编。 SLIC 有很多功能，因为它是五种子语言的单个编译器：

SYNTAX 解析器编程语言 PPL
GENERATOR LISP 2 基于树爬行的 PSEUDO 代码生成语言
ISO In Sequence、PSEUDO 代码、优化语言
PSEUDO 类似汇编代码的宏生成语言。
MACHOP 汇编机器指令定义语言。

SLIC 的灵感来自于 CWIC（用于编写和实现编译器的编译器）。与大多数编译器开发包不同，SLIC 和 CWIC 使用专门的、特定领域的语言来解决代码生成问题。 SLIC 扩展了 CWIC 代码生成，添加了 ISO、PSEUDO 和 MACHOP 子语言，将目标机器细节与树爬行生成器语言分开。

LISP 2 树和列表

基于 LISP 2 生成器语言的动态内存管理系统是一个关键组件。列表用方括号括起来的语言表示，其组成部分用逗号分隔，即三元素 [a,b,c] 列表。

树：

     ADD
    /   \
  MPY     3
 /   \
5     x

由第一个条目是节点对象的列表表示：

[ADD,[MPY,5,x],3]

树通常在分支之前显示单独的节点：

ADD[MPY[5,x],3]

使用基于 LISP 2 的生成器函数进行解析

生成器函数是一组命名的 (unparse)=>action> 。对...

<NAME>(<unparse>)=><action>;
      (<unparse>)=><action>;
            ...
      (<unparse>)=><action>;

未解析表达式是匹配树模式和/或对象类型的测试，将它们分开并将这些部分分配给局部变量以由其过程操作进行处理。有点像采用不同参数类型的重载函数。除了 ()=> ...按照编码顺序尝试测试。第一个成功的解解析执行其相应的操作。未解析表达式是反汇编测试。 ADD[x,y] 匹配一个两分支 ADD 树，将其分支分配给局部变量 x 和 y。该操作可以是一个简单的表达式或一个 .BEGIN ... .END 有界代码块。今天我会使用 c 风格的 { ... } 块。树匹配，[]，解解析规则可以调用生成器将返回的结果传递给操作：

expr_gen(ADD[expr_gen(x),expr_gen(y)])=> x+y;

具体来说，上面的 expr_gen 解解析匹配一个两分支 ADD 树。在测试模式中，放置在树枝中的单个参数生成器将使用该分支进行调用。但它的参数列表是分配返回对象的局部变量。上面的unparse指定了两个分支是ADD树的反汇编，递归地将每个分支压到expr_gen。左分支返回放入局部变量 x 中。同样，右分支传递给 expr_gen，y 作为返回对象。以上可以是数值表达式求值器的一部分。上面有称为向量的快捷功能，而不是节点字符串，节点向量可以与相应操作的向量一起使用：

expr_gen(#node[expr_gen(x),expr_gen(y)])=> #action;

  node:   ADD, SUB, MPY, DIV;
  action: x+y, x-y, x*y, x/y;

        (NUMBER(x))=> x;
        (SYMBOL(x))=> val:(x);

上面更完整的表达式求值器将 expr_gen 左分支的返回分配给 x，将右分支分配给 y 。返回对 x 和 y 执行的相应操作向量。最后的 unparse=>action 对匹配数字和符号对象。

符号和符号属性

符号可能具有命名属性。 val:(x) 访问 x 中包含的符号对象的 val 属性。通用符号表堆栈是 SLIC 的一部分。 SYMBOL 表可以被压入和弹出，为函数提供本地符号。新创建的符号在顶部符号表中编目。符号查找从顶表开始向后向下搜索符号表堆栈。

生成与机器无关的代码

SLIC 的生成器语言生成 PSEUDO 指令对象，并将它们附加到段代码列表中。 .FLUSH 导致其 PSEUDO 代码列表运行，从列表中删除每个 PSEUDO 指令并调用它。执行后，PSEUDO 对象的内存被释放。除了输出之外，伪指令和生成器动作的程序体基本上是相同的语言。 PSEUDO 旨在充当汇编宏，提供与机器无关的代码序列化。它们提供了一种将特定目标机器从树爬行生成器语言中分离出来的方法。 PSEUDO 调用 MACHOP 函数来输出机器代码。 MACHOP 用于定义汇编伪操作（如 dc、定义常量等）和机器指令或使用向量条目的一系列类似格式的指令。它们只是将参数转换为构成指令的一系列位字段。 MACHOP 调用旨在看起来像程序集，并在编译列表中显示程序集时提供字段的打印格式。在示例代码中，我使用了 c 风格的注释，可以轻松添加该注释，但原始语言中没有。 MACHOP 将代码生成到可位寻址的存储器中。 SLIC 链接器处理编译器的输出。使用向量条目的 DEC-10 用户模式指令的 MACHOP：

.MACHOP #opnm register,@indirect offset (index): // Instruction's parameters.
.MORG 36, O(18): $/36; // Align to 36 bit boundary print format: 18 bit octal $/36
O(9):  #opcd;          // Op code 9 bit octal print out
 (4):  register;       // 4 bit register field appended print
 (1):  indirect;       // 1 bit appended print
 (4):  index;          // 4 bit index register appended print
O(18): if (#opcd&&3==1) offset // immediate mode use value else
       else offset/36;         // memory address divide by 36
                               // to get word address.
// Vectored entry opcode table:
#opnm := MOVE, MOVEI, MOVEM, MOVES, MOVS, MOVSI, MOVSM, MOVSS,
         MOVN, MOVNI, MOVNM, MOVNS, MOVM, MOVMI, MOVMM, MOVMS,
         IMUL, IMULI, IMULM, IMULB, MUL,  MULI,  MULM,  MULB,
                           ...
         TDO,  TSO,   TDOE,  TSOE,  TDOA, TSOA,  TDON,  TSON;
// corresponding opcode value:
#opcd := 0O200, 0O201, 0O202, 0O203, 0O204, 0O205, 0O206, 0O207,
         0O210, 0O211, 0O212, 0O213, 0O214, 0O215, 0O216, 0O217,
         0O220, 0O221, 0O222, 0O223, 0O224, 0O225, 0O226, 0O227,
                           ...
         0O670, 0O671, 0O672, 0O673, 0O674, 0O675, 0O676, 0O677;

.MORG 36，O(18)：$/36；将位置与 36 位边界对齐，以八进制形式打印 18 位的位置 $/36 字地址。 9 位 opcd、4 位寄存器、间接位和 4 位索引寄存器组合并打印，就像单个 18 位字段一样。 18 位地址/36 或立即值以八进制输出并打印。打印出 r1 = 1 和 r2=2 的 MOVEI 示例：

400020 201082 000005            MOVEI r1,5(r2)

使用编译器汇编选项，您可以在编译列表中获得生成的汇编代码。

将其链接在一起

SLIC 链接器作为处理链接和符号解析的库提供。但是，必须为目标计算机编写特定于目标的输出加载文件格式，并与链接器库链接。

生成器语言能够将树写入文件并读取它们，从而允许实现多遍编译器。

代码生成和起源的简短总结

我首先回顾了代码生成，以确保人们理解 SLIC 是一个真正的编译器。 SLIC 的灵感来自于 20 世纪 60 年代末 Systems Development Corporation 开发的 CWIC（用于编写和实现编译器的编译器）。 CWIC 仅具有从 GENERATOR 语言生成数字字节代码的 SYNTAX 和 GENERATOR 语言。字节代码被放置或种植（CWIC 文档中使用的术语）到与命名节相关的内存缓冲区中，并通过 .FLUSH 语句写出。 ACM 档案中提供了有关 CWIC 的 ACM 论文。

成功实现一种主要编程语言

在 20 世纪 70 年代末，SLIC 被用来编写 COBOL 交叉编译器。大部分由一名程序员在大约 3 个月内完成。我根据需要与程序员一起工作了一些。另一位程序员为目标 TI-990 迷你计算机编写了运行时库和 MACHOP。该 COBOL 编译器每秒编译的行数比用汇编语言编写的 DEC-10 本机 COBOL 编译器要多得多。

解析器编程语言

解析器是使用以简单方程形式编写的公式编写的。

<name> <formula type operator> <expression> ;

最低层次的语言元素是字符。标记由语言字符的子集组成。字符类用于命名和定义这些字符子集。字符类定义运算符是冒号 (:) 字符。作为类成员的字符在定义的右侧进行编码。可打印字符包含在素数单 ' 字符串中。非打印字符和特殊字符可以用它们的数字序数表示。类成员由替代 | 分隔开。操作员。类公式以分号结尾。字符类可以包括先前定义的类：

/*  Character Class Formula                                    class_mask */
bin: '0'|'1';                                                // 0b00000010
oct: bin|'2'|'3'|'4'|'5'|'6'|'7';                            // 0b00000110
dgt: oct|'8'|'9';                                            // 0b00001110
hex: dgt|'A'|'B'|'C'|'D'|'E'|'F'|'a'|'b'|'c'|'d'|'e'|'f';    // 0b00011110
upr:  'A'|'B'|'C'|'D'|'E'|'F'|'G'|'H'|'I'|'J'|'K'|'L'|'M'|
      'N'|'O'|'P'|'Q'|'R'|'S'|'T'|'U'|'V'|'W'|'X'|'Y'|'Z';   // 0b00100000
lwr:  'a'|'b'|'c'|'d'|'e'|'f'|'g'|'h'|'i'|'j'|'k'|'l'|'m'|
      'n'|'o'|'p'|'q'|'r'|'s'|'t'|'u'|'v'|'w'|'x'|'y'|'z';   // 0b01000000
alpha:  upr|lwr;                                             // 0b01100000
alphanum: alpha|dgt;                                         // 0b01101110

skip_class 0b00000001 是预定义的，但定义skip_class 可能会超出范围。

总之：字符类是一个只能是字符常量、字符序数或先前定义的字符类的替代列表。当我实现字符类时：类公式被分配了一个类位掩码。（如上面的注释所示）任何具有任何字符文字或序数的类公式都会导致分配类位。掩码是通过将包含的类的类掩码与分配的位（如果有）进行或运算而形成的。类表是根据字符类创建的。由字符序数索引的条目包含指示字符的类成员身份的位。类测试是内联完成的。 eax 中带有字符序号的 IA-86 代码示例说明了类测试：

test    byte ptr [eax+_classmap],dgt

后面跟着 a:

jne      <success>

或

je       <failure>

使用 IA-86 指令代码示例，因为我认为 IA-86 指令如今更为广为人知。评估其类掩码的类名与按字符序数（在 eax 中）索引的类表进行非破坏性 AND 运算。非零结果表示类成员资格。（除了包含该字符的 al（EAX 的低 8 位）之外，EAX 均为零）。

这些旧编译器中的令牌有点不同。关键词没有被解释为标记。它们只是通过解析器语言中带引号的字符串常量进行匹配。通常不保留引用的字符串。可以使用修饰符。 A + 保持字符串匹配。（即 +'-' 匹配 - 字符，成功时保留该字符） , 操作（即 'E'）将字符串插入到标记中。空白由令牌公式处理，跳过前导 SKIP_CLASS 字符，直到进行第一个匹配。请注意，显式的skip_class字符匹配将停止跳过，允许令牌以skip_class字符开头。字符串标记公式会跳过与单引号 quitdd 字符或双引号字符串匹配的前导 Skip_class 字符。有趣的是匹配 " 引用字符串中的 " 字符：

string .. (''' .ANY ''' | '"' $(-"""" .ANY | """""","""") '"') MAKSTR[];

第一个替代项匹配任何单引号引用的字符。正确的替代方案匹配双引号引起来的字符串，该字符串可能包含双引号字符，使用两个 " 字符一起表示单个 " 字符。该公式定义了其自身定义中使用的字符串。内部右替代 '"' $(-"""" .ANY | """""","""") '"' 匹配双引号引起来的字符串。我们可以使用单个 ' 引号字符来匹配双引号 " 字符。但是，在双 " 引号字符串中，如果我们希望使用 " 字符，则必须使用两个 " 字符来获取一个字符。例如，在内部左侧替代中，匹配除引号之外的任何字符：

-"""" .ANY

使用负向查看 -""""，当成功（不匹配 " 字符）时，则匹配 .ANY 字符（不能是 " 字符，因为 - “”“”消除了这种可能性）。正确的选择是采用 -"""" 匹配一个 " 字符，失败是正确的选择：

"""""",""""

尝试匹配两个 " 字符，使用单个双 " 替换它们，使用 ,"""" 插入单个 " 字符。两个内部替代方案都未能匹配结束字符串引号字符，并调用 MAKSTR[] 来创建字符串对象。 $ 序列，成功时循环，运算符用于匹配序列。令牌公式跳过前导跳过类字符（空白）。一旦完成第一个匹配，skip_class 跳过就会被禁用。我们可以使用[]调用其他语言编写的函数。 MAKSTR[]、MAKIN[]、MAKOCT[]、MAKHEX[]、MAKFLOAT[] 和 MAKINT[] 是提供的库函数，可将匹配的标记字符串转换为类型化对象。下面的数字公式说明了相当复杂的标记识别：

number .. "0B" bin $bin MAKBIN[]        // binary integer
         |"0O" oct $oct MAKOCT[]        // octal integer
         |("0H"|"0X") hex $hex MAKHEX[] // hexadecimal integer
// look for decimal number determining if integer or floating point.
         | ('+'|+'-'|--)                // only - matters
           dgt $dgt                     // integer part
           ( +'.' $dgt                  // fractional part?
              ((+'E'|'e','E')           // exponent  part
               ('+'|+'-'|--)            // Only negative matters
               dgt(dgt(dgt|--)|--)|--)  // 1 2 or 3 digit exponent
             MAKFLOAT[] )               // floating point
           MAKINT[];                    // decimal integer

上面的数字标记公式可识别整数和浮点数。 ——替代方案总是成功的。数字对象可用于计算。公式成功后，标记对象将被推入解析堆栈。 (+'E'|'e','E') 中的指数领先很有趣。我们希望 MAKEFLOAT[] 始终使用大写 E。但我们允许使用小写“e”来替换它，“E”。

您可能已经注意到字符类和标记公式的一致性。解析公式继续添加回溯选项和树构建运算符。回溯和非回溯替代运算符不得在表达式级别内混合。你可能没有(a|b\c)混合非回溯| withe \回溯替代方案。 (a\b\c)、(a|b|c) 和 ((a|b)\c) 有效。 \ 回溯替代方案在尝试其左侧替代方案之前保存解析状态，并且在失败时在尝试右侧替代方案之前恢复解析状态。在一系列替代方案中，第一个成功的替代方案使团队感到满意。没有尝试其他替代方案。因式分解和分组提供了持续推进的解析。回溯替代方案在尝试其左侧替代方案之前创建解析的已保存状态。当解析可能部分匹配然后失败时，需要回溯：

(a b | c d)\ e

在上面，如果返回失败，则尝试替代 cd。如果 c 返回失败，将尝试回溯替代方案。如果 a 成功而 b 失败，则解析将被回溯并尝试 e。同样，a 失败 c 成功，b 失败，则回溯解析并采用替代 e。回溯不限于公式内。如果任何解析公式在任何时候进行部分匹配，然后失败，则解析将重置为顶部回溯并采取替代方案。如果代码已输出并感知到已创建回溯，则可能会发生编译失败。在开始编译之前设置回溯。返回失败或回溯到它是编译器失败。回溯是堆叠的。我们可以使用负数和正数？窥视/前瞻运算符可以在不推进解析的情况下进行测试。字符串测试是一种预知，只需要保存和重置输入状态。前瞻是一个在失败之前进行部分匹配的解析表达式。前瞻是通过回溯来实现的。

解析器语言既不是 LL 也不是 LR 解析器。但是，用于编写递归体面解析器的编程语言可以在其中对树构造进行编程：

:<node name> creates a node object and pushes it onto the node stack.
..           Token formula create token objects and push them onto 
             the parse stack.
!<number>    pops the top node object and top <number> of parstack 
             entries into a list representation of the tree. The 
             tree then pushed onto the parse stack.
+[ ... ]+    creates a list of the parse stack entries created 
             between them:
              '(' +[argument $(',' argument]+ ')'
             could parse an argument list. into a list.

常用的解析示例是算术表达式：

Exp = Term $(('+':ADD|'-':SUB) Term!2); 
Term = Factor $(('*':MPY|'/':DIV) Factor!2);
Factor = ( number
         | id  ( '(' +[Exp $(',' Exp)]+ ')' :FUN!2
               | --)
         | '(' Exp ')" )
         (^' Factor:XPO!2 |--);

使用循环的 Exp 和 Term 创建一棵左手树。使用右递归的因数创建右手树：

d^(x+5)^3-a+b*c => ADD[SUB[EXP[EXP[d,ADD[x,5]],3],a],MPY[b,c]]

              ADD
             /   \
          SUB     MPY
         /   \   /   \
      EXP     a b     c
     /   \
    d     EXP     
         /   \
      ADD     3
     /   \
    x     5

这是 cc 编译器的一些内容，它是带有 c 风格注释的 SLIC 的更新版本。函数类型（语法、标记、字符类、生成器、PSEUDO 或 MACHOP）由其 id 后面的初始语法确定。
使用这些自上而下的解析器，您可以从定义公式的程序开始：

program = $((declaration            // A program is a sequence of
                                    // declarations terminated by
            |.EOF .STOP)            // End Of File finish & stop compile
           \                        // Backtrack: .EOF failed or
                                    // declaration long-failed.
             (ERRORX["?Error?"]     // report unknown error
                                    // flagging furthest parse point.
              $(-';' (.ANY          // find a ';'. skiping .ANY
                     | .STOP))      // character: .ANY fails on end of file
                                    // so .STOP ends the compile.
                                    // (-';') failing breaks loop.
              ';'));                // Match ';' and continue

declaration =  "#" directive                // Compiler directive.
             | comment                      // skips comment text
             | global        DECLAR[*1]     // Global linkage
             |(id                           // functions starting with an id:
                ( formula    PARSER[*1]     // Parsing formula
                | sequencer  GENERATOR[*1]  // Code generator
                | optimizer  ISO[*1]        // Optimizer
                | pseudo_op  PRODUCTION[*1] // Pseudo instruction
                | emitor_op  MACHOP[*1]     // Machine instruction
                )        // All the above start with an identifier
              \ (ERRORX["Syntax error."]
                 garbol);                    // skip over error.

// 注意在创建树时如何分解 id 并随后进行组合。

formula =   ("==" syntax  :BCKTRAK   // backtrack grammar formula
            |'='  syntax  :SYNTAX    // grammar formula.
            |':'  chclass :CLASS     // character class define
            |".." token   :TOKEN     // token formula
              )';' !2                // Combine node name with id 
                                     // parsed in calling declaration 
                                     // formula and tree produced
                                     // by the called syntax, token
                                     // or character class formula.
                $(-(.NL |"/*") (.ANY|.STOP)); Comment ; to line separator?

chclass = +[ letter $('|' letter) ]+;// a simple list of character codes
                                     // except 
letter  = char | number | id;        // when including another class

syntax  = seq ('|' alt1|'\' alt2 |--);

alt1    = seq:ALT!2 ('|' alt1|--);  Non-backtrack alternative sequence.

alt2    = seq:BKTK!2 ('\' alt2|--); backtrack alternative sequence

seq     = +[oper $oper]+;

oper    = test | action | '(' syntax ')' | comment; 

test    = string | id ('[' (arg_list| ,NILL) ']':GENCALL!2|.EMPTY);

action  = ':' id:NODE!1
        | '!' number:MAKTREE!1
        | "+["  seq "]+" :MAKLST!1;

//     C style comments
comment  = "//" $(-.NL .ANY)
         | "/*" $(-"*/" .ANY) "*/";

值得注意的是解析器语言如何处理注释和错误恢复。

我想我已经回答了这个问题。在这里编写了 SLIC 的后继者 cc 语言本身的大部分内容。目前还没有适合它的编译器。但我可以手动将其编译成汇编代码、裸 asm c 或 c++ 函数。

I wrote SLIC (System of Languages for Implementing Compilers) in itself. Then hand compiled it into assembly. There is a lot to SLIC as it was a single compiler of five sub-languages:

SYNTAX Parser Programming Language PPL
GENERATOR LISP 2 based tree-crawling PSEUDO code generation language
ISO In Sequence, PSEUDO code, Optimization language
PSEUDO Macro like Assembly code producing language.
MACHOP Assembly-machine instruction defining language.

SLIC was inspired by CWIC (Compiler for Writing and Implementing Compilers). Unlike most compiler development packages SLIC and CWIC addressed code generation with specialize, domain specific, languages. SLIC extends CWICs code generation adding the ISO, PSEUDO and MACHOP sub-languages separating target machine specifics out of the tree-crawling generator language.

LISP 2 trees and lists

The LISP 2 based generator language's dynamic memory management system is a key component. Lists are expressed in the language inclosed in square brackets, its components separated by commas i.e. a three element [a,b,c] list.

Trees:

     ADD
    /   \
  MPY     3
 /   \
5     x

are represented by lists whose first entry is a node object:

[ADD,[MPY,5,x],3]

Trees are commonly displayed with the node separate preceding the branches:

ADD[MPY[5,x],3]

Unparsing with LISP 2 based generator functions

A generator function is a named set of (unparse)=>action> pairs ...

<NAME>(<unparse>)=><action>;
      (<unparse>)=><action>;
            ...
      (<unparse>)=><action>;

Unparse expressions are tests that match tree patterns and/or object types breaking them apart and assigning those parts to local variable to be processed by its procedural action. Kind of like an overloaded function taking different argument types. Except the ()=> ... tests are attempted in the order coded. The first successful unparse executing its corresponding action. The unparse expressions are disassembling tests. ADD[x,y] matches a two branch ADD tree assigning its branches to local variables x and y. The action may be a simple expression or a .BEGIN ... .END bounded code block. I would use c style { ... } blocks today. Tree matching, [], unparse rules may call generators passing the returned result(s) to the action:

expr_gen(ADD[expr_gen(x),expr_gen(y)])=> x+y;

Specifically the above expr_gen unparse matches a two branch ADD tree. Within the test pattern a single argument generator placed in a tree branch will be called with that branch. Its argument list though are local variables assigned returned objects. Above the unparse specifies a two branch is ADD tree disassembly, recursive pressing each branch to expr_gen. The left branch return placed into into local variables x. Likewise the right branch passed to expr_gen with y the return object. The above could be part of a numeric expression evaluator. There were shortcut features called vectors were in the above instead of the node string a vector of nodes could be used with a vector of corresponding actions:

expr_gen(#node[expr_gen(x),expr_gen(y)])=> #action;

  node:   ADD, SUB, MPY, DIV;
  action: x+y, x-y, x*y, x/y;

        (NUMBER(x))=> x;
        (SYMBOL(x))=> val:(x);

The above more complete expression evaluator assigning the return from expr_gen left branch to x and the right branch to y. The corresponding action vector performed on x and y returned. The last unparse=>action pairs match numeric and symbol objects.

Symbol and symbol attributes

Symbols may have named attributes. val:(x) access the val attribute of the symbol object contained in x. A generalized symbol table stack is part of SLIC. The SYMBOL table may be pushed and popped providing local symbols for functions. Newly created symbol are cataloged in the top symbol table. Symbol lookup searches the symbol table stack from the top table first backward down the stack.

Generating machine independent code

SLIC's generator language produces PSEUDO instruction objects, appending them to a sections code list. A .FLUSH causes its PSEUDO code list to be run removing each PSEUDO instruction from the list and calling it. After execution a PSEUDO objects memory is released. The procedural bodies of PSEUDOs and GENERATOR actions are basically the same language except for their output. PSEUDO are meant to act as assembly macros providing machine independent code sequentialization. They provide a separation of the specific target machine out of the tree crawling generator language. PSEUDOs call MACHOP functions to output machine code. MACHOPs are used to define assembly pseudo ops (like dc, define constant etc) and machine instruction or a family of like formated instructions using vectored entry. They simply transform their parameters into a sequence of bit fields making up the instruction. MACHOP calls are meant to look like assembly and provide print formatting of the fields for when assembly is shown in the compile listing. In the example code I am using c style commenting that could be easily added but was not in the original languages. MACHOPs are producing code into a bit addressable memory. The SLIC linker handles output of the compiler. A MACHOP for the DEC-10 user mode instructions using vectored entry:

.MACHOP #opnm register,@indirect offset (index): // Instruction's parameters.
.MORG 36, O(18): $/36; // Align to 36 bit boundary print format: 18 bit octal $/36
O(9):  #opcd;          // Op code 9 bit octal print out
 (4):  register;       // 4 bit register field appended print
 (1):  indirect;       // 1 bit appended print
 (4):  index;          // 4 bit index register appended print
O(18): if (#opcd&&3==1) offset // immediate mode use value else
       else offset/36;         // memory address divide by 36
                               // to get word address.
// Vectored entry opcode table:
#opnm := MOVE, MOVEI, MOVEM, MOVES, MOVS, MOVSI, MOVSM, MOVSS,
         MOVN, MOVNI, MOVNM, MOVNS, MOVM, MOVMI, MOVMM, MOVMS,
         IMUL, IMULI, IMULM, IMULB, MUL,  MULI,  MULM,  MULB,
                           ...
         TDO,  TSO,   TDOE,  TSOE,  TDOA, TSOA,  TDON,  TSON;
// corresponding opcode value:
#opcd := 0O200, 0O201, 0O202, 0O203, 0O204, 0O205, 0O206, 0O207,
         0O210, 0O211, 0O212, 0O213, 0O214, 0O215, 0O216, 0O217,
         0O220, 0O221, 0O222, 0O223, 0O224, 0O225, 0O226, 0O227,
                           ...
         0O670, 0O671, 0O672, 0O673, 0O674, 0O675, 0O676, 0O677;

The .MORG 36, O(18): $/36; aligns the location to a 36 bit boundary printing the location $/36 word address of 18 bits in octal. The 9 bit opcd,4 bit register, indirect bit and 4 bit index register are combined and printed as if a single 18 bit field. The 18 bit address/36 or immediate value is output and printed in octal. A MOVEI example print out with r1 = 1 and r2=2:

400020 201082 000005            MOVEI r1,5(r2)

With the compiler assembly option you get the generated assembly code in the compile listing.

Link it together

The SLIC linker is supplied as a library that handles the linking and symbol resolutions. Target specific output load file formatting though must be written for target machines and linked with the linker library library.

The generator language is capable of writing trees to a file and reading them allowing a multipass compiler to be implemented.

Short summery of code generation and origins

I have went over the code generation first to insure it is understood that SLIC was a true compiler compiler. SLIC was inspired by CWIC (Compiler for Writing and Implementing Compilers) developed at Systems Development Corporation in the late 1960s. CWIC only had SYNTAX and GENERATOR languages producing numeric byte code out of the GENERATOR language. Byte code was placed or planted (the term used in CWICs documentation) into memory buffers associated with named sections and written out by a .FLUSH statement. An ACM paper on CWIC is available from the ACM archives.

Successfully implementing a major programming language

In the late 1970s SLIC was used to write a COBOL cross compiler. Completed in about 3 months mostly by a single programmer. I worked a bit with the programmer as needed. Another programer wrote the runtime library and MACHOPs for the target TI-990 mini-COMPUTER. That COBOL compiler compiled substantially more lines per second then the DEC-10 native COBOL compiler written in assembly.

More to a compiler then usually talked about

A big part of writing a compiler from scratch is the run time library. You need a symbol table. You need input and output. Dynamic memory management etc. It easily can be more work writing the runtime library for a compiler then writing the compiler. But with SLIC that runtime library is common to all compilers developen in SLIC. Note there are two runtime libraries. One for the language's (COBOL for example) target machine. The other is the compiler compilers runtime library.

I think I have established that these were not parser generators. So now with a little understanding of the back end I can explain the parser programming language.

Parser programming language

The parser is written using formula written in the form of simple equations.

<name> <formula type operator> <expression> ;

The language element at the lowest level is the character. Tokens are formed from a subsets of the characters of the language. Character classes are used to name and define those character subsets. The character class defining operator is the colon (:) character. Characters that are members of the class are coded on the right side of the definition. Printable characters are enclosed in primes single ' strings. Nonprinting and special characters may be represented by their numeric ordinal. Class members are separated by an alternative | operator. A class formula ends with a semicolon. Character classes may include previously defined classes:

/*  Character Class Formula                                    class_mask */
bin: '0'|'1';                                                // 0b00000010
oct: bin|'2'|'3'|'4'|'5'|'6'|'7';                            // 0b00000110
dgt: oct|'8'|'9';                                            // 0b00001110
hex: dgt|'A'|'B'|'C'|'D'|'E'|'F'|'a'|'b'|'c'|'d'|'e'|'f';    // 0b00011110
upr:  'A'|'B'|'C'|'D'|'E'|'F'|'G'|'H'|'I'|'J'|'K'|'L'|'M'|
      'N'|'O'|'P'|'Q'|'R'|'S'|'T'|'U'|'V'|'W'|'X'|'Y'|'Z';   // 0b00100000
lwr:  'a'|'b'|'c'|'d'|'e'|'f'|'g'|'h'|'i'|'j'|'k'|'l'|'m'|
      'n'|'o'|'p'|'q'|'r'|'s'|'t'|'u'|'v'|'w'|'x'|'y'|'z';   // 0b01000000
alpha:  upr|lwr;                                             // 0b01100000
alphanum: alpha|dgt;                                         // 0b01101110

The skip_class 0b00000001 is predefined but may be overroad be defining a skip_class.

In summary: A character class is a list of alternative that can only be a character constant, a character's ordinal, or a previously defined character class. As I implemented character classes: The class formula is assigned a class bit mask. (Shown in comments above) Any class formula having any character literal or ordinal causes a class bit to be allocated. A mask is made by oring the included class(es)'s class mask(s) together with the allocated bit (if any). A class table is created from the character classes. An entry indexed by a character's ordinal contains bits indicating the character's class memberships. Class testing is done inline. An IA-86 code example with the character's ordinal in eax illustrates class testing:

test    byte ptr [eax+_classmap],dgt

Followed by a:

jne      <success>

je       <failure>

IA-86 instruction code examples are used because I think IA-86 instructions are more widely known today. The class name evaluating to its class mask is non-destructively ANDed with the class-table indexed by the characters ordinal(in eax). A non-zero result indicates class membership. (EAX is zeroed except for al(the low 8 bits of EAX) that contains the character).

Tokens were a bit different in these old compilers. Key words were not explained as tokens. They simply were matched by quoted string constants in the parser language. Quoted strings are not normally kept. Modifiers may be used. A + keeps the string matched. (i.e. +'-' matches a - character keeping the character when successful) The , operation (i.e. ,'E') inserts the string into the token. White space is handled by the token formula skipping leading SKIP_CLASS characters until a first match is made. Note that an explicit skip_class character match will stop the skipping allowing a token to start with a skip_class character. The string token formula skips leading skip_class characters matching a single quote quitedd character or a double quoted string. Of interest is the matching a " character within a " quoted string:

string .. (''' .ANY ''' | '"' $(-"""" .ANY | """""","""") '"') MAKSTR[];

The first alternative matches any single quote quoted character. The right alternative matches a double quote quoted string that may include double quote characters using two " character together to represent a single " character. This formula defines the strings used in its own definition. The inner right alternative '"' $(-"""" .ANY | """""","""") '"' matches a double quote quoted string. We can use a single ' quoted character to match a double quote " character. However within the double " quoted string if we wish to use a " character we must use two " characters to get one. For example in the inner left alternative matching any character except a quote:

-"""" .ANY

a negative peek ahead -"""" is used that when successful (not matching a " character) then matches .ANY character (which can not be a " character because -"""" eliminated that possibility). The right alternative is taking on -"""" matching a " character and failing were the right alternative:

"""""",""""

tries to match two " characters replacing them with a single double " using ,"""" to inserting thw single " character. Both inner alternatives failing the closing string quote character is matched and MAKSTR[] called to create a string object. The $ sequence, loop while successful, operator is used in matching a sequence. Token formula skip leading skip class characters(whit space). Once a first match is made skip_class skipping is disabled. We can call functions programed in other languages using []. MAKSTR[], MAKBIN[], MAKOCT[], MAKHEX[], MAKFLOAT[], and MAKINT[] are supplied library function that convert a matched token string to a typed object. The number formula below illustrates a fairly complex token recognition:

number .. "0B" bin $bin MAKBIN[]        // binary integer
         |"0O" oct $oct MAKOCT[]        // octal integer
         |("0H"|"0X") hex $hex MAKHEX[] // hexadecimal integer
// look for decimal number determining if integer or floating point.
         | ('+'|+'-'|--)                // only - matters
           dgt $dgt                     // integer part
           ( +'.' $dgt                  // fractional part?
              ((+'E'|'e','E')           // exponent  part
               ('+'|+'-'|--)            // Only negative matters
               dgt(dgt(dgt|--)|--)|--)  // 1 2 or 3 digit exponent
             MAKFLOAT[] )               // floating point
           MAKINT[];                    // decimal integer

The above number token formula recognizes integer and floating point numbers. The -- alternatives are always successful. Numeric objects may be used in calculations. The token objects are pushed onto the parse stack on success of the formula. The exponent lead in (+'E'|'e','E') is interesting. We wish to always have an uppercase E for MAKEFLOAT[]. But we allow a lower case 'e' replacing it using ,'E'.

You may have noticed consistencies of character class and token formula. The parsing formula continue that adding backtracking alternatives and tree construction operators. Backtracking and non-backtracking alternative operators may not be mixed within an expression level. You may not have (a | b \ c) mixing non-backtracking | withe \ backtracking alternative. (a\b\c), (a|b|c) and ((a|b)\c) are valid. A \ backtracking alternative saves the parse state before attempting its left alternative and on failure restores the parse state before attempting the right alternative. In a sequence of alternatives the first successful alternative satisfies the group. Further alternatives are not attempted. Factoring and grouping provides for a continuous advancing parse. The backtrack alternative creates a saved state of the parse before it attempts its left alternative. Backtracking is required when the parse may make a partial match and then fail:

(a b | c d)\ e

In the above if a returns failure the alternative c d is attempted. If then c returns failure the backtrack alternative will be attempted. If a succeeds and b fails the parse wile be backtracked and e attempted. Likewise a failing c successful and b fails the parse is backtracked and the alternative e taken. Backtracking is not limited to within a formula. If any parsing formula makes a partial match at any time and then fails the parse is reset to the top backtrack and its alternative taken. A compile failure can occur if code has been output sense the backtrack was created. A backtrack is set before starting the compile. Returning failure or backtracking to it is a compiler failure. Backtracks are stacked. We may use negative - and positive ? peek/look ahead operators to test without advancing the parse. being string test is a peek ahead only needing the input state saved and reset. A look ahead would be a parsing expression that makes a partial match before failing. A look ahead is implemented using backtracking.

The parser language is neither an LL or LR parser. But a programming language for writing a recursive decent parser in which you program tree construction:

:<node name> creates a node object and pushes it onto the node stack.
..           Token formula create token objects and push them onto 
             the parse stack.
!<number>    pops the top node object and top <number> of parstack 
             entries into a list representation of the tree. The 
             tree then pushed onto the parse stack.
+[ ... ]+    creates a list of the parse stack entries created 
             between them:
              '(' +[argument $(',' argument]+ ')'
             could parse an argument list. into a list.

A commonly used parsing example is an arithmetic expression:

Exp = Term $(('+':ADD|'-':SUB) Term!2); 
Term = Factor $(('*':MPY|'/':DIV) Factor!2);
Factor = ( number
         | id  ( '(' +[Exp $(',' Exp)]+ ')' :FUN!2
               | --)
         | '(' Exp ')" )
         (^' Factor:XPO!2 |--);

Exp and Term using a loop creates a left handed tree. Factor using right recursion creates a right handed tree:

d^(x+5)^3-a+b*c => ADD[SUB[EXP[EXP[d,ADD[x,5]],3],a],MPY[b,c]]

              ADD
             /   \
          SUB     MPY
         /   \   /   \
      EXP     a b     c
     /   \
    d     EXP     
         /   \
      ADD     3
     /   \
    x     5

Here is a bit of the cc compiler, an updated version of SLIC with c style comments. Function types (grammar, token, character class, generator, PSEUDO, or MACHOP are determined by their initial syntax following their id.
With these top-down parsers you start with a program defining formula:

program = $((declaration            // A program is a sequence of
                                    // declarations terminated by
            |.EOF .STOP)            // End Of File finish & stop compile
           \                        // Backtrack: .EOF failed or
                                    // declaration long-failed.
             (ERRORX["?Error?"]     // report unknown error
                                    // flagging furthest parse point.
              $(-';' (.ANY          // find a ';'. skiping .ANY
                     | .STOP))      // character: .ANY fails on end of file
                                    // so .STOP ends the compile.
                                    // (-';') failing breaks loop.
              ';'));                // Match ';' and continue

declaration =  "#" directive                // Compiler directive.
             | comment                      // skips comment text
             | global        DECLAR[*1]     // Global linkage
             |(id                           // functions starting with an id:
                ( formula    PARSER[*1]     // Parsing formula
                | sequencer  GENERATOR[*1]  // Code generator
                | optimizer  ISO[*1]        // Optimizer
                | pseudo_op  PRODUCTION[*1] // Pseudo instruction
                | emitor_op  MACHOP[*1]     // Machine instruction
                )        // All the above start with an identifier
              \ (ERRORX["Syntax error."]
                 garbol);                    // skip over error.

// Note how id is factored off and later combined when creating the tree.

formula =   ("==" syntax  :BCKTRAK   // backtrack grammar formula
            |'='  syntax  :SYNTAX    // grammar formula.
            |':'  chclass :CLASS     // character class define
            |".." token   :TOKEN     // token formula
              )';' !2                // Combine node name with id 
                                     // parsed in calling declaration 
                                     // formula and tree produced
                                     // by the called syntax, token
                                     // or character class formula.
                $(-(.NL |"/*") (.ANY|.STOP)); Comment ; to line separator?

chclass = +[ letter $('|' letter) ]+;// a simple list of character codes
                                     // except 
letter  = char | number | id;        // when including another class

syntax  = seq ('|' alt1|'\' alt2 |--);

alt1    = seq:ALT!2 ('|' alt1|--);  Non-backtrack alternative sequence.

alt2    = seq:BKTK!2 ('\' alt2|--); backtrack alternative sequence

seq     = +[oper $oper]+;

oper    = test | action | '(' syntax ')' | comment; 

test    = string | id ('[' (arg_list| ,NILL) ']':GENCALL!2|.EMPTY);

action  = ':' id:NODE!1
        | '!' number:MAKTREE!1
        | "+["  seq "]+" :MAKLST!1;

//     C style comments
comment  = "//" $(-.NL .ANY)
         | "/*" $(-"*/" .ANY) "*/";

Of note is how the parser language handles commenting and error recovery.

I think I have answered the question. Having written a big part of SLICs successor, the cc language in itself here. There is no compiler for it as yet. But I can hand compile it into assembly code, naked asm c or c++ functions.

回复收藏 0 原文