在用 Ocaml 编写的编译器中，在哪里/如何声明变量的唯一键？

发布于 2024-11-17 19:22:39 字数 1265 浏览 3 评论 0原文

我正在用 Ocaml 编写 mini-pascal 编译器。例如，我希望我的编译器接受以下代码：

program test;
var
   a,b : boolean;
   n : integer;
begin
   ...
end.

我在处理变量声明（var 之后的部分）时遇到困难。目前，变量的类型在 sib_syntax.ml 中定义如下：

type s_var =
    { s_var_name: string;
      s_var_type: s_type; 
      s_var_uniqueId: s_uniqueId (* key *) }

其中 s_var_uniqueId（而不是s_var_name）是变量的唯一键。我的第一个问题是，每次获得新变量时，在哪里以及如何实现生成新 id 的机制（实际上是将最大 id 加 1）。我想知道是否应该在 sib_parser.mly 中实现它，这可能涉及静态变量 cur_id以及binding部分的修改，又不知道如何在.mly中实现。或者我应该在下一阶段实现该机制 - interpreter.ml？但在这种情况下，问题是如何使.mly与类型s_var一致，我应该在部分提供什么s_var_uniqueId 绑定？

另一个问题是关于 .mly 中 statement 的这一部分：

id = IDENT COLONEQ e = expression
  { Sc_assign (Sle_var {s_var_name = id; s_var_type = St_void}, e) }

在这里，我还需要提供下一个级别（interpreter.ml）a我只知道其中的 s_var_name 变量，那么我可以在这里对其 s_var_type 和 s_var_uniqueId 做什么呢？

有人可以帮忙吗？非常感谢！

原文

I am writing a compiler of mini-pascal in Ocaml. I would like my compiler to accept the following code for instance:

program test;
var
   a,b : boolean;
   n : integer;
begin
   ...
end.

I have difficulties in dealing with the declaration of variables (the part following var). At the moment, the type of variables is defined like this in sib_syntax.ml:

type s_var =
    { s_var_name: string;
      s_var_type: s_type; 
      s_var_uniqueId: s_uniqueId (* key *) }

Where s_var_uniqueId (instead of s_var_name) is the unique key of the variables. My first question is, where and how I could implement the mechanism of generating a new id (actually by increasing the biggest id by 1) every time I have got a new variable. I am wondering if I should implement it in sib_parser.mly, which probably involves a static variable cur_id and the modification of the part of binding, again don't know how to realize them in .mly. Or should I implement the mechanism at the next stage - the interpreter.ml? but in this case, the question is how to make the .mly consistent with the type s_var, what s_var_uniqueId should I provide in the part of binding?

Another question is about this part of statement in .mly:

id = IDENT COLONEQ e = expression
  { Sc_assign (Sle_var {s_var_name = id; s_var_type = St_void}, e) }

Here, I also need to provide the next level (the interpreter.ml) a variable of which I only know the s_var_name, so what could I do regarding its s_var_type and s_var_uniqueId here?

Could anyone help? Thank you very much!

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

暗地喜欢 2024-11-24 19:22:39

要问自己的第一个问题是您是否真的需要一个唯一的 ID。根据我的经验，它们几乎从来没有必要，甚至没有用。如果您想要做的是通过alpha-equivalence使变量唯一，那么这应该在解析完成后发生，并且可能会涉及某种形式的 DeBruijn索引而不是唯一标识符。

无论哪种方式，每次调用时都会返回新整数标识符的函数是：

let unique = 
  let last = ref 0 in 
  fun () -> incr last ; !last

let one = unique ()  (* 1 *)
let two = unique ()  (* 2 *)

因此，您可以简单地分配 { ... ; s_var_uniqueId = unique () } 在您的 Menhir 规则中。

您在这里要解决的更重要的问题是变量绑定。变量x在一个位置定义并在另一个位置使用，并且您需要确定它在两个位置恰好是相同的变量。有很多方法可以做到这一点，其中之一是将绑定延迟到解释器。我将向您展示如何在解析过程中处理这个问题。

首先，我将定义一个上下文：它是一组变量，允许您根据变量的名称轻松检索变量。您可能想使用哈希表或映射来创建它，但为了简单起见，我将在此处使用 List.assoc。

type s_context = {
  s_ctx_parent : s_context option ;
  s_ctx_bindings : (string * (int * s_type)) list ;
  s_ctx_size : int ;
}

let empty_context parent = {
  s_ctx_parent = parent ;
  s_ctx_bindings = [] ;
  s_ctx_size = 0
}

let bind v_name v_type ctx = 
  try let _ = List.assoc ctx.s_ctx_bindings v_name in
      failwith "Variable is already defined"
  with Not_found -> 
    { ctx with 
      s_ctx_bindings = (v_name, (ctx.s_ctx_size, v_type)) 
        :: ctx.s_ctx_bindings ;
      s_ctx_size = ctx.s_ctx_size + 1 }

let rec find v_name ctx =       
  try 0, List.assoc ctx.s_ctx_bindings v_name
  with Not_found -> 
    match ctx.s_ctx_parent with 
      | Some parent -> let depth, found = find v_name parent in
                       depth + 1, found
      | None -> failwith "Variable is not defined"

因此，bind 向当前上下文添加一个新变量，find 在当前上下文及其父变量中查找变量，并返回绑定数据和发现它的深度。因此，您可以将所有全局变量放在一个上下文中，然后将函数的所有参数放在另一个以全局上下文作为其父级的上下文中，然后将函数中的所有局部变量（当您拥有它们时）放在第三个上下文中将函数的主上下文作为父级，依此类推。

因此，例如，find 'x' ctx 将返回类似 0, (3, St_int) 的内容，其中 0 是变量，3 是变量在 DeBruijn 索引标识的上下文中的位置，St_int 是类型。

type s_var = {
  s_var_deBruijn: int;
  s_var_type: s_type;
  s_var_pos: int 
}

let find v_name ctx = 
   let deBruijn, (pos, typ) = find v_name ctx in 
   { s_var_deBruijn = deBruijn ;
     s_var_type = typ ;
     s_var_pos = pos }

当然，您需要函数来存储其上下文，并确保第一个参数是上下文中位置 0 的变量：

type s_fun =
{ s_fun_name: string;
  s_fun_type: s_type;
  s_fun_params: context; 
  s_fun_body: s_block; }

let context_of_paramlist parent paramlist = 
  List.fold_left 
    (fun ctx (v_name,v_type) -> bind v_name v_type ctx) 
    (empty_context parent)
    paramlist

然后，您可以更改解析器以考虑上下文。技巧在于，大多数规则将返回一个以上下文作为参数并返回 AST 节点的函数，而不是返回代表 AST 一部分的对象。

例如：

int_expression:
  (* Constant : ignore the context *)
| c = INT { fun _ -> Se_const (Sc_int c) }
  (* Variable : look for the variable inside the contex *)
| id = IDENT { fun ctx -> Se_var (find id ctx) }
  (* Subexpressions : pass the context to both *)
| e1 = int_expression o = operator e2 = int_expression 
  { fun ctx -> Se_binary (o, e1 ctx, e2 ctx) }
;

因此，您只需通过表达式递归地“向下”传播上下文即可。唯一聪明的部分是创建新上下文时的部分（您还没有这个语法，所以我只是添加一个占位符）：

| function_definition_expression (args, body) 
  { fun ctx -> let ctx = context_of_paramlist (Some ctx) args in
               { s_fun_params = ctx ; 
                 s_fun_body = body ctx } }

以及全局上下文（程序规则本身不返回函数，但block 规则确实如此，因此从全局变量创建上下文并提供）。

prog:
  PROGRAM IDENT SEMICOLON
  globals = variables
  main = block
  DOT
    { let ctx = context_of_paramlist None globals in 
      { globals = ctx;
        main = main ctx } }

由于 DeBruijn 索引，所有这些都使得解释器的实现变得更加容易：您可以有一个“堆栈”来保存您的值（value 类型），定义为：

type stack = value array list

然后，读取和写入变量 < code>x 就这么简单：

let read stack x = 
  (List.nth stack x.s_var_deBruijn).(x.s_var_pos)

let write stack x value = 
  (List.nth stack x.s_var_deBruijn).(x.s_var_pos) <- value

另外，由于我们确保函数参数的顺序与其在函数上下文中的位置相同，因此如果您想调用函数 f 及其函数，参数存储在数组args中，那么构建堆栈就像这样简单：

let inner_stack = args :: stack in
(* Evaluate f.s_fun_body with inner_stack here *)

但我相信当您开始使用解释器时，您会有更多问题要问；）

The first question to ask yourself is whether you actually need an unique id. From my experience, they're almost never necessary or even useful. If what you're trying to do is making variables unique through alpha-equivalence, then this should happen after parsing is complete, and will probably involve some form of DeBruijn indices instead of unique identifiers.

Either way, a function which returns a new integer identifier every time it is called is:

let unique = 
  let last = ref 0 in 
  fun () -> incr last ; !last

let one = unique ()  (* 1 *)
let two = unique ()  (* 2 *)

So, you can simply assign { ... ; s_var_uniqueId = unique () } in your Menhir rules.

The more important problem you're trying to solve here is that of variable binding. Variable x is defined in one location and used in another, and you need to determine that it happens to be the same variable in both places. There are many ways of doing this, one of them being to delay the binding until the interpreter. I'm going to show you how to deal with this during parsing.

First, I'm going to define a context: it's a set of variables that allows you to easily retrieve a variable based on its name. You might want to create it with hash tables or maps, but to keep things simple I will be using List.assoc here.

type s_context = {
  s_ctx_parent : s_context option ;
  s_ctx_bindings : (string * (int * s_type)) list ;
  s_ctx_size : int ;
}

let empty_context parent = {
  s_ctx_parent = parent ;
  s_ctx_bindings = [] ;
  s_ctx_size = 0
}

let bind v_name v_type ctx = 
  try let _ = List.assoc ctx.s_ctx_bindings v_name in
      failwith "Variable is already defined"
  with Not_found -> 
    { ctx with 
      s_ctx_bindings = (v_name, (ctx.s_ctx_size, v_type)) 
        :: ctx.s_ctx_bindings ;
      s_ctx_size = ctx.s_ctx_size + 1 }

let rec find v_name ctx =       
  try 0, List.assoc ctx.s_ctx_bindings v_name
  with Not_found -> 
    match ctx.s_ctx_parent with 
      | Some parent -> let depth, found = find v_name parent in
                       depth + 1, found
      | None -> failwith "Variable is not defined"

So, bind adds a new variable to the current context, find looks for a variable in the current context and its parents, and returns both the bound data and the depth at which it was found. So, you could have all global variables in one context, then all parameters of a function in another context that has the global context as its parent, then all local variables in a function (when you'll have them) in a third context that has the function's main context as the parent, and so on.

So, for instance, find 'x' ctx will return something like 0, (3, St_int) where 0 is the DeBruijn index of the variable, 3 is the position of the variable in the context identified by the DeBruijn index, and St_int is the type.

type s_var = {
  s_var_deBruijn: int;
  s_var_type: s_type;
  s_var_pos: int 
}

let find v_name ctx = 
   let deBruijn, (pos, typ) = find v_name ctx in 
   { s_var_deBruijn = deBruijn ;
     s_var_type = typ ;
     s_var_pos = pos }

Of course, you need your functions to store their context, and make sure that the first argument is the variable at position 0 within the context:

type s_fun =
{ s_fun_name: string;
  s_fun_type: s_type;
  s_fun_params: context; 
  s_fun_body: s_block; }

let context_of_paramlist parent paramlist = 
  List.fold_left 
    (fun ctx (v_name,v_type) -> bind v_name v_type ctx) 
    (empty_context parent)
    paramlist

Then, you can change your parser to take into account the context. The trick is that instead of returning an object representing part of your AST, most of your rules will return a function that takes a context as an argument and returns an AST node.

For instance:

int_expression:
  (* Constant : ignore the context *)
| c = INT { fun _ -> Se_const (Sc_int c) }
  (* Variable : look for the variable inside the contex *)
| id = IDENT { fun ctx -> Se_var (find id ctx) }
  (* Subexpressions : pass the context to both *)
| e1 = int_expression o = operator e2 = int_expression 
  { fun ctx -> Se_binary (o, e1 ctx, e2 ctx) }
;

So, you simply propagate the context "down" recursively through the expressions. The only clever parts are those when new contexts are created (you don't have this syntax yet, so I'm just adding a placeholder):

| function_definition_expression (args, body) 
  { fun ctx -> let ctx = context_of_paramlist (Some ctx) args in
               { s_fun_params = ctx ; 
                 s_fun_body = body ctx } }

As well as the global context (the program rule itself does not return a function, but the block rule does, and so a context is created from the globals and provided).

prog:
  PROGRAM IDENT SEMICOLON
  globals = variables
  main = block
  DOT
    { let ctx = context_of_paramlist None globals in 
      { globals = ctx;
        main = main ctx } }

All of this makes the implementation of your interpreter much easier due to the DeBruijn indices: you can have a "stack" which holds your values (of type value) defined as:

type stack = value array list

Then, reading and writing variable x is as simple as:

let read stack x = 
  (List.nth stack x.s_var_deBruijn).(x.s_var_pos)

let write stack x value = 
  (List.nth stack x.s_var_deBruijn).(x.s_var_pos) <- value

Also, since we made sure that function parameters are in the same order as their position in the function context, if you want to call function f and its arguments are stored in the array args, then constructing the stack is as simple as:

let inner_stack = args :: stack in
(* Evaluate f.s_fun_body with inner_stack here *)

But I'm sure you'll have a lot more questions to ask when you start working on your interpeter ;)

回复收藏 0 原文

半山落雨半山空 2024-11-24 19:22:39

如何创建全局 id 生成器：

let unique =
  let counter = ref (-1) in
  fun () -> incr counter; !counter

测试：

# unique ();;
- : int = 0
# unique ();;
- : int = 1

关于您更一般的设计问题：您的数据表示似乎没有忠实地表示编译器阶段。如果您必须在解析阶段之后返回类型感知的数据类型（带有此字段 s_var_type），则说明出现了问题。您有两种选择：

为解析后 AST 设计更精确的数据表示形式，这与键入后 AST 不同，并且没有那些 s_var_type 字段。类型化就是从非类型化 AST 到类型化 AST 的转换。这是我推荐的一个干净的解决方案。
承认你必须打破数据表示语义，因为你在这个阶段没有足够的信息，并尝试在解析阶段后接受返回诸如 St_void 之类的垃圾的想法，以便稍后重建正确的信息。这是较少类型化的（因为你对数据有一个隐含的假设，这在类型中并不明显），更务实，丑陋，但有时是必要的。我认为在这种情况下这不是正确的决定，但是您会遇到最好少输入一点的情况。

我认为独特的 id 处理设计的具体选择取决于您对这个更普遍的问题的立场，以及您对类型的具体决定。如果您选择解析后 AST 的更精细类型表示，则您可以选择是否包含唯一 id（我会，因为生成唯一 ID 非常简单，不需要单独的传递，而且我会比打字阶段稍微复杂化语法产生式）。如果您选择使用虚拟值破解类型字段，那么如果您愿意，对变量 id 执行此操作也是合理的，将 0 作为虚拟值并稍后定义它；但我个人仍然会在解析阶段这样做。

How to create a global id generator:

let unique =
  let counter = ref (-1) in
  fun () -> incr counter; !counter

Test:

# unique ();;
- : int = 0
# unique ();;
- : int = 1

Regarding your more general design question: it seems that your data representation does not faithfully represent the compiler phases. If you must return a type-aware data-type (with this field s_var_type) after the parsing phase, something is wrong. You have two choices:

devise a more precise data representation for the post-parsing AST, that would be different from the post-typing AST, and not have those s_var_type fields. Typing would then be a conversion from the untyped to the typed AST. This is a clean solution that I would recommend.
admit that you must break the data representation semantics because you don't have enough information at this stage, and try to be at peace with the idea of returning garbage such as St_void after the parsing phase, to reconstruct the correct information later. This is less typed (as you have an implicit assumption on your data which is not apparent in the type), more pragmatic, ugly but sometimes necessary. I don't think it's the right decision in this case, but you will encounter situation where it's better to be a bit less typed.

I think the specific choice of unique id handling design depends on your position on this more general question, and your concrete decisions about types. If you choose a finer-typed representation of post-parsing AST, it's your choice to decide whether to include unique ids or not (I would, because generating a unique ID is dead simple and doesn't need a separate pass, and I would rather slightly complexify the grammar productions than the typing phase). If you choose to hack the type field with a dummy value, it's also reasonable to do that for variable ids if you wish to, putting 0 as a dummy value and defining it later; but still I personally would do that in the parsing phase.

回复收藏 0 原文

~没有更多了~