使用 OcamlYacc / FsYacc 表示可选语法和重复
我正在尝试建立一些词法分析/解析语法的技能。 我回顾了我为 SQL 编写的一个简单解析器,我对它并不完全满意——似乎应该有一种更简单的方法来编写解析器。
SQL 让我很困惑,因为它有很多可选标记和重复。 例如:
SELECT *
FROM t1
INNER JOIN t2
INNER JOIN t3
WHERE t1.ID = t2.ID and t1.ID = t3.ID
相当于:
SELECT *
FROM t1
INNER JOIN t2 ON t1.ID = t2.ID
INNER JOIN t3 on t1.ID = t3.ID
ON
和 WHERE
子句是可选的,并且可以出现多次。 我在解析器中处理这些问题如下:
%{
open AST
%}
// ...
%token <string> ID
%token AND OR COMMA
%token EQ LT LE GT GE
%token JOIN INNER LEFT RIGHT ON
// ...
%%
op: EQ { Eq } | LT { Lt } | LE { Le } | GT { Gt } | GE { Ge }
// WHERE clause is optional
whereClause:
| { None }
| WHERE whereList { Some($2) }
whereList:
| ID op ID { Cond($1, $2, $3) }
| ID op ID AND whereList { And(Cond($1, $2, $3), $5) }
| ID op ID OR whereList { Or(Cond($1, $2, $3), $5) }
// Join clause is optional
joinList:
| { [] }
| joinClause { [$1] }
| joinClause joinList { $1 :: $2 }
joinClause:
| INNER JOIN ID joinOnClause { $3, Inner, $4 }
| LEFT JOIN ID joinOnClause { $3, Left, $4 }
| RIGHT JOIN ID joinOnClause { $3, Right, $4 }
// "On" clause is optional
joinOnClause:
| { None }
| ON whereList { Some($2) }
// ...
%%
换句话说,我通过将可选语法分解为单独的规则来处理可选语法,并使用递归处理重复。 这是可行的,但它将解析分解为一堆小子例程,并且很难看出语法实际代表什么。
我认为如果我可以在括号内指定可选语法并用 * 或 + 重复,那么编写起来会容易得多。 这会将我的 whereClause 和 joinList 规则减少为以下内容:
whereClause:
| { None }
// $1 $2, I envision $2 being return as an (ID * op * ID * cond) list
| WHERE [ ID op ID { (AND | OR) }]+
{ Some([for name1, op, name2, _ in $1 -> name1, op, name2]) }
joinClause:
| { None }
// $1, returned as (joinType
// * JOIN
// * ID
// * ((ID * op * ID) list) option) list
| [ (INNER | LEFT | RIGHT) JOIN ID [ON whereClause] ]*
{ let joinType, _, table, onClause = $1;
Some(table, joinType, onClause) }
我认为这种形式更更易于阅读,并且更直观地表达了它试图捕获的语法。 不幸的是,我在 Ocaml 或 F# 文档中找不到任何支持此表示法或类似内容的内容。
是否有一种简单的方法可以在 OcamlYacc 或 FsYacc 中表示具有可选或重复标记的语法?
I'm trying to build up some skills in lexing/parsing grammars. I'm looking back on a simple parser I wrote for SQL, and I'm not altogether happy with it -- it seems like there should have been an easier way to write the parser.
SQL tripped me up because it has a lot of optional tokens and repetition. For example:
SELECT *
FROM t1
INNER JOIN t2
INNER JOIN t3
WHERE t1.ID = t2.ID and t1.ID = t3.ID
Is equivalent to:
SELECT *
FROM t1
INNER JOIN t2 ON t1.ID = t2.ID
INNER JOIN t3 on t1.ID = t3.ID
The ON
and WHERE
clauses are optional and can occur more than once. I handled these in my parser as follows:
%{
open AST
%}
// ...
%token <string> ID
%token AND OR COMMA
%token EQ LT LE GT GE
%token JOIN INNER LEFT RIGHT ON
// ...
%%
op: EQ { Eq } | LT { Lt } | LE { Le } | GT { Gt } | GE { Ge }
// WHERE clause is optional
whereClause:
| { None }
| WHERE whereList { Some($2) }
whereList:
| ID op ID { Cond($1, $2, $3) }
| ID op ID AND whereList { And(Cond($1, $2, $3), $5) }
| ID op ID OR whereList { Or(Cond($1, $2, $3), $5) }
// Join clause is optional
joinList:
| { [] }
| joinClause { [$1] }
| joinClause joinList { $1 :: $2 }
joinClause:
| INNER JOIN ID joinOnClause { $3, Inner, $4 }
| LEFT JOIN ID joinOnClause { $3, Left, $4 }
| RIGHT JOIN ID joinOnClause { $3, Right, $4 }
// "On" clause is optional
joinOnClause:
| { None }
| ON whereList { Some($2) }
// ...
%%
In other words, I handled optional syntax by breaking it into separate rules, and handled repetition using recursion. This works, but it breaks parsing into a bunch of little subroutines, and it's very hard to see what the grammar actually represents.
I think it would be much easier to write if I could specify optional syntax inside brackets and repetition with an * or +. This would reduce my whereClause and joinList rules to the following:
whereClause:
| { None }
// $1 $2, I envision $2 being return as an (ID * op * ID * cond) list
| WHERE [ ID op ID { (AND | OR) }]+
{ Some([for name1, op, name2, _ in $1 -> name1, op, name2]) }
joinClause:
| { None }
// $1, returned as (joinType
// * JOIN
// * ID
// * ((ID * op * ID) list) option) list
| [ (INNER | LEFT | RIGHT) JOIN ID [ON whereClause] ]*
{ let joinType, _, table, onClause = $1;
Some(table, joinType, onClause) }
I think this form is much easier to read and expresses the grammar it's trying to capture more intuitively. Unfortunately, I can't find anything in either the Ocaml or F# documentation which supports this notation or anything similar.
Is there an easy way to represent grammars with optional or repetitive tokens in OcamlYacc or FsYacc?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
Menhir 允许通过另一个符号对非终结符号进行参数化,并提供标准快捷方式库,例如选项和列表,您可以创建自己的。 示例:
还有一些语法糖,‘token?’ 相当于“option(token)”,“token+”相当于“nonempty_list(token)”。
所有这些确实缩短了语法定义。 它还受 ocamlbuild 支持,可以直接替代 ocamlyacc。 强烈推荐!
有趣的是,我也用它来解析 SQL :)
Menhir allows to parametrize nonterminal symbols by another symbols and provides the library of standard shortcuts, like optionals and lists, and you can create your own. Example:
There is also some syntax sugar, 'token?' is equivalent to 'option(token)', 'token+' to 'nonempty_list(token)'.
All of this really shortens grammar definition. Also it is supported by ocamlbuild and can be a drop-in replacement for ocamlyacc. Highly recommended!
Funny, I used it to parse SQL too :)
当你组合所有的小片段时,你应该会得到你想要的东西。 而不是:
您只需将
其定义为这三个关键字的并集。
您还可以在词法分析器中定义联合。 在您定义令牌的方式中,或者使用 camlp4 中,我在该领域没有做太多事情,所以我不能建议您采取这些路线。 而且我不认为它们会像到处都有小部件一样适合你。
编辑:
因此,对于 camlp4,您可以查看 Camlp4 语法模块 和教程 和更好的教程。 请注意,这并不完全是您想要的,但已经非常接近了。 该文档非常糟糕,正如 最近关于 ocaml 的讨论中所表达的那样组,但对于这个特定领域,我认为您不会遇到太多问题。 我用它做了一些工作,可以回答更多问题。
When you compose all the little pieces you should get something like you want though. Instead of:
you just have
and define that to be the union of those three keywords.
You can also define the union in the lexer. in the way you define the tokens, or using camlp4, I haven't done much in that area, so I cannot advise you to take those routes. And I don't think they'll work for you as well as just having little pieces everywhere.
EDIT:
So, for camlp4 you can look at Camlp4 Grammar module and a tutorial and a better tutorial. This isn't exactly what you want, mind you, but it's pretty close. The documentation is pretty bad, as expressed in the recent discussion on the ocaml groups, but for this specific area, I don't think you'll have too many problems. I did a little with it and can field more questions.