Tool to parse by grammar to an AST (or .y + .lang => XML)
I'm looking for a tool that, given a lexer definition file, a grammar file (say, the PostgreSQL .y and .l flex and bison programs from its source tree), and a file in the language defined by that lexer and parser (say, an SQL query), produces the AST in some standard form (say, JSON or XML).
The most important aspect of this tool is flexibility of the input format. In my example, I could recreate the Postgres SQL grammar in ANTLR, but I don't want to. I'd rather just use whatever Postgres is using. So even though the .y file contains more than the parsing rules, the tool I'm looking for should be able to understand it with minor modifications.
Is there a generic tool that does that?
Here's a command-line session with my imaginary tool, ly2xml:
$ git clone git://postgres-git-url pg
$ find pg -iname '*.[yl]' -exec cp '{}' ~/ \;
$ echo 'SELECT * FROM (SELECT 1)' | ly2xml -parser=*.y -lexer=*.l - -O-
<SELECT>
  <ARGS>*</ARGS>
  <FROM>
    <SELECT><ARGS>1</ARGS></SELECT>
  </FROM>
</SELECT>
(Note that - means it reads from standard input, and -O- means it writes to standard output.)
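To make the desired output concrete, here is a minimal Python sketch of the last step only: serializing an already-built AST to XML. The tuple-based AST shape and the name `ast_to_xml` are assumptions made for illustration. Nothing here reads .y or .l files, and ly2xml itself remains imaginary.

```python
import xml.etree.ElementTree as ET

# Hypothetical AST shape: (node_name, [children]), where each child is
# either another node tuple or a plain string leaf.
def ast_to_xml(node):
    name, children = node
    elem = ET.Element(name)
    last = None  # most recently appended child element, if any
    for child in children:
        if isinstance(child, str):
            # Leaf text goes into .text before the first child element,
            # and into the previous child's .tail afterwards.
            if last is None:
                elem.text = (elem.text or "") + child
            else:
                last.tail = (last.tail or "") + child
        else:
            last = ast_to_xml(child)
            elem.append(last)
    return elem

# Hand-written AST for: SELECT * FROM (SELECT 1)
query_ast = ("SELECT", [
    ("ARGS", ["*"]),
    ("FROM", [("SELECT", [("ARGS", ["1"])])]),
])

xml_text = ET.tostring(ast_to_xml(query_ast), encoding="unicode")
print(xml_text)
```

The hard part of the question is producing `query_ast` from an arbitrary grammar file. The serialization shown here is the easy part.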
Comments (1)
Nice thought. You're assuming one or more of:
a) is clearly false. I've never seen b). Practically none of the parsing engines do c); they can only parse "full programs".
Your only hope IMHO is to use a parser generator that has a large number of well tested language definitions.
ANTLR is arguably one; it certainly has a long list of contributed language definitions. And they're all sort of findable in one place. Doesn't do language fragments, though, that I know of. Doubt if it has XML export for all parse trees.
Bison is arguably one; there are lots and lots of language processors built using Bison. But the definitions are scattered everywhere and it will be very hard to collect them. Also doesn't do language fragments. Pretty sure it doesn't have XML export.
Our DMS Software Reengineering Toolkit is arguably one. Has lots of language definitions. They're all collected in one place (our company). It does produce ASTs for every parse, and does have built-in XML export. DMS also can parse any language nonterminal for any language it knows.
DMS can simulate your example pretty well, given a DMS .lex, .atg ("attributed grammar") and a compatible source file.
What follows is a DMS lexer/parser build and run, with XML export, for the Algebra grammar found at "Algebra as DMS Domain" (the ++XML halfway down the example is the parsing step being told to export XML):
If you really wanted an engine that understood many grammar notations, it might be easiest to build such an engine with DMS. Simply define each of the grammar formalisms (e.g., ANTLR or bison) as a DSL to DMS, parse a specific grammar formalism instance (e.g., an ANTLR BNF instance) using DMS, apply DMS rewrite rules to transform that to a DMS grammar, and then build a DMS parser. (You'd have to do the same with the lexer, too.)
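As a rough illustration of why translating a grammar notation takes real work, here is a Python sketch of just the first step for a bison file: isolating the rules section (the text between the two `%%` lines) and dropping the embedded C actions. This is plain Python, not DMS. The helper names are hypothetical, and the brace counting is a simplifying assumption that breaks on braces inside character literals, strings, or comments. A real translator would parse the notation properly, as described above.

```python
def grammar_rules_section(y_source):
    """Return the text between the first and second '%%' separator lines."""
    parts = []
    section = 0  # 0 = declarations, 1 = rules, 2 = epilogue
    for line in y_source.splitlines():
        if line.strip() == "%%":
            section += 1
            continue
        if section == 1:
            parts.append(line)
    return "\n".join(parts)

def strip_actions(rules):
    """Drop brace-delimited action blocks, tracking nesting depth.

    Assumption: no braces appear in literals, strings, or comments.
    """
    out = []
    depth = 0
    for ch in rules:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        elif depth == 0:
            out.append(ch)
    return "".join(out)

# Tiny made-up grammar file, used only to exercise the helpers.
sample = """%token NUMBER
%%
expr: expr '+' term { $$ = $1 + $3; }
    | term { if ($1) { $$ = $1; } }
    ;
%%
int main(void) { return yyparse(); }
"""

# Collapse whitespace so the bare rule is easy to read.
bare = " ".join(strip_actions(grammar_rules_section(sample)).split())
print(bare)
```

Even for this small input, the declarations, the actions, and the epilogue all have to be separated from the rules before any conversion to another grammar formalism can begin.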