从抽象语法树获取控制流程图

发布于 2024-07-05 17:58:45 字数 261 浏览 6 评论 0原文

我有一个源自 Java ANTLR Parser Generator 的 AST。 我想要做的是以某种方式构建源代码的控制流图,其中每个语句或表达式都是一个唯一的节点。 我知道这个识别一定有一些递归性,我想知道你会建议什么作为最好的选择,以及 ANTLR 是否有一个我可以用来完成这项工作的工具集。 干杯, Chris


编辑 - 我主要关心的是从 AST 获取控制流图(CFG)。 这样我就可以获得源的树表示。 澄清一下,源代码和实现语言都是Java。

I have an AST derived from the ANTLR Parser Generator for Java. What I want to do is somehow construct a control flow graph of the source code, where each statement or expression is a unique Node. I understand there must be some recursiveness to this identification, I was wondering what you would suggest as the best option and if ANTLR has a toolset I can use for this job.
Cheers,
Chris


EDIT - My main concern is to get a control flow graph(CFG) from the AST. This way I can get a tree representation of the source. To clarify, both the source code and the implementation language is Java.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(6

离去的眼神 2024-07-12 17:58:45

通常,CFG 是在较低级别的表示(例如 JVM 字节码)上计算的。 有人就这样的事情做了一篇论文几年前。 其中可能描述了一种有用的方法来获取该表示。

由于您的源语言和目标语言相同,因此没有代码生成步骤 - 您已经完成了! 然而,现在你可以走 AST 了。 在 AST 的每个节点,你都要问自己:这是否是一条“跳转”指令? 方法调用和 if 语句是跳转指令的示例。 循环结构(例如 forwhile)也是如此。 加法和乘法等指令是非跳转的。

首先将 CFG 中的每个 java 语句与一个节点相关联,以及一个入口和出口节点。 作为第一个近似,遍历树并:

  1. 如果当前语句是方法调用,则找出该方法调用的相应主体的入口节点在哪里,并创建一条从当前语句指向该入口节点的边。 如果该语句是方法返回,则枚举可能调用它的位置并为其添加边缘。
  2. 对于每个非跳跃语句,在它和下一个语句之间建立一条边。

这将为您提供某种的CFG。 该过程在步骤 2 中有点麻烦,因为调用的方法可能在库中声明,而不是在 AST 中的其他地方声明 - 如果是这样,要么不创建边,要么为表示该条目的特殊节点创建边库方法。

这有道理吗?

Usually CFGs are computed on a lower-level representation (e.g. JVM bytecode). Someone did a thesis on such things a few years ago. There might be a helpful way described in there for how to get at that representation.

Since your source and target languages are the same, there's no code generation step -- you're already done! However, now you get to walk the AST. At each node of the AST, you have to ask yourself: is this a "jumping" instruction or not? Method calls and if statements are examples of jumping instructions. So are loop constructs (such as for and while). Instructions such as addition and multiplication are non-jumping.

First associate with each java statement a node in the CFG, along with an entry and exit node. As a first approximation, walk the tree and:

  1. if the current statement is a method call, figure out where the entry node is for the corresponding body of that method call, and make an edge pointing from the current statement to that entry node. if the statement is a method return, enumerate the places that could have called it and add an edge to those.
  2. for each non-jumping statement, make an edge between it and the next statement.

This will give you some kind of CFG. The procedure is slightly hairy in step 2 because the method called may be declared in a library, and not elsewhere in the AST -- if so, either don't make an edge or make an edge to a special node representing the entry to that library method.

Does this make sense?

梦忆晨望 2024-07-12 17:58:45

您是否尝试过ANTLR Studio? 它不会生成空洞 AST 图,但对于回顾来说,它已经非常有帮助了。

Did you ever tryed ANTLR Studio? It does not generate the hole AST graph, but for review, its already quite helpful.

﹂绝世的画 2024-07-12 17:58:45

当我过去这样做时,我使用 graphviz,特别是点工具,来生成图表。 我通过在编译时实际遍历控制流图来创建点输入文件。

图形布局是一个难题,而 graphviz 做得非常出色。 它可以输出为 ps、pdf 和各种图像格式,并且布局通常看起来非常直观。 我强烈推荐它。

When I have done this in the past, I used graphviz, in particular the dot tool, to generate the graph. I created the dot input file by actually traversing the control-flow graph at compile time.

Graph layout is a hard problem, and graphviz does an excellent job. It can output to ps, pdf, and various image formats, and the layout is usually pretty intuitive to look at. I highly recommend it.

氛圍 2024-07-12 17:58:45

我认为我无法以您可能正在寻找的方式回答您的问题,因为我不知道 ANTLR 中有什么方法可以生成带有或不带有 AST 的 CFG。 但是,简而言之,您将使用 ANTLR 生成的内容来生成一个单独的 Java 程序来生成 CFG。 您可以利用 ANTLR 生成的语法树作为输入,在您自己创建的单独 Java 程序中生成 CFG。 此时,您实质上正在构建一个编译器。 “编译器”和 JVM 之间的区别在于,您的输出是程序如何分支其各种执行路径的可视化表示 (CFG),而 JVM/Java 编译器生成在真实机器 (CPU) 上执行的代码。

打个比方,如果有人坐下来写一本书(例如英语),句子中使用的各个单词就是计算机语言的令牌,句子的形成方式与上下文无关语法表达有效计算机代码的方式类似,而段落& 整部小说以类似的方式讲述一个故事,语义分析/编译器/CFG 可能会产生/表示逻辑上有效的程序,这些程序实际上做了一些有用的事情,并且或多或少没有逻辑错误。 换句话说,一旦克服了有效语法(正确的句子结构)的问题,任何人都可以在一页上写一堆句子,但只有某些句子组合才能产生实际做某事(讲故事)的文本。

您要问的是最后一部分 - 如何获取语法树并转换或解释 AST 实际在逻辑上所做的事情。 当然,您需要为您想要执行此操作的每种语言构建一个“编译器”。 拥有正确的语法并不能告诉您程序的作用 - 只是从语法的角度来看程序是正确的。

Linters、语法荧光笔和 IDE 都是围绕着这样的想法构建的:试图让最后一块拼图对人类来说变得更容易、更高效。

I don't think I'll be able to answer your question in a way that you are maybe looking for since I don't know of any way in ANTLR to produce a CFG with or without an AST. But, in short you would use what ANTLR produces to generate a separate Java program to produce a CFG. You would utilize the ANTLR generated syntax tree as input to generate your CFG in a separate Java program of your own creation. At this point you are, in essence, building a compiler. The difference between your "compiler" and a JVM is that your output is a visual representation (CFG) of how a program branches its various execution paths and a JVM/Java compiler produces code for execution on a real machine (CPU).

An analogy is if someone sits down to write a book (in English for example), the individual words used in sentences are the TOKENS of a computer language, sentences are formed in a similar manner that context free grammars express valid computer code, and paragraphs & whole novels tell a story in a similar manner that semantic analysis/compilers/CFGs might produce/represent logically valid programs that actually do something useful and are more or less free of logic bugs. In other words, once you get past the issue of valid syntax (correct sentence structure), anyone can write a bunch of sentences on a page but only certain combinations of sentences produce text that actually does something (tell a story).

What you're asking about is that last piece - how to go about taking a syntax tree and transforming or interpreting what the AST actually does logically. And of course you would need to build a "compiler" for each language you want to do this for. Having a correct grammar doesn't tell you what a program does - just that a program is correct from a grammar perspective.

Linters and syntax highlighters and IDEs are all built around the idea of trying to make this last piece of the puzzle an easier and a more efficient task for humans.

甚是思念 2024-07-12 17:58:45

根据一些评论,听起来OP确实想做代码生成 -- 将 AST 转换为基于基本块和跳转点的较低级指令序列。

代码生成是非常特定于语言的,并且已经在这个主题上投入了大量的工作。 在进行代码生成之前,您需要了解您的目标语言 - 无论是汇编语言还是其他高级语言。 一旦确定了这一点,您只需遍历 AST 并生成一系列指令来实现 AST 中的代码。 (我说这很简单,但它可能很难 - 很难概括,因为这里的考虑因素非常特定于语言。)

您为代码生成选择的表示将隐式或显式地包含控制流图。 如果您的目标语言相当低级(接近汇编语言),那么控制流图应该相对容易提取。

(如果您需要更多说明,请发表评论。)

Based on some comments, it sounds like the OP really wants to do code generation -- to convert the AST into a lower-level sequence of instructions based on basic blocks and jump points.

Code generation is very language-specific, and a lot of work has been put into this topic. Before you do code generation you need to know your target language -- whether it be assembler or simply some other high-level language. Once you have identified this, you simply need to walk the AST and generate a sequence of instructions that implements the code in the AST. (I say this is simple, but it can be hard -- it's hard to generalise because the considerations here are pretty language-specific.)

The representation you choose for code generation will contain the control-flow graph, implicitly or explicitly. If your target language is fairly low-level (close to assembler), then the control-flow graph should be relatively easy to extract.

(Please comment if you'd like more clarification.)

洒一地阳光 2024-07-12 17:58:45

生成真正考虑到所有语言的完整控制流程图
问题比看起来更难。 您不仅要确定什么
看起来是“基本块”,但你必须识别函数调用
(有点简单,但确定目标可能会更困难),
其中可能发生类初始值设定项等幕后操作。
并担心可能发生异常的地方
以及如果确实发生异常时控制权的去向。

如果你仔细检查大多数语言,它们也会
清楚表达式中计算求值的顺序,
如果表达式中有两个副作用,这一点就很重要;
控制流应该反映顺序(或非顺序,
如果没有定义)。

也许您只想要控制流的抽象
具有基本块和条件。 那是
显然容易一些。

无论哪种情况(简单 CFG 或完整 CFG),您都需要遍历 AST,
在每个点都有可能的控制流目标的参考
(例如,对于大多数情况,例如 IF 语句,有两个流目标:
THEN 和 ELSE 子句)。 在每个节点上,将该节点链接到
适当的控制流目标,可能替换流目标
(例如,当您遇到 IF 时)。

对于 Java(或 C)的完整语言语义来说,这样做是相当困难的。
很多工作。 您可能只想使用一个计算此的工具
现成的。
请参阅http://www.semanticdesigns.com/Products/DMS/FlowAnalysis.html
通过我们的工具,了解这到底是什么样子。

Producing a full control flow graph that really takes into account all the language
issues is harder than it looks. Not only do you have to identify what
appears to be the "basic blocks", but you have to identify function calls
(sort of easy, but identifying the target might be harder),
where behind-the-scenes operations such as class initializers can happen.
and to worry about the points where exceptions can occur
and where control goes if an exception does occur.

If you examine most languages carefully, they will also be
clear about ordering of evaluation of computations in expressions,
and this matters if you have two side-effects in an expression;
the control flow should reflect the order (or the non-order,
if it isn't defined).

Maybe you only want an abstraction of the control flow
having the basic blocks and the conditionals. That's
obviously a bit easier.

In either case (simple CFG or full CFG), you need to walk the AST,
at each point having a reference to possible control flow targets
(e.g., for most cases, such as IF statements, there are two flow targets:
the THEN and ELSE clauses). At each node, link that node to the
appropriate control flow target, possibly replacing the flow targets
(e.g., when you encounter an IF).

To do this for the full language semantics of Java (or C) is quite
a lot of work. You may want to simply use a tool that computes this
off-the-shelf.
See http://www.semanticdesigns.com/Products/DMS/FlowAnalysis.html
for what this really looks like, coming out of our tools.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文