C - 如何使用 grep 查找所有内循环？

发布于 2024-12-17 21:50:27 字数 250 浏览 2 评论 0原文

我有一个巨大的 C 项目，其中包含大量 C 文件。我必须找到所有内循环。我确信项目中没有任何 O(n3) 块，因此必须找到 O(n2) 复杂性块（循环中的循环）。

是否可以使用 grep 找到所有内循环？如果是，我可以使用什么正则表达式来查找所有类型的内部循环的所有出现，例如 {for,for}、{while,for}、{for, while}、{do, while} 等？如果没有，是否有任何简单的unix方式方法可以做到这一点（可能是多个greps或一种awk）？

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

你在看孤独的风景 2024-12-24 21:50:27

正则表达式适用于常规语言，您所描述的似乎是上下文无关的，我很确定它不能使用正则表达式来完成。请参阅此处类似问题的答案。您应该寻找其他类型的自动机，例如脚本语言（python 等）。

回复收藏 0 原文

赠佳期 2024-12-24 21:50:27

对于特定的编译器扩展来说这是一个很好的例子。 最新的 GCC 编译器（即 GCC 的 4.6 版本）可以通过插件进行扩展（用 C 语言编写的代码很痛苦） ) 或通过 MELT 扩展； MELT 是一种用于编码 GCC 扩展的高级领域特定语言，并且 MELT 比 C 更容易使用。

但是，我承认编码 GCC 扩展并不完全是微不足道的：您必须部分理解 GCC 是如何工作的，以及它的主要内部表示是什么（Gimple、Tree，...）。扩展 GCC 时，您基本上添加自己的编译器通道，它可以执行您想要的任何操作（包括检测嵌套循环）。编写 GCC 扩展通常需要一周以上的工作。（最难的部分是理解 GCC 是如何工作的）。

在 GCC 框架（通过 C 中的插件或 MELT 中的扩展）中工作的一大优势是，您的扩展与编译器处理相同的数据。

回到查找嵌套循环的问题，不要将其视为纯粹的语法（这就是 grep 无法工作的原因）。在 GCC 编译器中，在某些内部表示级别上，由 for、while、do 甚至使用 实现的循环>goto-s，仍然被认为是循环，对于 GCC 来说，这些东西可以嵌套（并且 GCC 知道嵌套！）。

回复收藏 0 原文

彡翼 2024-12-24 21:50:27

如果没有 C 解析器，您最多只能得到启发式解决方案。

如果您可以依赖代码中始终遵循的某些规则（例如，没有 goto、没有递归循环……），您可以使用正则表达式扫描预处理的 C 代码。当然，grep 还不够复杂，但只需几行 Perl 或类似的代码就可以实现。

但技术上更好、更可靠的方法是使用真正的 C 解析器。

回复收藏 0 原文

梦里泪两行 2024-12-24 21:50:27

C 中有三种循环：

“结构化语法”（while、for、...）
[注意 GCC，它可以隐藏语句，因此使用 (stmt; exp) 语法在表达式内循环！]
使用 goto 的临时循环；这些与结构化语法相互作用。
递归

要找到第一种，您必须找到结构化语法和嵌套。

Grep 当然可以找到关键字（如果忽略注释和字符串中的误报），但它找不到嵌套结构。您当然可以使用 grep 查找所有循环语法，然后简单地检查同一文件中出现的那些语法以查看它们是否嵌套。
（如果您想在没有误报价格的情况下执行此操作，您可以使用我们的源代码搜索引擎，它知道 C 的词法语法，并且永远不会混淆字符串何时是关键字、数字、字符串等。）

如果你想自动找到这些循环，你非常需要一个完整的 C 解析器扩展预处理完成。（否则某些宏可能隐藏循环语法的关键部分）。一旦你有了 C 语言的语法树，就可以很简单地（尽管可能有点不方便）编写一些爬过树的代码，检测循环语法节点，并计算子树中循环的嵌套数量。您可以使用任何能够解析 C 并为您提供抽象 sytnax 树的工具来完成此操作。 ANTLR 可能可以做到这一点；我认为 ANTLR 有一个 C 语法可以很好地处理 C，但在使用 ANTLR 之前您必须运行预处理器。

您还可以使用我们的 DMS Software Reengineering Toolkit 及其 C 前端。我们的 C 前端内置了完整的预处理器，因此它可以直接读取代码并在解析时进行扩展；它还可以处理相对广泛的 C 方言和字符编码（曾经处理过包含日语文本的 C 吗？）。 DMS 提供了一个额外的优势：给定一种语言（例如 C）前端，您可以直接使用该语言语法为该语言编写模式。因此，我们可以轻松表达我们想要查找的片段：

 pattern block_for_loop(t:expression,l:expression,i:expression, s: statements): statement
     " for(\t,\l\,\i) { \s } ";

 pattern statement_for_loop(t:expression,l:expression,i:expression, s: statement): statement
     " for(\t,\l\,\i)  \s ";

 pattern block_while_loop(c:expression, s: statements): statement
     " while(\c) { \s } ";

 pattern statement_while_loop(c:expression): statement
     " for(\c)  \s ";

 ...

并将它们组合在一起：

 pattern_set syntactic_loops
     { block_for_loop,
       statement_for_loop,
       block_while_loop,
       block_statement_loop,
       ...
     }

给定模式集，DMS 可以扫描语法树并找到与任何集合成员的匹配项，而无需编写任何特定的树爬行机制，也无需了解大量信息有关树结构的详细信息。（对于真正的 C 解析器来说，AST 中有很多节点类型！）以这种方式查找嵌套循环将非常简单：从上到下扫描树以查找循环（使用模式集）；任何点击都必须是顶级循环。扫描找到的循环 AST 节点的子树（当您知道外循环的树在哪里时很容易）以查找其他循环；任何命中都必须是嵌套循环；如有必要，请递归以拾取具有任意嵌套的循环。这也适用于 GCC 循环与语句的内容。树节点上印有精确的文件/行/列信息，因此可以轻松生成位置报告。

对于使用 goto 构建的临时循环（什么，您的代码没有任何？），您需要能够生成实际控制流图的东西，然后将该图构建为嵌套的控制流区域。这里的要点是，无论语法如何，包含无条件 goto out 的 while 循环并不是真正的循环。而如果 if 语句的 then 子句返回到 if 上游的代码，则很可能确实是一个循环。所有这些循环语法内容实际上只是色调暗示您可能有循环！
这些控制流区域包含控制流的真正嵌套； DMS 将构建 C 流图，并生成这些结构化区域。它提供了构建和访问该图的库；这样你就可以得到基于goto的“真正的”控制流。找到一对嵌套控制流区域后，可以访问与部分区域关联的 AST 以获取要报告的位置信息。

GCC 总是很有趣，因为它是 C 的显着增强版本。例如，它有一个间接 goto 语句。（常规 C 将此隐藏在 setjmp/longjmp 下！）。
要找出面对这种情况的循环，您需要点分析，DMS 也提供。该信息由嵌套区域分析使用。点分析的准确性存在（保守的）限制，但以您获得正确的嵌套区域图为模。

递归更难找到。为此，您必须确定 A 是否调用 B ... 调用 Z 调用 A，其中 A 和 B 以及 ... 可以位于单独的编译单元中。您需要一个全局调用图，其中包含应用程序的所有编译单元。此时，您可能希望我说 DMS 也能做到这一点，瞧，我很高兴地说它确实。当然，构建该调用图需要对函数调用进行点到分析；是的，DMS 也这样做。通过调用图，您可以在调用图中找到循环，这很可能是递归。此外，通过调用图，您还可以找到间接嵌套，例如，一个函数中的循环调用另一个也包含循环的编译单元中的函数。

为了准确地找到这样的循环结构，您需要大量的机器（这将需要一些努力，但 C 是一种很难分析的语言），DMS 可以提供它。

如果您不关心准确性，也不关心所有类型的循环，则可以使用 grep 和手动过程来获取结构化循环语句暗示的循环的半准确映射。

There are three kinds of loops in C:

"structured syntax" (while, for, ...)
[Watch out for GCC, which can hide statements therefore loops inside expressions using (stmt; exp) syntax!]
ad hoc loops using goto; these interact with the structured syntax.
recursion

To find the first kind, you have to find the structured syntax and the nesting.

Grep can certainly find the keywords (if you ignore false positives in comments and strings), but it can't find nested structures. You could of course use grep to find all the loop syntax and then simply inspect those that occurred in the same file to see if they were nested.
(If you wanted to do this without the false positives price, you could use our Source Code Search Engine, which knows the lexical syntax of C and is never confused as to when a string of characters is a keyword, a number, a string, etc.)

If you want to find those loops automatically, you pretty much need a full C parser with expanded preprocessing accomplished. (Otherwise some macro may hide a critical piece of loop syntax). Once you have a syntax tree for C, it is straightforward (although likely a bit inconvenient) to code something that clambers over the tree, detecting loop syntax nodes, and counting nesting of loops in subtrees. You can do this with any tool that will parse C and give you abstract sytnax trees. ANTLR can likely do this; I think there's a C grammar obtainable for ANTLR that handles C fairly well, but you'll have to run the preprocessor before using ANTLR.

You could also do this with our DMS Software Reengineering Toolkit with its C Front End. Our C Front End has a full preprocessor built in so it can read the code directly and expand as it parses; it also handles a relatively wide variety of dialects of C and character encodings (ever dealt with C containing Japanese text?). DMS provides an additional advantage: given a language (e.g., C) front end, you can write patterns for the that language directly using the language syntax. So we can express fragments of what we want to find easily:

 pattern block_for_loop(t:expression,l:expression,i:expression, s: statements): statement
     " for(\t,\l\,\i) { \s } ";

 pattern statement_for_loop(t:expression,l:expression,i:expression, s: statement): statement
     " for(\t,\l\,\i)  \s ";

 pattern block_while_loop(c:expression, s: statements): statement
     " while(\c) { \s } ";

 pattern statement_while_loop(c:expression): statement
     " for(\c)  \s ";

 ...

and group them together:

 pattern_set syntactic_loops
     { block_for_loop,
       statement_for_loop,
       block_while_loop,
       block_statement_loop,
       ...
     }

Given the pattern set, DMS can scan a syntax tree and find matches to any set member, without coding any particular tree crawling machinery and without knowing a huge amount of detail about the structure of the tree. (There's a lot of node types in an AST for a real C parser!) Finding nested loops this way would be pretty straightforward: scan the tree top down for a loop (using the pattern set); any hits must be top level loops. Scan subtrees of a found loop AST node (easy when you know where the tree for the outer loop is) for additional loops; any hits must be nested loops; recurse if necessary to pick up loops with arbitrary nesting. This works for the GCC loops-with-statements stuff, too. The tree nodes are stamped with precise file/line/column information so its easy to generate a report on location.

For ad hoc loops built using goto (what, your code doesn't have any?), you need something that can produce the actual control flow graph, and then structure that graph into nested control flow regions. The point here is that a while loop that contains an unconditional goto out isn't really a loop in spite of the syntax; and an if statement whose then clause gotos back to code upstream of the if is likely really a loop. All that loop syntax stuff is really just hueristic hints you may have loop!
Those control flow regions contain the real nesting of the control flow; DMS will construct C flow graphs, and will produce those structured regions. It provides libraries to build and access that graph; this way you can get the "true" control flow based on gotos. Having found a pair of nested control flow regions, one can access the AST associated with parts of region to get location information to report.

GCC is always a lot fun due to its signficantly enhanced version of C. For instance, it has an indirect goto statement. (Regular C has this hiding under setjmp/longjmp!).
To figure out loops in the face of this, you need points-to analysis, which DMS also provides. This information is used by the nested region analysis. There are (conservative) limits to the accuracy of points-to analysis, but modulo that you get the correct nested region graph.

Recursion is harder to find. For this, you have to determine if A calls B ... calls Z calls A, where A and B and ... can be in separate compilation units. You need a global call graph, containing all the compilation units of your application. At this point, you are probably expecting me to say that DMS does that too, voila, I'm pleased to say it does. Constructing that call graph of course requires points-to anlaysis for function calls; yes, DMS does that too. With the call graph, you can find cycles in the call graph, which are likely recursion. Also with the call graph, you can find indirect nesting, e.g., loops in one function, that call a function in another compilation unit that also contains loops.

To find structures such a loops accurately you need a lot of machinery (and this will take some effort, but then C is a bitch of a language to analyze) and DMS can provide it.

If you don't care about accuracy, and you don't care about all the kinds of loops, you can use grep and manual procedures to get a semi-accurate map of just the loops hinted at by the structured loop statements.

回复收藏 0 原文

过气美图社 2024-12-24 21:50:27

我怀疑单独使用 grep 不可能找到这样的东西：

public void do(){
    for(...){
        somethingElse();
    }
}

... Insert other code...

public void somethingElse(){
    for(.....){
        print(....);
    }
}

I suspect that finding something like this would be impossible using grep alone:

public void do(){
    for(...){
        somethingElse();
    }
}

... Insert other code...

public void somethingElse(){
    for(.....){
        print(....);
    }
}

回复收藏 0 原文

~没有更多了~