How to strip comments and intermediate whitespace from a string containing a C function declaration?

Posted on 2024-12-14 02:37:31

In my program, written in C++, I need to take a set of strings, each containing the declaration of a C function, and perform a number of operations on them.

One of the operations is to compare whether one function is equal to another. To do that I plan to just prune away comments and intermediate whitespace which has no effect on the semantics of the function and then do a string comparison. However, I would like to retain whitespace within a string as removing that would change the output produced by the function.

I could write some code which iterates over the string characters, enters "string mode" whenever a quote (") is encountered, and recognizes escaped quotes, but I wonder if there is any better way of doing this. One idea is to use a full-fledged C parser, run it over the function string, ignore all comments and excess whitespace, and then convert the AST back to a string again. But looking around at some C parsers, I get the feeling that most are a pain to integrate with my source code (prove me wrong if I am). Perhaps I could try to use yacc or something similar with an existing C grammar and implement the parser myself...

So, any ideas on the best way to do this?

EDIT:

The program I'm writing takes an abstract model and converts it into C code. The model consists of a graph, where the nodes may or may not contain segments of C code (more precisely, a C function definition whose execution must be completely deterministic, i.e. no global state, and in which no memory operations are allowed). The program does pattern matching on the graph and merges and splits certain nodes that adhere to these patterns. However, these operations can only be performed if the nodes exhibit the same functionality (i.e. if their C function definitions are the same). This "checking that they are the same" will be done by simply comparing the strings which contain the C function declarations. If they are character-by-character identical, then they are equal.

Due to the nature of how the models are generated, this is quite a reasonable method of comparison, provided that the comments and excess whitespace are removed, as these are the only things that may differ. This is the problem I'm facing: how to do this with a minimal amount of implementation effort?
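
For reference, the hand-written scan mentioned above (enter "string mode" on a quote and recognize escaped quotes) could look roughly like the sketch below. The function name `normalize_c_source` is made up for illustration. The sketch assumes there are no line continuations, trigraphs, or preprocessor directives in the input, and it collapses each run of whitespace and comments to a single space rather than removing it.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not from any library): strips /* */ and // comments
// and collapses whitespace runs to one space, while copying string and
// character literals verbatim.
std::string normalize_c_source(const std::string& src)
{
    std::string out;
    bool pending_space = false;          // a whitespace/comment gap was seen
    std::size_t i = 0;
    const std::size_t n = src.size();

    // Emit at most one space for a gap, and never at the very start.
    auto flush_space = [&] {
        if (pending_space && !out.empty()) out += ' ';
        pending_space = false;
    };

    while (i < n) {
        const char c = src[i];
        if (c == '/' && i + 1 < n && src[i + 1] == '*') {
            // Block comment: skip to the closing */ (or to end of input).
            const std::size_t end = src.find("*/", i + 2);
            i = (end == std::string::npos) ? n : end + 2;
            pending_space = true;
        } else if (c == '/' && i + 1 < n && src[i + 1] == '/') {
            // Line comment: skip up to (not past) the newline.
            const std::size_t end = src.find('\n', i + 2);
            i = (end == std::string::npos) ? n : end;
            pending_space = true;
        } else if (std::isspace(static_cast<unsigned char>(c))) {
            pending_space = true;
            ++i;
        } else if (c == '"' || c == '\'') {
            // "String mode": copy everything up to the matching quote.
            flush_space();
            const char quote = c;
            out += src[i++];
            while (i < n && src[i] != quote) {
                if (src[i] == '\\' && i + 1 < n) out += src[i++]; // keep escape pairs together
                out += src[i++];
            }
            if (i < n) out += src[i++];  // closing quote
        } else {
            flush_space();
            out += c;
            ++i;
        }
    }
    return out;                          // trailing gap is dropped
}
```

Two declarations can then be compared with `normalize_c_source(a) == normalize_c_source(b)`. Note that this only collapses whitespace; inputs that differ in whether any space appears between two tokens (for example `f( )` versus `f()`) would still compare unequal, and handling that needs an actual tokenizer.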

Comments (3)

后来的我们 2024-12-21 02:37:31

What do you mean by "compare whether one function is equal to another"? With a suitably precise meaning, that problem is known to be undecidable!

You did not say what your program is really doing. Parsing all real C programs correctly is not trivial (because the syntax and semantics of the C language are not that simple!).

Did you consider using existing tools or libraries to help you? LLVM Clang is a possibility, or extending GCC through plugins, or even better with extensions coded in MELT.

But we cannot help you more without understanding your real goal. And parsing C code is probably more complex than you imagine.

甜尕妞 2024-12-21 02:37:31

It looks like you can get away with a simple island grammar that removes comments, passes string literals through unchanged, and collapses whitespace (spaces, tabs, '\n'). Since I'm working with AXE, I wrote a quick grammar for you. You can write a similar set of rules using Boost.Spirit.

#include <axe.h>
#include <string>

template<class I>
std::string clean_text(I i1, I i2)
{
    // rules for non-recursive comments, and no line continuation
    auto endl = axe::r_lit('\n');
    auto c_comment = "/*" & axe::r_find(axe::r_lit("*/"));
    auto cpp_comment = "//" & axe::r_find(endl);
    auto comment = c_comment | cpp_comment;

    // rules for string literals
    auto esc_backslash = axe::r_lit("\\\\");
    auto esc_quote = axe::r_lit("\\\"");
    auto string_literal = '"' & *(*(axe::r_any() - esc_backslash - esc_quote) 
        & *(esc_backslash | esc_quote)) & '"';

    auto space = axe::r_any(" \t\n");
    auto dont_care = *(axe::r_any() - comment - string_literal - space);

    std::string result;
    // semantic actions
    // append everything matched
    auto append_all = axe::e_ref([&](I i1, I i2) { if(i1 != i2) result += std::string(i1, i2); });
    // append a single space
    auto append_space = axe::e_ref([&](I i1, I i2) { if(i1 != i2) result += ' '; });

    // island grammar for text
    auto text = *(dont_care >> append_all 
        & *comment
        & *string_literal >> append_all
        & *(space % comment) >> append_space)
        & axe::r_end();

    if(text(i1, i2).matched)
        return result;
    else
        throw "error";
}

So now you can do the text cleaning:

std::string text; // this is your function
text = clean_text(text.begin(), text.end());

You might also need to create rules for superfluous ';', empty blocks {}, and the like. You might also need to merge adjacent string literals. How far you need to go depends on the way the functions were generated; you may end up writing a sizable portion of the C grammar.

The AXE library is soon to be released under the Boost license.
I didn't test the code.

暖阳 2024-12-21 02:37:31

Perhaps your C functions that you want to parse are not as general (in their textual form, and also as parsed by a real compiler) as we are guessing.

You might consider doing things the other way round:

It could make sense to define a small domain-specific language (it could have a syntax much simpler to parse than C) and, instead of parsing C code, go the other way: the user would write in your DSL, and your tool would generate C code from it (to be compiled at a later stage by your usual C compiler).

Your DSL could actually be the description of your abstract model mixed with more procedural parts which are translated to C functions. Since the C functions you care about are quite specific, the DSL generating them could be small.

(Consider that many parser generators, such as ANTLR, YACC, or Bison, are built on a similar idea.)

I actually did something quite similar in MELT (see notably my DSL2011 paper). You might find some useful tricks there about designing a DSL that translates to C.
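
To make the "generate C from a DSL" direction concrete, here is a minimal sketch. It assumes a made-up expression-tree DSL; `Expr`, `param`, `constant`, `add`, `mul`, and `emit_function` are all hypothetical names and are not taken from MELT or any other tool. The point is that the emitter always prints the same tree the same way, so the generated C text is canonical and can be compared directly.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical mini-DSL: integer expressions over named parameters.
struct Expr {
    enum Kind { Param, Const, Add, Mul } kind = Const;
    std::string name;                    // used by Param
    int value = 0;                       // used by Const
    std::shared_ptr<Expr> lhs, rhs;      // used by Add / Mul
};
using ExprPtr = std::shared_ptr<Expr>;

ExprPtr param(std::string n) {
    auto e = std::make_shared<Expr>();
    e->kind = Expr::Param;
    e->name = std::move(n);
    return e;
}
ExprPtr constant(int v) {
    auto e = std::make_shared<Expr>();
    e->kind = Expr::Const;
    e->value = v;
    return e;
}
ExprPtr binary(Expr::Kind k, ExprPtr l, ExprPtr r) {
    auto e = std::make_shared<Expr>();
    e->kind = k;
    e->lhs = std::move(l);
    e->rhs = std::move(r);
    return e;
}
ExprPtr add(ExprPtr l, ExprPtr r) { return binary(Expr::Add, std::move(l), std::move(r)); }
ExprPtr mul(ExprPtr l, ExprPtr r) { return binary(Expr::Mul, std::move(l), std::move(r)); }

// Print an expression with fixed spacing and full parenthesization.
std::string emit_expr(const Expr& e) {
    switch (e.kind) {
    case Expr::Param: return e.name;
    case Expr::Const: return std::to_string(e.value);
    case Expr::Add:   return "(" + emit_expr(*e.lhs) + " + " + emit_expr(*e.rhs) + ")";
    case Expr::Mul:   return "(" + emit_expr(*e.lhs) + " * " + emit_expr(*e.rhs) + ")";
    }
    return "";
}

// Emit a complete C function definition in one canonical layout.
std::string emit_function(const std::string& name,
                          const std::vector<std::string>& params,
                          const Expr& body) {
    std::string out = "int " + name + "(";
    for (std::size_t i = 0; i < params.size(); ++i) {
        if (i) out += ", ";
        out += "int " + params[i];
    }
    if (params.empty()) out += "void";
    out += ") { return " + emit_expr(body) + "; }";
    return out;
}
```

For example, `emit_function("f", {"x", "y"}, *add(param("x"), mul(param("y"), constant(2))))` yields `int f(int x, int y) { return (x + (y * 2)); }`. With this design, comments and stray whitespace never enter the generated text in the first place, so no cleanup pass is needed before comparing.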
