How do I strip comments and intermediate whitespace from strings containing C function declarations?

Posted 2024-12-14 02:37:31 · Length 793 · Views 0 · Comments 0

In my program, written in C++, I need to take a set of strings, each containing the declaration of a C function, and perform a number of operations on them.

One of the operations is to check whether one function is equal to another. To do that, I plan to strip comments and intermediate whitespace, which have no effect on the semantics of the function, and then do a string comparison. However, I want to retain whitespace inside string literals, since removing it would change the output produced by the function.

I could write some code that iterates over the characters, enters "string mode" whenever a quote (") is encountered, and recognizes escaped quotes, but I wonder if there is a better way of doing this. One idea is to use a full-fledged C parser: run it over the function string, ignore all comments and excess whitespace, and then convert the AST back to a string. But looking around at some C parsers, I get the feeling that most are a pain to integrate with my source code (prove me wrong if I am). Perhaps I could use yacc or something similar with an existing C grammar and implement the parser myself...

So, any ideas on the best way to do this?

EDIT:

The program I'm writing takes an abstract model and converts it into C code. The model consists of a graph whose nodes may or may not contain segments of C code. More precisely, each segment is a C function definition whose execution must be completely deterministic (i.e. no global state), and no memory operations are allowed. The program does pattern matching on the graph and merges and splits certain nodes that adhere to these patterns. However, these operations can only be performed if the nodes exhibit the same functionality (i.e. if their C function definitions are the same). This "checking that they are the same" will be done by simply comparing the strings that contain the C function declarations. If they are character-by-character identical, then they are equal.

Because of how the models are generated, this is quite a reasonable method of comparison, provided that comments and excess whitespace are removed, since these are the only things that may differ. This is the problem I'm facing: how do I do this with a minimal amount of implementation effort?

后来的我们 2024-12-21 02:37:31

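What do you mean by comparing whether one function is equal to another? With a suitably precise meaning, that problem is known to be undecidable!

You haven't told us what your program is really doing. Correctly parsing all real C programs is not trivial (because the syntax and semantics of C are not that simple!).

Have you considered using existing tools or libraries to help you? LLVM Clang is one possibility, or extending GCC through plugins, or better yet with extensions coded in MELT.

But we cannot help you further without understanding your real goal. Parsing C code is probably more complex than you imagine.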

甜尕妞 2024-12-21 02:37:31

It looks like you can get away with a simple island grammar that strips comments, preserves string literals, and collapses whitespace (spaces, tabs, '\n'). Since I'm working with AXE^†, I wrote a quick grammar^‡ for you. You can write a similar set of rules using Boost.Spirit.

#include <axe.h>
#include <string>

template<class I>
std::string clean_text(I i1, I i2)
{
    // rules for non-recursive comments, and no line continuation
    auto endl = axe::r_lit('\n');
    auto c_comment = "/*" & axe::r_find(axe::r_lit("*/"));
    auto cpp_comment = "//" & axe::r_find(endl);
    auto comment = c_comment | cpp_comment;

    // rules for string literals
    auto esc_backslash = axe::r_lit("\\\\");
    auto esc_quote = axe::r_lit("\\\"");
    auto string_literal = '"' & *(*(axe::r_any() - esc_backslash - esc_quote) 
        & *(esc_backslash | esc_quote)) & '"';

    auto space = axe::r_any(" \t\n");
    auto dont_care = *(axe::r_any() - comment - string_literal - space);

    std::string result;
    // semantic actions
    // append everything matched
    auto append_all = axe::e_ref([&](I i1, I i2) { if(i1 != i2) result += std::string(i1, i2); });
    // append a single space
    auto append_space = axe::e_ref([&](I i1, I i2) { if(i1 != i2) result += ' '; });

    // island grammar for text
    auto text = *(dont_care >> append_all 
        & *comment
        & *string_literal >> append_all
        & *(space % comment) >> append_space)
        & axe::r_end();

    if(text(i1, i2).matched)
        return result;
    else
        throw "error";
}

So now you can do the text cleaning:

std::string text; // this is your function
text = clean_text(text.begin(), text.end());

You might also need to create rules for superfluous ';', empty blocks {}, and the like. You might also need to merge adjacent string literals. How far you need to go depends on the way the functions were generated; you may end up writing a sizable portion of the C grammar.

^† The AXE library is soon to be released under the Boost license.
^‡ I didn't test the code.
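
As an illustration of the "merge string literals" step mentioned above, here is a minimal sketch that does not use AXE. It is written as a separate pass over text that has already had its comments removed. The name `merge_adjacent_literals` is hypothetical, and the sketch assumes plain (unprefixed) literals. One caveat: joining a literal that ends in a hex or octal escape with the next one (for example "\x1" "2") can change its meaning, so this is only safe if the generated code never does that.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: joins adjacent string literals ("a" "b" -> "ab").
// Assumes comments are already stripped and literals carry no prefix (L, u8, ...).
std::string merge_adjacent_literals(const std::string& src)
{
    std::string out;
    std::size_t i = 0;
    while (i < src.size()) {
        const char c = src[i];
        if (c == '\'') {
            // Copy a character literal verbatim so that '"' does not open a string.
            out += src[i++];
            while (i < src.size()) {
                const char d = src[i];
                if (d == '\\' && i + 1 < src.size()) {
                    out += src[i++];
                    out += src[i++];
                    continue;
                }
                out += src[i++];
                if (d == '\'') break;
            }
        } else if (c == '"') {
            out += '"';
            ++i;  // skip the opening quote
            for (;;) {
                // Copy the literal body up to, but not including, the closing quote.
                while (i < src.size() && src[i] != '"') {
                    if (src[i] == '\\' && i + 1 < src.size()) out += src[i++];
                    out += src[i++];
                }
                if (i >= src.size()) return out;  // unterminated literal: stop here
                ++i;  // skip the closing quote
                // Look past whitespace for another literal to merge with.
                std::size_t j = i;
                while (j < src.size() &&
                       std::isspace(static_cast<unsigned char>(src[j]))) ++j;
                if (j < src.size() && src[j] == '"') {
                    i = j + 1;  // drop the gap and the next opening quote
                    continue;
                }
                break;
            }
            out += '"';
        } else {
            out += src[i++];
        }
    }
    return out;
}
```

Literals separated by anything other than whitespace, such as the comma in `f("a", "b")`, are left untouched.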
