不使用条件编译加速 C 程序

发布于 2024-12-27 14:08:06 字数 1930 浏览 3 评论 0 原文

我们正在开发一种模型检查工具,它可以执行某些搜索例程数十亿次。我们有不同的搜索例程,当前使用预处理器指令选择这些例程。这不仅非常不方便,因为我们每次做出不同的选择时都需要重新编译,而且还使代码难以阅读。现在是时候开始新版本了,我们正在评估是否可以避免条件编译。

这是一个非常人为的示例,显示了效果:

/* program_define */

#include <stdio.h>
#include <stdlib.h>

#define skip 10

int main(int argc, char** argv) {
    int i, j;
    long result = 0;

    int limit = atoi(argv[1]);

    for (i = 0; i < 10000000; ++i) {
        for (j = 0; j < limit; ++j) {
            if (i + j % skip == 0) {
                continue;
            }
            result  += i + j;
        }
    }

    printf("%lu\n", result);
    return 0;
}

这里,变量 skip 是影响程序行为的值的示例。不幸的是,每次我们想要一个新的 skip 值时,我们都需要重新编译。

让我们看看该程序的另一个版本:

/* program_variable */

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char** argv) {
    int i, j;
    long result = 0;

    int limit = atoi(argv[1]);
    int skip = atoi(argv[2]);

    for (i = 0; i < 10000000; ++i) {
        for (j = 0; j < limit; ++j) {
            if (i + j % skip == 0) {
                continue;
            }
            result  += i + j;
        }
    }

    printf("%lu\n", result);
    return 0;
}

这里,skip 的值作为命令行参数传递。这增加了很大的灵活性。然而,这个程序要慢得多:

$ time ./program_define 1000 10
50004989999950500

real 0m25.973s
user 0m25.937s
sys  0m0.019s

$ time ./program_variable 1000 10
50004989999950500

real 0m50.829s
user 0m50.738s
sys  0m0.042s

我们正在寻找的是一种将值传递到程序中的有效方法(通过命令行参数或文件输入),并且以后永远不会改变。有没有办法优化代码(或告诉编译器)以使其运行更有效?

非常感谢任何帮助!


评论:

正如 Dirk 在他的评论中所写,这与具体示例无关。我的意思是一种替换 if 的方法,它评估一个在被调用数十亿次的函数内设置一次然后从未更改的变量(例如命令行选项) 通过更有效的构造。我们目前使用预处理器来定制所需的函数版本。如果有更好的方法,不需要重新编译,那就太好了。

we are working on a model checking tool which executes certain search routines several billion times. We have different search routines which are currently selected using preprocessor directives. This is not only very unhandy as we need to recompile every time we make a different choice, but also makes the code hard to read. It's now time to start a new version and we are evaluating whether we can avoid conditional compilation.

Here is a very artificial example that shows the effect:

/* program_define */

#include <stdio.h>
#include <stdlib.h>

#define skip 10

int main(int argc, char** argv) {
    int i, j;
    long result = 0;

    int limit = atoi(argv[1]);

    for (i = 0; i < 10000000; ++i) {
        for (j = 0; j < limit; ++j) {
            if (i + j % skip == 0) {
                continue;
            }
            result  += i + j;
        }
    }

    printf("%lu\n", result);
    return 0;
}

Here, the variable skip is an example for a value that influences the behavior of the program. Unfortunately, we need to recompile every time we want a new value of skip.

Let's look at another version of the program:

/* program_variable */

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char** argv) {
    int i, j;
    long result = 0;

    int limit = atoi(argv[1]);
    int skip = atoi(argv[2]);

    for (i = 0; i < 10000000; ++i) {
        for (j = 0; j < limit; ++j) {
            if (i + j % skip == 0) {
                continue;
            }
            result  += i + j;
        }
    }

    printf("%lu\n", result);
    return 0;
}

Here, the value for skip is passed as a command line parameter. This adds great flexibility. However, this program is much slower:

$ time ./program_define 1000 10
50004989999950500

real 0m25.973s
user 0m25.937s
sys  0m0.019s

vs.

$ time ./program_variable 1000 10
50004989999950500

real 0m50.829s
user 0m50.738s
sys  0m0.042s

What we are looking for is an efficient way to pass values into a program (by means of a command line parameter or a file input) that will never change afterward. Is there a way to optimize the code (or tell the compiler to) such that it runs more efficiently?

Any help is greatly appreciated!


Comments:

As Dirk wrote in his comment, it is not about the concrete example. What I meant was a way to replace an if that evaluates a variable that is set once and then never changed (say, a command line option) inside a function that is called literally billions of times by a more efficient construct. We currently use the preprocessor to tailor the desired version of the function. It would be nice if there is a nicer way that does not require recompilation.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(11

回眸一笑 2025-01-03 14:08:06

您可以看一下 libdivide,它可以在除数直到运行时才知道时进行快速除法:(libdivide 是一个开源库
用于优化整数除法
)。

如果您使用 a - b * (a / b)(但使用 libdivide)计算 a % b,您可能会发现它更快。

You can take a look at libdivide which works to do fast division when the divisor isn't known until runtime: (libdivide is an open source library
for optimizing integer division
).

If you calculate a % b using a - b * (a / b) (but with libdivide) you might find that it's faster.

瑕疵 2025-01-03 14:08:06

我在我的系统上运行了您的 program_variable 代码以获得性能基线:

$ gcc -Wall test1.c
$ time ./a.out 1000 10
50004989999950500

real    0m55.531s
user    0m55.484s
sys     0m0.033s

如果我使用 -O3 编译 test1.c,那么我会得到:

$ time ./a.out 1000 10
50004989999950500

real    0m54.305s
user    0m54.246s
sys     0m0.030s

在第三个测试中,我手动设置 limitskip 的值:

int limit = 1000, skip = 10;

然后重新运行测试:

$ gcc -Wall test2.c
$ time ./a.out
50004989999950500

real    0m54.312s
user    0m54.282s
sys     0m0.019s

取出 atoi()调用并没有多大区别。但是,如果我在打开 -O3 优化的情况下进行编译,那么我会遇到一个减速带:

$ gcc -Wall -O3 test2.c
$ time ./a.out
50004989999950500

real    0m26.756s
user    0m26.724s
sys     0m0.020s

#define 宏。 blogspot.com/2008/04/atoi-vs-custom-string-to-int-convertion.html" rel="nofollow">ersatz atoi() 函数 有一点帮助,但没有做很多:

#define QSaToi(iLen, zString, iOut) {int j = 1; iOut = 0; \
    for (int i = iLen - 1; i >= 0; --i) \
        { iOut += ((zString[i] - 48) * j); \
        j = j*10;}}

...
int limit, skip;
QSaToi(4, argv[1], limit);
QSaToi(2, argv[2], skip);

和测试:

$ gcc -Wall -O3 -std=gnu99 test3.c
$ time ./a.out 1000 10
50004989999950500

real    0m53.514s
user    0m53.473s
sys     0m0.025s

昂贵的部分似乎是那些 atoi() 调用,如果这是 -O3 编译之间的唯一区别。

也许您可以编写一个二进制文件,它循环测试 limitskip 的各种值,例如:

#define NUM_LIMITS 3
#define NUM_SKIPS 2
...
int limits[NUM_LIMITS] = {100, 1000, 1000};
int skips[NUM_SKIPS] = {1, 10};
int limit, skip;
...
for (int limitIdx = 0; limitIdx < NUM_LIMITS; limitIdx++)
    for (int skipIdx = 0; skipIdx < NUM_SKIPS; skipIdx++)
        /* per-limit, per-skip test */

如果您在编译之前知道参数,也许您可​​以这样做就这样。如果您希望将结果保存在单独的文件中,您可以使用 fprintf() 将输出写入按限制、按跳过的文件输出。

I ran your program_variable code on my system to get a baseline of performance:

$ gcc -Wall test1.c
$ time ./a.out 1000 10
50004989999950500

real    0m55.531s
user    0m55.484s
sys     0m0.033s

If I compile test1.c with -O3, then I get:

$ time ./a.out 1000 10
50004989999950500

real    0m54.305s
user    0m54.246s
sys     0m0.030s

In a third test, I manually set the values of limit and skip:

int limit = 1000, skip = 10;

I then re-run the test:

$ gcc -Wall test2.c
$ time ./a.out
50004989999950500

real    0m54.312s
user    0m54.282s
sys     0m0.019s

Taking out the atoi() calls doesn't make much of a difference. But if I compile with -O3 optimizations turned on, then I get a speed bump:

$ gcc -Wall -O3 test2.c
$ time ./a.out
50004989999950500

real    0m26.756s
user    0m26.724s
sys     0m0.020s

Adding a #define macro for an ersatz atoi() function helped a little, but didn't do much:

#define QSaToi(iLen, zString, iOut) {int j = 1; iOut = 0; \
    for (int i = iLen - 1; i >= 0; --i) \
        { iOut += ((zString[i] - 48) * j); \
        j = j*10;}}

...
int limit, skip;
QSaToi(4, argv[1], limit);
QSaToi(2, argv[2], skip);

And testing:

$ gcc -Wall -O3 -std=gnu99 test3.c
$ time ./a.out 1000 10
50004989999950500

real    0m53.514s
user    0m53.473s
sys     0m0.025s

The expensive part seems to be those atoi() calls, if that's the only difference between -O3 compilation.

Perhaps you could write one binary, which loops through tests of various values of limit and skip, something like:

#define NUM_LIMITS 3
#define NUM_SKIPS 2
...
int limits[NUM_LIMITS] = {100, 1000, 1000};
int skips[NUM_SKIPS] = {1, 10};
int limit, skip;
...
for (int limitIdx = 0; limitIdx < NUM_LIMITS; limitIdx++)
    for (int skipIdx = 0; skipIdx < NUM_SKIPS; skipIdx++)
        /* per-limit, per-skip test */

If you know your parameters ahead of compilation time, perhaps you can do it this way. You could use fprintf() to write your output to a per-limit, per-skip file output, if you want results in separate files.

你的心境我的脸 2025-01-03 14:08:06

You could try using the GCC likely/unlikely builtins (e.g. here) or profile guided optimization (e.g. here). Also, do you intend (i + j) % 10 or i + (j % 10)? The % operator has higher precedence, so your code as written is testing the latter.

浴红衣 2025-01-03 14:08:06

我对尼尔斯询问的项目有点熟悉。
周围有很多有趣的答案(谢谢),但答案稍微偏离了问题的精神。给出的示例程序实际上只是示例程序。受预处理器语句影响的逻辑要复杂得多。 最终,它不仅仅是执行模运算或简单的除法。它是关于保留或跳过某些过程调用、在其他两个操作之间执行一个操作等、定义数组的大小等。

所有这些事情都可以通过命令行参数设置的变量来保护。但这成本太高,因为许多例程、语句、内存分配都被执行了十亿次。也许这能更好地解决问题。对你的想法还是很感兴趣。

短剑

I'm a bit familiar with the program Niels is asking about.
There are a bunch of interesting answers around (thanks), but the answers slightly miss the spirit of the question. The given example programs are really just example programs. The logic that is subject to pre-processor statements is much much more involved. In the end, it is not just about executing a modulo operation or a simple division. it is about keeping or skipping certain procedure calls, executing an operation between two other operations etc, defining the size of an array, etc.

All these things could be guarded by variables that are set by command-line parameters. But that would be too costly as many of these routines, statements, memory allocations are executed a billion times. Perhaps that shapes the problem a bit better. Still very interested in your ideas.

Dirk

走野 2025-01-03 14:08:06

如果您使用C++而不是C,您可以使用模板,以便可以在编译时计算内容,甚至可以使用递归。
请看一下 C++ 模板元编程

If you would use C++ instead of C you could use templates so that things can be calculated at compile time, even recursions are possible.
Please have a look at C++ template meta programming.

临走之时 2025-01-03 14:08:06

一个愚蠢的答案,但您可以在 gcc 命令行上传递定义,并使用 shell 脚本运行整个过程,该脚本根据命令行参数重新编译并运行程序

#!/bin/sh
skip=$1
out=program_skip$skip
if [ ! -x $out ]; then 
    gcc -O3 -Dskip=$skip -o $out test.c
fi
time $out 1000

A stupid answer, but you could pass the define on the gcc command line and run the whole thing with a shell script that recompiles and runs the program based on a command-line parameter

#!/bin/sh
skip=$1
out=program_skip$skip
if [ ! -x $out ]; then 
    gcc -O3 -Dskip=$skip -o $out test.c
fi
time $out 1000
淑女气质 2025-01-03 14:08:06

我还发现program_define 和program_variable 之间的速度减慢了大约2 倍,分别为26.2 秒和49.0 秒。然后,我尝试

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char** argv) {
    int i, j, r;
    long result = 0;

    int limit = atoi(argv[1]);
    int skip = atoi(argv[2]);

    for (i = 0; i < 10000000; ++i) {
        for (j = 0, r = 0; j < limit; ++j, ++r) {
            if (r == skip) r = 0;
            if (i + r == 0) {
                continue;
            }
            result  += i + j;
        }
    }

    printf("%lu\n", result);
    return 0;
}

使用额外的变量来避免代价高昂的除法,结果时间为 18.9 秒,明显优于静态已知常量的模数。然而,只有当变化易于预测时,这种辅助变量技术才有希望。

I got also an about 2× slowdown between program_define and program_variable, 26.2s vs. 49.0s. I then tried

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char** argv) {
    int i, j, r;
    long result = 0;

    int limit = atoi(argv[1]);
    int skip = atoi(argv[2]);

    for (i = 0; i < 10000000; ++i) {
        for (j = 0, r = 0; j < limit; ++j, ++r) {
            if (r == skip) r = 0;
            if (i + r == 0) {
                continue;
            }
            result  += i + j;
        }
    }

    printf("%lu\n", result);
    return 0;
}

using an extra variable to avoid the costly division, and the resulting time was 18.9s, so significantly better than the modulo with a statically known constant. However, this auxiliary-variable technique is only promising if the change is easily predictable.

一萌ing 2025-01-03 14:08:06

另一种可能性是消除使用模运算符:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char** argv) {
    int i, j;
    long result = 0;

    int limit = atoi(argv[1]);
    int skip = atoi(argv[2]);
    int current = 0;

    for (i = 0; i < 10000000; ++i) {
        for (j = 0; j < limit; ++j) {
            if (++current == skip) {
                current = 0;
                continue;
            }
            result  += i + j;
        }
    }

    printf("%lu\n", result);
    return 0;
}

Another possibility would be to eliminate using the modulus operator:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char** argv) {
    int i, j;
    long result = 0;

    int limit = atoi(argv[1]);
    int skip = atoi(argv[2]);
    int current = 0;

    for (i = 0; i < 10000000; ++i) {
        for (j = 0; j < limit; ++j) {
            if (++current == skip) {
                current = 0;
                continue;
            }
            result  += i + j;
        }
    }

    printf("%lu\n", result);
    return 0;
}
清晨说晚安 2025-01-03 14:08:06

如果这是实际的代码,您可以通过几种方法对其进行优化:

(i + j % 10==0) 仅当 i==0 时为 true,因此当 i>0 时,您可以跳过整个 mod 操作。另外,由于 i + j 在每个循环中仅增加 1,因此您可以将 mod 提升出来,只需在遇到 skip 时递增并重置一个变量(如其他答案中已经指出)。

If that is the actual code, you have a few ways to optimize it:

(i + j % 10==0) is only true when i==0, so you can skip that entire mod operation when i>0. Also, since i + j only increases by 1 on each loop, you can hoist the mod out and simply have a variable you increment and reset when it hits skip (as has been pointed out in other answers).

天涯离梦残月幽梦 2025-01-03 14:08:06

您还可以在程序中拥有所有可能的函数实现,并在运行时更改函数指针以选择您实际正在使用的函数。

您可以使用来避免必须编写重复的代码:

#define MYFUNCMACRO(name, myvar) void name##doit(){/* time consuming code using myvar */}

MYFUNCMACRO(TEN,10)
MYFUNCMACRO(TWENTY,20)
MYFUNCMACRO(FOURTY,40)
MYFUNCMACRO(FIFTY,50)

如果您需要太多这样的宏(数百个?),您可以编写一个代码生成器,它会自动为一定范围的内容写入cpp文件价值观。

我没有编译也没有测试代码,但也许你看到了原理。

You can also have all possible function implementations already in the program, and at runtime you change the function pointer to select the function which you are actually are using.

You can use macros to avoid that you have to write duplicate code:

#define MYFUNCMACRO(name, myvar) void name##doit(){/* time consuming code using myvar */}

MYFUNCMACRO(TEN,10)
MYFUNCMACRO(TWENTY,20)
MYFUNCMACRO(FOURTY,40)
MYFUNCMACRO(FIFTY,50)

If you need to have too many of these macros (hundreds?) you can write a codegenerator which writes the cpp file automatically for a range of values.

I didn't compile nor test the code, but maybe you see the principle.

戈亓 2025-01-03 14:08:06

您可能在没有优化的情况下进行编译,这将导致您的程序在每次检查时加载跳过,而不是文字 10。尝试将 -O2 添加到编译器的命令行,和/或使用

register int skip;

You might be compiling without optimisation, which will lead your program to load skip each time it's checked, instead of the literal of 10. Try adding -O2 to your compiler's command line, and/or use

register int skip;
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文