我们正在开发一种模型检查工具,它可以执行某些搜索例程数十亿次。我们有不同的搜索例程,当前使用预处理器指令选择这些例程。这不仅非常不方便,因为我们每次做出不同的选择时都需要重新编译,而且还使代码难以阅读。现在是时候开始新版本了,我们正在评估是否可以避免条件编译。
这是一个非常人为的示例,显示了效果:
/* program_define */
#include <stdio.h>
#include <stdlib.h>
#define skip 10
int main(int argc, char** argv) {
int i, j;
long result = 0;
int limit = atoi(argv[1]);
for (i = 0; i < 10000000; ++i) {
for (j = 0; j < limit; ++j) {
if (i + j % skip == 0) {
continue;
}
result += i + j;
}
}
printf("%lu\n", result);
return 0;
}
这里,变量 skip
是影响程序行为的值的示例。不幸的是,每次我们想要一个新的 skip
值时,我们都需要重新编译。
让我们看看该程序的另一个版本:
/* program_variable */
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char** argv) {
int i, j;
long result = 0;
int limit = atoi(argv[1]);
int skip = atoi(argv[2]);
for (i = 0; i < 10000000; ++i) {
for (j = 0; j < limit; ++j) {
if (i + j % skip == 0) {
continue;
}
result += i + j;
}
}
printf("%lu\n", result);
return 0;
}
这里,skip
的值作为命令行参数传递。这增加了很大的灵活性。然而,这个程序要慢得多:
$ time ./program_define 1000 10
50004989999950500
real 0m25.973s
user 0m25.937s
sys 0m0.019s
与
$ time ./program_variable 1000 10
50004989999950500
real 0m50.829s
user 0m50.738s
sys 0m0.042s
我们正在寻找的是一种将值传递到程序中的有效方法(通过命令行参数或文件输入),并且以后永远不会改变。有没有办法优化代码(或告诉编译器)以使其运行更有效?
非常感谢任何帮助!
评论:
正如 Dirk 在他的评论中所写,这与具体示例无关。我的意思是一种替换 if
的方法,它评估一个在被调用数十亿次的函数内设置一次然后从未更改的变量(例如命令行选项) 通过更有效的构造。我们目前使用预处理器来定制所需的函数版本。如果有更好的方法,不需要重新编译,那就太好了。
we are working on a model checking tool which executes certain search routines several billion times. We have different search routines which are currently selected using preprocessor directives. This is not only very unhandy as we need to recompile every time we make a different choice, but also makes the code hard to read. It's now time to start a new version and we are evaluating whether we can avoid conditional compilation.
Here is a very artificial example that shows the effect:
/* program_define */
#include <stdio.h>
#include <stdlib.h>
#define skip 10
int main(int argc, char** argv) {
int i, j;
long result = 0;
int limit = atoi(argv[1]);
for (i = 0; i < 10000000; ++i) {
for (j = 0; j < limit; ++j) {
if (i + j % skip == 0) {
continue;
}
result += i + j;
}
}
printf("%lu\n", result);
return 0;
}
Here, the variable skip
is an example for a value that influences the behavior of the program. Unfortunately, we need to recompile every time we want a new value of skip
.
Let's look at another version of the program:
/* program_variable */
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char** argv) {
int i, j;
long result = 0;
int limit = atoi(argv[1]);
int skip = atoi(argv[2]);
for (i = 0; i < 10000000; ++i) {
for (j = 0; j < limit; ++j) {
if (i + j % skip == 0) {
continue;
}
result += i + j;
}
}
printf("%lu\n", result);
return 0;
}
Here, the value for skip
is passed as a command line parameter. This adds great flexibility. However, this program is much slower:
$ time ./program_define 1000 10
50004989999950500
real 0m25.973s
user 0m25.937s
sys 0m0.019s
vs.
$ time ./program_variable 1000 10
50004989999950500
real 0m50.829s
user 0m50.738s
sys 0m0.042s
What we are looking for is an efficient way to pass values into a program (by means of a command line parameter or a file input) that will never change afterward. Is there a way to optimize the code (or tell the compiler to) such that it runs more efficiently?
Any help is greatly appreciated!
Comments:
As Dirk wrote in his comment, it is not about the concrete example. What I meant was a way to replace an if
that evaluates a variable that is set once and then never changed (say, a command line option) inside a function that is called literally billions of times by a more efficient construct. We currently use the preprocessor to tailor the desired version of the function. It would be nice if there is a nicer way that does not require recompilation.
发布评论
评论(11)
您可以看一下
libdivide
,它可以在除数直到运行时才知道时进行快速除法:(libdivide 是一个开源库用于优化整数除法)。
如果您使用
a - b * (a / b)
(但使用 libdivide)计算a % b
,您可能会发现它更快。You can take a look at
libdivide
which works to do fast division when the divisor isn't known until runtime: (libdivide is an open source libraryfor optimizing integer division).
If you calculate
a % b
usinga - b * (a / b)
(but with libdivide) you might find that it's faster.我在我的系统上运行了您的
program_variable
代码以获得性能基线:如果我使用
-O3
编译test1.c
,那么我会得到:在第三个测试中,我手动设置
limit
和skip
的值:然后重新运行测试:
取出
atoi()
调用并没有多大区别。但是,如果我在打开-O3
优化的情况下进行编译,那么我会遇到一个减速带:为 #define 宏。 blogspot.com/2008/04/atoi-vs-custom-string-to-int-convertion.html" rel="nofollow">ersatz
atoi()
函数 有一点帮助,但没有做很多:和测试:
昂贵的部分似乎是那些
atoi()
调用,如果这是-O3
编译之间的唯一区别。也许您可以编写一个二进制文件,它循环测试
limit
和skip
的各种值,例如:如果您在编译之前知道参数,也许您可以这样做就这样。如果您希望将结果保存在单独的文件中,您可以使用
fprintf()
将输出写入按限制、按跳过的文件输出。I ran your
program_variable
code on my system to get a baseline of performance:If I compile
test1.c
with-O3
, then I get:In a third test, I manually set the values of
limit
andskip
:I then re-run the test:
Taking out the
atoi()
calls doesn't make much of a difference. But if I compile with-O3
optimizations turned on, then I get a speed bump:Adding a
#define
macro for an ersatzatoi()
function helped a little, but didn't do much:And testing:
The expensive part seems to be those
atoi()
calls, if that's the only difference between-O3
compilation.Perhaps you could write one binary, which loops through tests of various values of
limit
andskip
, something like:If you know your parameters ahead of compilation time, perhaps you can do it this way. You could use
fprintf()
to write your output to a per-limit, per-skip file output, if you want results in separate files.您可以尝试使用 GCC
likely
/unlikely
内置函数(例如 此处)或配置文件引导优化(例如 此处< /a>)。另外,您想要(i + j) % 10
还是i + (j % 10)
?%
运算符具有更高的优先级,因此您编写的代码正在测试后者。You could try using the GCC
likely
/unlikely
builtins (e.g. here) or profile guided optimization (e.g. here). Also, do you intend(i + j) % 10
ori + (j % 10)
? The%
operator has higher precedence, so your code as written is testing the latter.我对尼尔斯询问的项目有点熟悉。
周围有很多有趣的答案(谢谢),但答案稍微偏离了问题的精神。给出的示例程序实际上只是示例程序。受预处理器语句影响的逻辑要复杂得多。 最终,它不仅仅是执行模运算或简单的除法。它是关于保留或跳过某些过程调用、在其他两个操作之间执行一个操作等、定义数组的大小等。
所有这些事情都可以通过命令行参数设置的变量来保护。但这成本太高,因为许多例程、语句、内存分配都被执行了十亿次。也许这能更好地解决问题。对你的想法还是很感兴趣。
短剑
I'm a bit familiar with the program Niels is asking about.
There are a bunch of interesting answers around (thanks), but the answers slightly miss the spirit of the question. The given example programs are really just example programs. The logic that is subject to pre-processor statements is much much more involved. In the end, it is not just about executing a modulo operation or a simple division. it is about keeping or skipping certain procedure calls, executing an operation between two other operations etc, defining the size of an array, etc.
All these things could be guarded by variables that are set by command-line parameters. But that would be too costly as many of these routines, statements, memory allocations are executed a billion times. Perhaps that shapes the problem a bit better. Still very interested in your ideas.
Dirk
如果您使用C++而不是C,您可以使用模板,以便可以在编译时计算内容,甚至可以使用递归。
请看一下 C++ 模板元编程。
If you would use C++ instead of C you could use templates so that things can be calculated at compile time, even recursions are possible.
Please have a look at C++ template meta programming.
一个愚蠢的答案,但您可以在 gcc 命令行上传递定义,并使用 shell 脚本运行整个过程,该脚本根据命令行参数重新编译并运行程序
A stupid answer, but you could pass the define on the gcc command line and run the whole thing with a shell script that recompiles and runs the program based on a command-line parameter
我还发现program_define 和program_variable 之间的速度减慢了大约2 倍,分别为26.2 秒和49.0 秒。然后,我尝试
使用额外的变量来避免代价高昂的除法,结果时间为 18.9 秒,明显优于静态已知常量的模数。然而,只有当变化易于预测时,这种辅助变量技术才有希望。
I got also an about 2× slowdown between program_define and program_variable, 26.2s vs. 49.0s. I then tried
using an extra variable to avoid the costly division, and the resulting time was 18.9s, so significantly better than the modulo with a statically known constant. However, this auxiliary-variable technique is only promising if the change is easily predictable.
另一种可能性是消除使用模运算符:
Another possibility would be to eliminate using the modulus operator:
如果这是实际的代码,您可以通过几种方法对其进行优化:
(i + j % 10==0)
仅当i==0
时为 true,因此当i>0
时,您可以跳过整个 mod 操作。另外,由于i + j
在每个循环中仅增加 1,因此您可以将 mod 提升出来,只需在遇到skip
时递增并重置一个变量(如其他答案中已经指出)。If that is the actual code, you have a few ways to optimize it:
(i + j % 10==0)
is only true wheni==0
, so you can skip that entire mod operation wheni>0
. Also, sincei + j
only increases by 1 on each loop, you can hoist the mod out and simply have a variable you increment and reset when it hitsskip
(as has been pointed out in other answers).您还可以在程序中拥有所有可能的函数实现,并在运行时更改函数指针以选择您实际正在使用的函数。
您可以使用宏来避免必须编写重复的代码:
如果您需要太多这样的宏(数百个?),您可以编写一个代码生成器,它会自动为一定范围的内容写入cpp文件价值观。
我没有编译也没有测试代码,但也许你看到了原理。
You can also have all possible function implementations already in the program, and at runtime you change the function pointer to select the function which you are actually are using.
You can use macros to avoid that you have to write duplicate code:
If you need to have too many of these macros (hundreds?) you can write a codegenerator which writes the cpp file automatically for a range of values.
I didn't compile nor test the code, but maybe you see the principle.
您可能在没有优化的情况下进行编译,这将导致您的程序在每次检查时加载跳过,而不是文字 10。尝试将 -O2 添加到编译器的命令行,和/或使用
You might be compiling without optimisation, which will lead your program to load skip each time it's checked, instead of the literal of 10. Try adding -O2 to your compiler's command line, and/or use