Performance impact of global variables (C, C++)

Posted 2024-10-20 06:52:43

I'm currently developing a very fast algorithm, one part of which is an extremely fast scanner and statistics function.
In this quest, I'm after any performance benefit.
I'm also interested in keeping the code multi-thread friendly.

Now for the question:
I've noticed that putting some very frequently accessed variables and arrays into "global" or "static local" storage (which has the same effect) gives a measurable performance benefit (in the range of +10%).
I'm trying to understand why, and to find a way around it, since I would prefer to avoid these kinds of allocation.
Note that I don't think the difference comes from allocation itself, since allocating a few variables and a small array on the stack is almost instantaneous. I believe the difference comes from accessing and modifying the data.

While searching, I found this old post on Stack Overflow:
C++ performance of global variables

But I'm very disappointed by the answers there. They offer very little explanation, mostly ranting about "you should not do that" (hey, that's not the question!) and very rough statements like "it doesn't affect performance", which is obviously incorrect, since I'm measuring it with precise benchmark tools.

As said above, I'm looking for an explanation and, if one exists, a solution to this issue. So far, my feeling is that calculating the memory address of a local (automatic) variable costs a bit more than for a global (or local static) one. Maybe something like the cost of one extra ADD operation. But that doesn't help in finding a solution...

Comments (4)

在你怀里撒娇 2024-10-27 06:52:43

It really depends on your compiler, platform, and other details. However, I can describe one scenario where global variables are faster.

In many cases, a global variable is at a fixed offset. This allows the generated instructions to simply use that address directly. (Something along the lines of MOV AX,[MyVar].)

However, if you have a variable that's relative to the current stack pointer or a member of a class or array, some math is required to take the address of the array and determine the address of the actual variable.

Obviously, if you need to place some sort of mutex on your global variable to keep it thread-safe, the locking will almost certainly cost you more than any performance gain.
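One way to keep a fixed-storage array without a mutex is thread-local storage, so that each thread owns its own copy. The following is only a sketch under assumptions: `tls_stats` and `count_bytes_tls` are hypothetical names, the input is assumed to be a byte buffer, and C11 `_Thread_local` is assumed to be available. Whether it keeps the measured gain depends on the platform, since thread-local access can itself need an extra segment-relative or runtime lookup.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical per-thread histogram: static storage duration,
   but one separate instance per thread, so no lock is needed. */
static _Thread_local unsigned tls_stats[256];

/* Counts byte values in p[0..n) and returns the calling thread's table.
   The table is reset on every call, so results do not accumulate. */
const unsigned *count_bytes_tls(const unsigned char *p, size_t n)
{
    const unsigned char *end = p + n;

    memset(tls_stats, 0, sizeof tls_stats);
    while (p < end)
        tls_stats[*p++]++;
    return tls_stats;
}
```

The returned pointer is only meaningful in the thread that made the call, and its contents are overwritten by that thread's next call.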

盗梦空间 2024-10-27 06:52:43

Creating local variables can be literally free if they are POD types. You are likely overflowing a cache line with too many stack variables, or hitting some similar alignment-based effect that is very specific to your piece of code. In my experience, non-local variables usually decrease performance significantly.

﹎☆浅夏丿初晴 2024-10-27 06:52:43

It's hard to beat static allocation for speed, and while 10% is a pretty small difference, it could be due to address calculation.

But if you're looking for speed, the example from your comment, while(p<end)stats[*p++]++;, is an obvious candidate for unrolling, such as:

static int stats[M];
static int index_array[N];
int *p = index_array, *pend = p+N;
// ... initialize the arrays ...
while (pend - p >= 8){  // avoids forming pend-8 when N < 8
  stats[p[0]]++;
  stats[p[1]]++;
  stats[p[2]]++;
  stats[p[3]]++;
  stats[p[4]]++;
  stats[p[5]]++;
  stats[p[6]]++;
  stats[p[7]]++;
  p += 8;
}
while(p<pend) stats[*p++]++;

Don't count on the compiler to do it for you. It might or might not be able to figure it out.

Other possible optimizations come to mind, but they depend on what you're actually trying to do.
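Before timing an unrolled loop, it is worth checking that it produces the same counts as the simple one. Here is a minimal sketch of such a check. The function names and the choice to pass the table as a parameter are assumptions made for illustration, and every index in the input is assumed to be within the bounds of `stats`.

```c
#include <stddef.h>

/* Reference version: the simple loop from the question. */
void histogram_simple(const int *p, size_t n, int *stats)
{
    const int *end = p + n;

    while (p < end)
        stats[*p++]++;
}

/* Unrolled by 8; the second loop handles the remaining 0..7 items. */
void histogram_unrolled(const int *p, size_t n, int *stats)
{
    const int *end = p + n;

    while (end - p >= 8) {
        stats[p[0]]++;
        stats[p[1]]++;
        stats[p[2]]++;
        stats[p[3]]++;
        stats[p[4]]++;
        stats[p[5]]++;
        stats[p[6]]++;
        stats[p[7]]++;
        p += 8;
    }
    while (p < end)
        stats[*p++]++;
}
```

Run both on the same input with two separate zeroed tables and compare the tables element by element. Lengths that are not multiples of 8, and lengths below 8, exercise the tail loop.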

给我一枪 2024-10-27 06:52:43

If you have something like

int stats[256]; while (p<end) stats[*p++]++;

static int stats[256]; while (p<end) stats[*p++]++;

you are not really comparing the same thing, because in the first case the array is not initialized. Written out explicitly, the second line is equivalent to

static int stats[256] = { 0 }; while (p<end) stats[*p++]++;

So for a fair comparison, the first should read

int stats[256] = { 0 }; while (p<end) stats[*p++]++;

Your compiler may be able to deduce much more if it knows the variables are in a known state.

That said, the static case could still have a runtime advantage, since its initialization is done at compile time (or at program startup) rather than on every call.

To test whether this accounts for your difference, run the same function, with the static declaration and the loop, many times, and see whether the difference vanishes as the number of invocations grows.

But as others have already said, the best approach is to inspect the assembly your compiler produces and see what actually differs in the generated code.
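A like-for-like comparison could be set up roughly as follows. This is a sketch, not a rigorous benchmark: `scan_local`, `scan_static`, and `time_calls` are hypothetical names, the input is assumed to be a byte buffer, and `clock()` is a coarse timer. Both variants zero their table on every call so that they do the same work, which is a slightly different setup from letting the static table be initialized only once.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Automatic array, explicitly zero-initialized on every call. */
unsigned scan_local(const unsigned char *p, size_t n)
{
    unsigned stats[256] = { 0 };
    const unsigned char *end = p + n;

    while (p < end)
        stats[*p++]++;
    return stats[0];    /* return something so the work is observable */
}

/* Static array, reset on every call so both variants do the same work. */
unsigned scan_static(const unsigned char *p, size_t n)
{
    static unsigned stats[256];
    const unsigned char *end = p + n;

    memset(stats, 0, sizeof stats);
    while (p < end)
        stats[*p++]++;
    return stats[0];
}

/* Calls fn repeatedly and returns the elapsed CPU time in seconds. */
double time_calls(unsigned (*fn)(const unsigned char *, size_t),
                  const unsigned char *buf, size_t n, int calls)
{
    volatile unsigned sink = 0;   /* keeps the calls from being optimized away */
    clock_t t0 = clock();

    for (int i = 0; i < calls; i++)
        sink = sink + fn(buf, n);
    (void)sink;
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

Timing each variant with increasing values of `calls` and with realistic buffer sizes shows whether the gap is a fixed per-call cost or something that scales with the loop itself.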
