Can static local variables cut down on memory allocation time?

Posted 2024-09-19 06:12:16 · 271 words · 3 views · 0 comments

Suppose I have a function in a single-threaded program that looks like this:

void f(some arguments){
    char buffer[32];
    some operations on buffer;
}

and f appears inside some loop that gets called often, so I'd like to make it as fast as possible. It looks to me like the buffer needs to get allocated every time f is called, but if I declare it to be static, this wouldn't happen. Is that correct reasoning? Is that a free speed up? And just because of that fact (that it's an easy speed up), does an optimizing compiler already do something like this for me?

Comments (11)

撩起发的微风 2024-09-26 06:12:17

No, it's not a free speedup.

First, the allocation is almost free to begin with (it consists merely of adjusting the stack pointer by 32 bytes), and second, there are at least two reasons why a static variable might be slower:

  • You lose cache locality. Data on the stack is very likely already in the CPU cache, so accessing it is extremely cheap. Static data lives in a different area of memory, so it may not be cached; in that case the access causes a cache miss, and you can wait hundreds of clock cycles for the data to be fetched from main memory.
  • You lose thread safety. If two threads execute the function simultaneously, it'll crash and burn, unless a lock is placed so only one thread at a time is allowed to execute that section of the code. And that would mean you'd lose the benefit of having multiple CPU cores.

So it's not a free speedup. But it is possible that it is faster in your case (although I doubt it).
So try it out, benchmark it, and see what works best in your particular scenario.
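
One way to run that comparison is sketched below. It is a minimal timing harness, not a rigorous benchmark: the function names (work_automatic, work_static, time_calls) and the checksum workload are made up for illustration, and the results depend heavily on the compiler, the optimization level, and the machine.

```cpp
#include <chrono>
#include <cstddef>
#include <cstring>

// Hypothetical workload: fill a 32-byte buffer and fold it into a checksum,
// so the compiler cannot simply discard the buffer.
unsigned work_automatic(unsigned seed) {
    char buffer[32];  // automatic storage: part of the stack frame
    std::memset(buffer, static_cast<int>(seed & 0x7f), sizeof buffer);
    unsigned sum = 0;
    for (std::size_t i = 0; i < sizeof buffer; ++i)
        sum += static_cast<unsigned char>(buffer[i]);
    return sum;
}

// Same workload, but the buffer has static storage: one copy shared by all calls.
unsigned work_static(unsigned seed) {
    static char buffer[32];
    std::memset(buffer, static_cast<int>(seed & 0x7f), sizeof buffer);
    unsigned sum = 0;
    for (std::size_t i = 0; i < sizeof buffer; ++i)
        sum += static_cast<unsigned char>(buffer[i]);
    return sum;
}

// Calls fn `calls` times and returns the elapsed time in nanoseconds.
// `sink` accumulates the results so the loop has an observable effect.
template <typename Fn>
long long time_calls(Fn fn, unsigned calls, unsigned& sink) {
    auto start = std::chrono::steady_clock::now();
    for (unsigned i = 0; i < calls; ++i)
        sink += fn(i);
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}
```

Calling time_calls(work_automatic, n, sink) and time_calls(work_static, n, sink) with the same n, in an optimized build and over several runs, gives a rough picture. Small differences are likely to be noise rather than a real effect.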

絕版丫頭 2024-09-26 06:12:17

Adjusting the stack pointer by 32 bytes will cost virtually nothing on nearly all systems. But you should test it out. Benchmark a static version and a local version and post back.

焚却相思 2024-09-26 06:12:17

For implementations that use a stack for local variables, allocation usually involves advancing a register (adding a value to it), such as the Stack Pointer (SP) register. The time this takes is negligible, usually one instruction or less.

However, initialization of stack variables takes a little longer, but again, not much. Check out your assembly language listing (generated by compiler or debugger) for exact details. There is nothing in the standard about the duration or number of instructions required to initialize variables.

Allocation of static local variables is usually treated differently. A common approach is to place these variables in the same area as global variables. Usually all the variables in this area are initialized before calling main(). Allocation in this case is a matter of assigning addresses to registers or storing the area information in memory. Not much execution time wasted here.

Dynamic allocation is the case where execution cycles are burned. But that is not in the scope of your question.

分开我的手 2024-09-26 06:12:17

The way it is written now, there is no cost for allocation: the 32 bytes are on the stack. The only real work arises if you need to zero-initialize the buffer.

A local static is not a good idea here. It won't be faster, and your function can no longer be used from multiple threads, as all calls share the same buffer. Not to mention that, before C++11, initialization of local statics was not guaranteed to be thread-safe.
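
The "all calls share the same buffer" point shows up even without threads. The sketch below uses two hypothetical helpers (label_static and label_automatic, not from the question) that format a short label into a 32-byte buffer. The static version hands every caller a pointer to the same storage, so a later call silently overwrites an earlier result.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: formats into a static buffer and returns a pointer to it.
// Every call writes into, and returns, the very same 32 bytes.
const char* label_static(int id) {
    static char buffer[32];
    std::snprintf(buffer, sizeof buffer, "item-%d", id);
    return buffer;
}

// Same formatting with an automatic buffer; the result is copied out
// before the stack frame goes away, so each caller owns its own value.
std::string label_automatic(int id) {
    char buffer[32];
    std::snprintf(buffer, sizeof buffer, "item-%d", id);
    return std::string(buffer);
}
```

After `const char* a = label_static(1); const char* b = label_static(2);` both pointers refer to the same buffer, which now holds "item-2". The automatic version keeps "item-1" and "item-2" separate.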

只是在用心讲痛 2024-09-26 06:12:17

I would suggest that a more general approach to this problem is that if you have a function called many times that needs some local variables, then consider wrapping it in a class and making these variables data members. Consider the case where the size has to be dynamic, so instead of char buffer[32] you have std::vector<char> buffer(requiredSize). That is more expensive than an array to initialise every time through the loop, so constructing it once pays off:

class BufferMunger {
public:
   BufferMunger() {};
   void DoFunction(args);
private:
   char buffer[32];
};

BufferMunger m;
for (int i=0; i<1000; i++) {
   m.DoFunction(arg[i]);  // only one allocation of buffer
}

There's another implication of making the buffer static, which is that the function is now unsafe in a multithreaded application, as two threads may call it and overwrite the data in the buffer at the same time. On the other hand it's safe to use a separate BufferMunger in each thread that requires it.
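
A compilable variant of the class above might look like the following. The body of DoFunction and the contents() accessor are assumptions added only so the sketch does something observable (the original leaves the operation unspecified): it copies up to 31 characters into the member buffer.

```cpp
#include <cstddef>
#include <cstring>

class BufferMunger {
public:
    // Hypothetical operation: copy at most 31 characters of `text` into the
    // member buffer, NUL-terminate it, and return the number of bytes copied.
    std::size_t DoFunction(const char* text) {
        std::size_t n = std::strlen(text);
        if (n >= sizeof buffer_) n = sizeof buffer_ - 1;
        std::memcpy(buffer_, text, n);
        buffer_[n] = '\0';
        return n;
    }

    // Read-only view of the buffer (added for illustration).
    const char* contents() const { return buffer_; }

private:
    char buffer_[32] = {};  // one buffer per object, not one per program
};
```

Because the buffer belongs to the object, two BufferMunger instances never interfere with each other, which is what makes the one-instance-per-thread arrangement safe.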

青柠芒果 2024-09-26 06:12:17

Note that block-level static variables in C++ (as opposed to C) are initialized on first use. This implies that you'll be introducing the cost of an extra runtime check. The branch potentially could end up making performance worse, not better. (But really, you should profile, as others have mentioned.)

Regardless, I don't think it's worth it, especially since you'd be intentionally sacrificing re-entrancy.
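
The first-use behaviour can be observed with a small sketch. All names here (g_init_runs, make_initial_value, next_value, scratch_buffer) are invented for the illustration. The initializer has a side effect, so it has to run at run time, and the counter shows it runs exactly once.

```cpp
// Counts how many times the initializer below actually runs.
int g_init_runs = 0;

// Hypothetical initializer with a side effect, so the value cannot be
// computed at compile time and must be produced on first use.
int make_initial_value() {
    ++g_init_runs;
    return 42;
}

int next_value() {
    // Initialized the first time control passes through this declaration.
    // Later calls skip the initializer, which typically means checking a
    // hidden "already initialized" guard on every call.
    static int value = make_initial_value();
    return value++;
}

char* scratch_buffer() {
    // No dynamic initializer: the array is simply zero-initialized,
    // so no run-time guard is required for it.
    static char buffer[32];
    return buffer;
}
```

Calling next_value() twice returns 42 and then 43, and g_init_runs stays at 1. A plain static char array like the one in the question falls into the second category, where there is no initializer to guard.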

英雄似剑 2024-09-26 06:12:17

If you are writing code for a PC, there is unlikely to be any meaningful speed advantage either way. On some embedded systems, it may be advantageous to avoid all local variables. On some other systems, local variables may be faster.

An example of the former: on the Z80, the code to set up the stack frame for a function with any local variables was pretty long. Further, the code to access local variables was limited to using the (IX+d) addressing mode, which was only available for 8-bit instructions. If X and Y were both global/static or both local variables, the statement "X=Y" could assemble as either:

; If both are static or global: 6 bytes; 32 cycles
  ld HL,(_Y) ; 16 cycles
  ld (_X),HL ; 16 cycles
; If both are local: 12 bytes; 56 cycles
  ld E,(IX+_Y)   ; 14 cycles
  ld D,(IX+_Y+1) ; 14 cycles
  ld (IX+_X),E   ; 14 cycles
  ld (IX+_X+1),D ; 14 cycles

A 100% code space penalty and 75% time penalty in addition to the code and time to set up the stack frame!

On the ARM processor, a single instruction can load a variable which is located within +/-2K of an address pointer. If a function's local variables total 2K or less, they may be accessed with a single instruction. Global variables will generally require two or more instructions to load, depending upon where they are stored.

流年里的时光 2024-09-26 06:12:17

With gcc, I do see some speedup:

void f() {
    char buffer[4096];
}

int main() {
    int i;
    for (i = 0; i < 100000000; ++i) {
        f();
    }
}

And the time:

$ time ./a.out

real    0m0.453s
user    0m0.450s
sys  0m0.010s

changing buffer to static:

$ time ./a.out

real    0m0.352s
user    0m0.360s
sys  0m0.000s

半山落雨半山空 2024-09-26 06:12:17

Depending on what exactly the variable does and how it's used, the speedup ranges from almost nothing to nothing. On x86 systems, stack memory for all local variables is allocated at once with a single instruction (sub esp, amount), so having just one other stack variable eliminates any gain. The only exception is a really huge buffer, in which case the compiler might insert a call to _chkstk to allocate the memory (but if your buffer is that big, you should re-evaluate your code). The compiler cannot turn stack memory into static memory as an optimization, because it cannot assume that the function will be used in a single-threaded environment; it would also interfere with object constructors, destructors, and so on.

得不到的就毁灭 2024-09-26 06:12:17

If there are any local automatic variables in the function at all, the stack pointer needs to be adjusted. The time taken for the adjustment is constant, and will not vary based on the number of variables declared. You might save some time if your function is left with no local automatic variables whatsoever.

If a static variable is initialized, there will be a flag somewhere to determine if the variable has already been initialized. Checking the flag will take some time. In your example the variable is not initialized, so this part can be ignored.

Static variables should be avoided if your function has any chance of being called recursively or from two different threads.

下雨或天晴 2024-09-26 06:12:17

It will make the function substantially slower in most real cases. The static data segment is not near the stack, so you lose cache locality and are likely to get a cache miss when you access it. When you allocate a regular char[32] on the stack, however, it sits right next to all your other needed data and costs very little to access. The initialization cost of a stack-based array of char is negligible.

This is ignoring that statics have many other problems.

You really need to actually profile your code and see where the slowdowns are, because no profiler will tell you that allocating a statically-sized buffer of characters is a performance problem.
