这个问题听起来可能相当简单,但这是我与另一位合作的开发人员进行的辩论。
我小心翼翼地在可能的地方进行堆栈分配,而不是堆分配它们。他一边跟我说话,一边看着我,并评论说没有必要,因为他们的表现是一样的。
我总是有这样的印象:堆栈的增长是恒定的时间,堆分配的性能取决于堆的当前复杂性,用于分配(找到适当大小的孔)和取消分配(折叠孔以减少碎片,如如果我没有记错的话,许多标准库实现在删除期间需要时间来执行此操作)。
我觉得这可能非常依赖于编译器。特别是对于这个项目,我使用 Metrowerks 编译器来实现 PPC 架构。深入了解这种组合将是最有帮助的,但一般来说,对于 GCC 和 MSVC++ 来说,情况如何?堆分配的性能不如堆栈分配吗?有没有区别?或者差异如此之小以至于变得毫无意义的微观优化。
This question may sound fairly elementary, but this is a debate I had with another developer I work with.
I was taking care to stack allocate things where I could, instead of heap allocating them. He was talking to me and watching over my shoulder and commented that it wasn't necessary because they are the same performance wise.
I was always under the impression that growing the stack was constant time, and heap allocation's performance depended on the current complexity of the heap for both allocation (finding a hole of the proper size) and de-allocating (collapsing holes to reduce fragmentation, as many standard library implementations take time to do this during deletes if I am not mistaken).
This strikes me as something that would probably be very compiler dependent. For this project in particular I am using a Metrowerks compiler for the PPC architecture. Insight on this combination would be most helpful, but in general, for GCC, and MSVC++, what is the case? Is heap allocation not as high performing as stack allocation? Is there no difference? Or are the differences so minute it becomes pointless micro-optimization.
发布评论
评论(24)
堆栈分配速度要快得多,因为它实际上所做的只是移动堆栈指针。
使用内存池,您可以从堆分配中获得相当的性能,但这会稍微增加复杂性并带来一些麻烦。
此外,堆栈与堆不仅是性能考虑因素,也是性能考虑因素。它还告诉您很多有关对象的预期寿命的信息。
Stack allocation is much faster since all it really does is move the stack pointer.
Using memory pools, you can get comparable performance out of heap allocation, but that comes with a slight added complexity and its own headaches.
Also, stack vs. heap is not only a performance consideration; it also tells you a lot about the expected lifetime of objects.
堆栈速度要快得多。在大多数情况下,它实际上只在大多数体系结构上使用一条指令,例如在 x86 上:(
将堆栈指针向下移动 0x10 字节,从而“分配”这些字节以供变量使用。)
当然,堆栈的大小是非常非常有限,因为您很快就会发现是否过度使用堆栈分配或尝试进行递归:-)
此外,没有理由优化不需要可验证的代码的性能,例如通过分析证明的那样。 “过早的优化”通常会导致比其价值更多的问题。
我的经验法则:如果我知道我在编译时需要一些数据,并且它的大小在几百字节以下,我就会对它进行堆栈分配。否则我会对其进行堆分配。
Stack is much faster. It literally only uses a single instruction on most architectures, in most cases, e.g. on x86:
(That moves the stack pointer down by 0x10 bytes and thereby "allocates" those bytes for use by a variable.)
Of course, the stack's size is very, very finite, as you will quickly find out if you overuse stack allocation or try to do recursion :-)
Also, there's little reason to optimize the performance of code that doesn't verifiably need it, such as demonstrated by profiling. "Premature optimization" often causes more problems than it's worth.
My rule of thumb: if I know I'm going to need some data at compile-time, and it's under a few hundred bytes in size, I stack-allocate it. Otherwise I heap-allocate it.
老实说,编写一个程序来比较性能是微不足道的:
据说 愚蠢的一致性是小头脑的大妖精。显然,优化编译器是许多程序员心中的恶魔。这个讨论曾经位于答案的底部,但人们显然不愿意读那么远,所以我将其移到此处以避免出现我已经回答过的问题。
优化编译器可能会注意到这段代码什么都不做,并可能将其全部优化掉。优化器的工作就是做这样的事情,与优化器对抗是愚蠢的差事。
我建议在关闭优化的情况下编译此代码,因为没有好的方法可以欺骗当前使用的每个优化器或者将来会使用的。
任何打开优化器然后抱怨与之对抗的人都应该受到公众的嘲笑。
如果我关心纳秒精度,我就不会使用std::clock()。如果我想将结果作为博士论文发表,我会对此做更多的研究,我可能会比较 GCC、Tendra/Ten15、LLVM、Watcom、Borland、Visual C++、Digital Mars、ICC 和其他编译器。事实上,堆分配比堆栈分配花费的时间长数百倍,而且我认为进一步研究这个问题没有任何用处。
优化器的使命是删除我正在测试的代码。我看不出有任何理由告诉优化器运行,然后试图欺骗优化器不进行实际优化。但如果我看到这样做的价值,我会执行以下一项或多项操作:
将数据成员添加到
空
,并在循环中访问该数据成员;但如果我只从数据成员中读取,优化器可以进行常量折叠并删除循环;如果我只写入数据成员,优化器可能会跳过循环的最后一次迭代之外的所有迭代。此外,问题不是“堆栈分配和数据访问与堆分配和数据访问”。声明
e
易失性
,但是易失性
经常被错误地编译(PDF)。获取循环内
e
的地址(并且可能将其分配给声明为extern
并在另一个文件中定义的变量)。但即使在这种情况下,编译器也可能会注意到——至少在堆栈上——e
将始终分配在相同的内存地址,然后像上面的 (1) 中那样进行常量折叠。我得到了循环的所有迭代,但该对象从未实际分配。除了显而易见的问题之外,该测试还存在缺陷,因为它同时测量分配和释放,并且最初的问题没有询问释放。当然,在堆栈上分配的变量会在其作用域结束时自动释放,因此不调用
delete
会 (1) 扭曲数字(堆栈释放包含在有关堆栈分配的数字中,因此它只是公平地衡量堆释放)和(2)会导致非常严重的内存泄漏,除非我们保留对新指针的引用并在获得时间测量后调用delete
。在我的机器上,在 Windows 上使用 g++ 3.4.4,对于少于 100000 次分配的堆栈和堆分配,我得到“0 个时钟滴答”,即使这样,我也得到“0 个时钟滴答”的堆栈分配和“15 个时钟滴答” “用于堆分配。当我测量 10,000,000 次分配时,堆栈分配需要 31 个时钟周期,堆分配需要 1562 个时钟周期。
是的,优化编译器可能会省略创建空对象。如果我理解正确的话,它甚至可能会省略整个第一个循环。当我将迭代次数提高到 10,000,000 次时,堆栈分配需要 31 个时钟周期,堆分配需要 1562 个时钟周期。我认为可以肯定地说,如果不告诉 g++ 优化可执行文件,g++ 就不会删除构造函数。
自从我写这篇文章以来,Stack Overflow 上的偏好一直是发布优化构建的性能。总的来说,我认为这是正确的。但是,我仍然认为当您实际上不希望优化代码时要求编译器优化代码是愚蠢的。在我看来,这与支付额外的代客泊车费用但拒绝交出钥匙非常相似。在这种特殊情况下,我不希望优化器运行。
使用基准测试的稍微修改版本(以解决原始程序每次通过循环都没有在堆栈上分配某些内容的有效点)并在不进行优化的情况下进行编译,但链接到发布库(以解决我们不这样做的有效点)不想包括因链接到调试库而导致的任何减慢):
显示:
在使用命令行
cl foo.cc /Od /MT /EHsc
编译时在我的系统上。您可能不同意我获得非优化构建的方法。没关系:您可以随意修改基准测试。当我打开优化时,我得到:
不是因为堆栈分配实际上是瞬时的,而是因为任何半体面的编译器都可以注意到
on_stack
没有做任何有用的事情并且可以被优化掉。我的 Linux 笔记本电脑上的 GCC 还注意到on_heap
没有做任何有用的事情,并且也将其优化掉:Honestly, it's trivial to write a program to compare the performance:
It's said that a foolish consistency is the hobgoblin of little minds. Apparently optimizing compilers are the hobgoblins of many programmers' minds. This discussion used to be at the bottom of the answer, but people apparently can't be bothered to read that far, so I'm moving it up here to avoid getting questions that I've already answered.
An optimizing compiler may notice that this code does nothing, and may optimize it all away. It is the optimizer's job to do stuff like that, and fighting the optimizer is a fool's errand.
I would recommend compiling this code with optimization turned off because there is no good way to fool every optimizer currently in use or that will be in use in the future.
Anybody who turns the optimizer on and then complains about fighting it should be subject to public ridicule.
If I cared about nanosecond precision I wouldn't use
std::clock()
. If I wanted to publish the results as a doctoral thesis I would make a bigger deal about this, and I would probably compare GCC, Tendra/Ten15, LLVM, Watcom, Borland, Visual C++, Digital Mars, ICC and other compilers. As it is, heap allocation takes hundreds of times longer than stack allocation, and I don't see anything useful about investigating the question any further.The optimizer has a mission to get rid of the code I'm testing. I don't see any reason to tell the optimizer to run and then try to fool the optimizer into not actually optimizing. But if I saw value in doing that, I would do one or more of the following:
Add a data member to
empty
, and access that data member in the loop; but if I only ever read from the data member the optimizer can do constant folding and remove the loop; if I only ever write to the data member, the optimizer may skip all but the very last iteration of the loop. Additionally, the question wasn't "stack allocation and data access vs. heap allocation and data access."Declare
e
volatile
, butvolatile
is often compiled incorrectly (PDF).Take the address of
e
inside the loop (and maybe assign it to a variable that is declaredextern
and defined in another file). But even in this case, the compiler may notice that -- on the stack at least --e
will always be allocated at the same memory address, and then do constant folding like in (1) above. I get all iterations of the loop, but the object is never actually allocated.Beyond the obvious, this test is flawed in that it measures both allocation and deallocation, and the original question didn't ask about deallocation. Of course variables allocated on the stack are automatically deallocated at the end of their scope, so not calling
delete
would (1) skew the numbers (stack deallocation is included in the numbers about stack allocation, so it's only fair to measure heap deallocation) and (2) cause a pretty bad memory leak, unless we keep a reference to the new pointer and calldelete
after we've got our time measurement.On my machine, using g++ 3.4.4 on Windows, I get "0 clock ticks" for both stack and heap allocation for anything less than 100000 allocations, and even then I get "0 clock ticks" for stack allocation and "15 clock ticks" for heap allocation. When I measure 10,000,000 allocations, stack allocation takes 31 clock ticks and heap allocation takes 1562 clock ticks.
Yes, an optimizing compiler may elide creating the empty objects. If I understand correctly, it may even elide the whole first loop. When I bumped up the iterations to 10,000,000 stack allocation took 31 clock ticks and heap allocation took 1562 clock ticks. I think it's safe to say that without telling g++ to optimize the executable, g++ did not elide the constructors.
In the years since I wrote this, the preference on Stack Overflow has been to post performance from optimized builds. In general, I think this is correct. However, I still think it's silly to ask the compiler to optimize code when you in fact do not want that code optimized. It strikes me as being very similar to paying extra for valet parking, but refusing to hand over the keys. In this particular case, I don't want the optimizer running.
Using a slightly modified version of the benchmark (to address the valid point that the original program didn't allocate something on the stack each time through the loop) and compiling without optimizations but linking to release libraries (to address the valid point that we don't want to include any slowdown caused by linking to debug libraries):
displays:
on my system when compiled with the command line
cl foo.cc /Od /MT /EHsc
.You may not agree with my approach to getting a non-optimized build. That's fine: feel free modify the benchmark as much as you want. When I turn on optimization, I get:
Not because stack allocation is actually instantaneous but because any half-decent compiler can notice that
on_stack
doesn't do anything useful and can be optimized away. GCC on my Linux laptop also notices thaton_heap
doesn't do anything useful, and optimizes it away as well:关于 Xbox 360 Xenon 处理器上的堆栈与堆分配,我了解到的一件有趣的事情(可能也适用于其他多核系统)是,在堆上分配会导致进入临界区以停止所有其他核心,以便分配不会不冲突。因此,在紧密循环中,堆栈分配是固定大小数组的最佳选择,因为它可以防止停顿。
如果您正在为多核/多进程进行编码,这可能是另一个需要考虑的加速,因为您的堆栈分配只能由运行作用域函数的核心查看,并且不会影响任何其他核心/CPU。
An interesting thing I learned about Stack vs. Heap Allocation on the Xbox 360 Xenon processor, which may also apply to other multicore systems, is that allocating on the Heap causes a Critical Section to be entered to halt all other cores so that the alloc doesn't conflict. Thus, in a tight loop, Stack Allocation was the way to go for fixed sized arrays as it prevented stalls.
This may be another speedup to consider if you're coding for multicore/multiproc, in that your stack allocation will only be viewable by the core running your scoped function, and that will not affect any other cores/CPUs.
您可以为特定大小的对象编写一个性能非常好的特殊堆分配器。然而,通用堆分配器的性能并不是特别好。
我也同意 Torbjörn Gyllebring 关于物体预期寿命的观点。好点!
You can write a special heap allocator for specific sizes of objects that is very performant. However, the general heap allocator is not particularly performant.
Also I agree with Torbjörn Gyllebring about the expected lifetime of objects. Good point!
C++ 语言特有的问题
首先,C++ 不强制要求所谓的“堆栈”或“堆”分配。如果您谈论的是块作用域中的自动对象,它们甚至没有“分配”。 (顺便说一句,C 中的自动存储持续时间绝对与“分配”不同;后者在 C++ 术语中是“动态”的。)动态分配的内存位于自由存储上,不一定位于“堆”,尽管后者通常是(默认)实现。
尽管按照抽象机语义规则,自动对象仍然占用内存,当一个符合标准的 C++ 实现能够证明这并不重要时(当它不改变程序的可观察行为时),就可以忽略这个事实。此权限由 as-if 规则授予ISO C++,这也是支持通常优化的通用子句(ISO C 中也有几乎相同的规则)。除了 as-if 规则之外,ISO C++ 还有复制省略规则 允许省略对象的特定创建。因此省略了涉及的构造函数和析构函数调用。因此,与源代码隐含的朴素抽象语义相比,这些构造函数和析构函数中的自动对象(如果有)也被消除了。
另一方面,免费商店分配绝对是设计好的“分配”。根据 ISO C++ 规则,这样的分配可以通过调用分配函数来实现。然而,从 ISO C++14 开始,有一个新的(非 as -if) 规则 允许在特定情况下合并全局分配函数(即
::operator new
)调用。因此,动态分配操作的一部分也可以像自动对象的情况一样是无操作的。分配函数分配内存资源。可以使用分配器在分配的基础上进一步分配对象。对于自动对象,它们是直接呈现的——虽然底层内存可以被访问并用于为其他对象提供内存(通过放置
new
),但这对于自由存储来说没有多大意义,因为没有办法将资源转移到其他地方。所有其他问题都超出了 C++ 的范围。尽管如此,它们仍然具有重要意义。
关于 C++ 的实现
C++ 不公开具体化的激活记录或某种一流的延续(例如著名的
call/cc
),没有办法直接操作激活记录帧 - 实现需要将自动对象放置到其中。一旦与底层实现(“本机”非可移植代码,例如内联汇编代码)没有(不可移植)互操作,帧的底层分配的省略可能是相当微不足道的。例如,当被调用的函数被内联时,框架可以有效地合并到其他框架中,因此无法显示什么是“分配”。然而,一旦尊重互操作性,事情就会变得复杂。 C++ 的典型实现将公开 ISA(指令集架构)上的互操作能力,并使用一些调用约定作为与本机(ISA 级机器)代码共享的二进制边界。值得注意的是,在维护堆栈指针时,这显然是昂贵的,堆栈指针通常直接由 ISA 级寄存器保存(可能需要访问特定的机器指令)。堆栈指针指示(当前活动的)函数调用的顶部框架的边界。当进入函数调用时,需要一个新的帧,并且堆栈指针被添加或减去不小于所需帧大小的值(取决于ISA的约定)。当堆栈指针在操作之后时,该帧就被称为分配。函数的参数也可以传递到堆栈帧上,具体取决于调用所使用的调用约定。框架可以保存由C++源代码指定的自动对象(可能包括参数)的内存。从这种实现的意义上来说,这些对象是“分配的”。当控件退出函数调用时,不再需要该帧,通常通过将堆栈指针恢复到调用前的状态(根据调用约定之前保存)来释放该帧。这可以被视为“解除分配”。这些操作使激活记录有效地成为一种 LIFO 数据结构,因此它通常被称为“(调用)堆栈< /a>”。堆栈指针有效地指示堆栈的顶部位置。
由于大多数 C++ 实现(特别是针对 ISA 级本机代码并使用汇编语言作为其直接输出的实现)都使用类似的策略,因此这种令人困惑的“分配”方案很受欢迎。此类分配(以及解除分配)确实会花费机器周期,并且当(非优化)调用频繁发生时,成本可能会很高,即使现代 CPU 微体系结构可以通过硬件针对常见代码模式实现复杂的优化(例如使用堆栈引擎实现
PUSH
/POP
指令)。但无论如何,一般来说,堆栈帧分配的成本确实明显低于对操作自由存储的分配函数的调用(除非它完全优化掉),它本身可以具有数百(如果不是数百万:-)操作来维护堆栈指针和其他状态。分配函数通常基于托管环境提供的API(例如操作系统提供的运行时)。与保存自动对象以供函数调用的目的不同,这种分配是通用目的的,因此它们不会像堆栈那样具有框架结构。传统上,它们从称为堆(或多个堆)的池存储中分配空间。与“栈”不同,这里的“堆”概念并不表示所使用的数据结构; 它源自几十年前的早期语言实现。 (顺便说一句,调用堆栈通常由程序/线程启动时的环境从堆中分配固定或用户指定的大小。)用例的性质使得堆的分配和释放要复杂得多(比压入/弹出要复杂得多)堆栈帧),并且很难通过硬件直接优化。
对内存访问的影响
通常的堆栈分配总是将新帧放在顶部,因此它具有相当好的局部性。这对缓存很友好。 OTOH,在自由存储中随机分配的内存没有这样的属性。从 ISO C++17 开始,
提供了池资源模板。这种接口的直接目的是允许连续分配的结果在内存中靠近在一起。这承认了这样一个事实,即该策略通常有利于当代实现的性能,例如对现代体系结构中的缓存友好。不过,这是关于访问的性能,而不是分配。并发性
对内存的并发访问的预期在堆栈和堆之间可能会产生不同的影响。在典型的 C++ 实现中,调用堆栈通常由一个执行线程独占所有。 OTOH,堆通常在进程中的线程之间共享。对于此类堆,分配和释放函数必须保护共享的内部管理数据结构免受数据竞争的影响。因此,由于内部同步操作,堆分配和释放可能会产生额外的开销。
空间效率
由于用例和内部数据结构的性质,堆可能会受到内部内存的影响。碎片,而堆栈则没有。这对内存分配的性能没有直接影响,但在具有虚拟内存的系统中,低空间效率可能会降低内存访问的整体性能。当 HDD 用作物理内存的交换时,这尤其糟糕。它可能会导致相当长的延迟 - 有时长达数十亿个周期。
堆栈分配的局限性
虽然实际上堆栈分配在性能上往往优于堆分配,但这当然并不意味着堆栈分配总是可以取代堆分配。
首先,无法使用 ISO C++ 以可移植的方式在运行时指定大小的堆栈上分配空间。 alloca 和 G++ 的 VLA(可变长度数组)等实现提供了一些扩展,但有理由避免使用它们。 (IIRC,Linux 源代码最近删除了 VLA 的使用。)(另请注意,ISO C99 确实强制要求 VLA,但 ISO C11 将支持变为可选。)
其次,没有可靠且可移植的方法来检测堆栈空间耗尽。这通常称为堆栈溢出(嗯,此站点的词源),但可能更准确地说,堆栈溢出。实际上,这通常会导致无效的内存访问,然后程序的状态就会被破坏(……或者更糟糕的是,出现安全漏洞)。事实上,ISO C++ 没有“堆栈”的概念,并且 当资源耗尽时使其行为未定义。请注意应为自动对象留出多少空间。
如果堆栈空间耗尽,则堆栈中分配的对象过多,这可能是由于函数的主动调用过多或自动对象使用不当造成的。这种情况可能表明存在错误,例如没有正确退出条件的递归函数调用。
然而,有时需要深度递归调用。在需要支持未绑定活动调用(其中调用深度仅受总内存限制)的语言实现中,不可能像典型的那样直接使用(当代)本机调用堆栈作为目标语言激活记录C++ 实现。为了解决这个问题,需要构建激活记录的替代方法。例如, SML/NJ 显式在堆上分配帧并使用 仙人掌堆栈。这种激活记录帧的复杂分配通常不如调用堆栈帧那么快。但是,如果在保证适当的尾递归的情况下进一步实现此类语言,则直接堆栈分配在对象语言中(即,语言中的“对象”不存储为引用,而是可以一对一映射到非共享 C++ 对象的本机原始值)甚至更加复杂,通常会带来更多性能损失。当使用 C++ 实现此类语言时,很难估计性能影响。
Concerns Specific to the C++ Language
First of all, there is no so-called "stack" or "heap" allocation mandated by C++. If you are talking about automatic objects in block scopes, they are even not "allocated". (BTW, automatic storage duration in C is definitely NOT the same to "allocated"; the latter is "dynamic" in the C++ parlance.) The dynamically allocated memory is on the free store, not necessarily on "the heap", though the latter is often the (default) implementation.
Although as per the abstract machine semantic rules, automatic objects still occupy memory, a conforming C++ implementation is allowed to ignore this fact when it can prove this does not matter (when it does not change the observable behavior of the program). This permission is granted by the as-if rule in ISO C++, which is also the general clause enabling the usual optimizations (and there is also an almost same rule in ISO C). Besides the as-if rule, ISO C++ also has copy elision rules to allow omission of specific creations of objects. The constructor and destructor calls involved are thereby omitted. As a result, the automatic objects (if any) in these constructors and destructors are also eliminated, compared to naive abstract semantics implied by the source code.
On the other hand, free store allocation is definitely "allocation" by design. Under ISO C++ rules, such an allocation can be achieved by a call of an allocation function. However, since ISO C++14, there is a new (non-as-if) rule to allow merging global allocation function (i.e.
::operator new
) calls in specific cases. So parts of dynamic allocation operations can also be no-op like the case of automatic objects.Allocation functions allocate resources of memory. Objects can be further allocated based on allocation using allocators. For automatic objects, they are directly presented - although the underlying memory can be accessed and be used to provide memory to other objects (by placement
new
), but this does not make great sense as the free store, because there is no way to move the resources elsewhere.All other concerns are out of the scope of C++. Nevertheless, they can be still significant.
About Implementations of C++
C++ does not expose reified activation records or some sorts of first-class continuations (e.g. by the famous
call/cc
), there is no way to directly manipulate the activation record frames - where the implementation need to place the automatic objects to. Once there is no (non-portable) interoperations with the underlying implementation ("native" non-portable code, such as inline assembly code), an omission of the underlying allocation of the frames can be quite trivial. For example, when the called function is inlined, the frames can be effectively merged into others, so there is no way to show what is the "allocation".However, once interops are respected, things are getting complex. A typical implementation of C++ will expose the ability of interop on ISA (instruction-set architecture) with some calling conventions as the binary boundary shared with the native (ISA-level machine) code. This would be explicitly costly, notably, when maintaining the stack pointer, which is often directly held by an ISA-level register (with probably specific machine instructions to access). The stack pointer indicates the boundary of the top frame of the (currently active) function call. When a function call is entered, a new frame is needed and the stack pointer is added or subtracted (depending on the convention of ISA) by a value not less than the required frame size. The frame is then said allocated when the stack pointer after the operations. Parameters of functions may be passed onto the stack frame as well, depending on the calling convention used for the call. The frame can hold the memory of automatic objects (probably including the parameters) specified by the C++ source code. In the sense of such implementations, these objects are "allocated". When the control exits the function call, the frame is no longer needed, it is usually released by restoring the stack pointer back to the state before the call (saved previously according to the calling convention). This can be viewed as "deallocation". These operations make the activation record effectively a LIFO data structure, so it is often called "the (call) stack". The stack pointer effectively indicates the top position of the stack.
Because most C++ implementations (particularly the ones targeting ISA-level native code and using the assembly language as its immediate output) use similar strategies like this, such a confusing "allocation" scheme is popular. Such allocations (as well as deallocations) do spend machine cycles, and it can be expensive when the (non-optimized) calls occur frequently, even though modern CPU microarchitectures can have complex optimizations implemented by hardware for the common code pattern (like using a stack engine in implementing
PUSH
/POP
instructions).But anyway, in general, it is true that the cost of stack frame allocation is significantly less than a call to an allocation function operating the free store (unless it is totally optimized away), which itself can have hundreds of (if not millions of :-) operations to maintain the stack pointer and other states. Allocation functions are typically based on API provided by the hosted environment (e.g. runtime provided by the OS). Different to the purpose of holding automatic objects for functions calls, such allocations are general-purpose, so they will not have frame structure like a stack. Traditionally, they allocate space from the pool storage called the heap (or several heaps). Different from the "stack", the concept "heap" here does not indicate the data structure being used; it is derived from early language implementations decades ago. (BTW, the call stack is usually allocated with fixed or user-specified size from the heap by the environment in program/thread startup.) The nature of use cases makes allocations and deallocations from a heap far more complicated (than pushing/poppoing of stack frames), and hardly possible to be directly optimized by hardware.
Effects on Memory Access
The usual stack allocation always puts the new frame on the top, so it has a quite good locality. This is friendly to the cache. OTOH, memory allocated randomly in the free store has no such property. Since ISO C++17, there are pool resource templates provided by
<memory_resource>
. The direct purpose of such an interface is to allow the results of consecutive allocations being close together in memory. This acknowledges the fact that this strategy is generally good for performance with contemporary implementations, e.g. being friendly to cache in modern architectures. This is about the performance of access rather than allocation, though.Concurrency
Expectation of concurrent access to memory can have different effects between the stack and heaps. A call stack is usually exclusively owned by one thread of execution in a typical C++ implementation. OTOH, heaps are often shared among the threads in a process. For such heaps, the allocation and deallocation functions have to protect the shared internal administrative data structure from the data race. As a result, heap allocations and deallocations may have additional overhead due to internal synchronization operations.
Space Efficiency
Due to the nature of the use cases and internal data structures, heaps may suffer from internal memory fragmentation, while the stack does not. This does not have direct impacts on the performance of memory allocation, but in a system with virtual memory, low space efficiency may degenerate overall performance of memory access. This is particularly awful when HDD is used as a swap of physical memory. It can cause quite long latency - sometimes billions of cycles.
Limitations of Stack Allocations
Although stack allocations are often superior in performance than heap allocations in reality, it certainly does not mean stack allocations can always replace heap allocations.
First, there is no way to allocate space on the stack with a size specified at runtime in a portable way with ISO C++. There are extensions provided by implementations like
alloca
and G++'s VLA (variable-length array), but there are reasons to avoid them. (IIRC, Linux source removes the use of VLA recently.) (Also note ISO C99 does have mandated VLA, but ISO C11 turns the support optional.)Second, there is no reliable and portable way to detect stack space exhaustion. This is often called stack overflow (hmm, the etymology of this site), but probably more accurately, stack overrun. In reality, this often causes invalid memory access, and the state of the program is then corrupted (... or maybe worse, a security hole). In fact, ISO C++ has no concept of "the stack" and makes it undefined behavior when the resource is exhausted. Be cautious about how much room should be left for automatic objects.
If the stack space runs out, there are too many objects allocated in the stack, which can be caused by too many active calls of functions or improper use of automatic objects. Such cases may suggest the existence of bugs, e.g. a recursive function call without correct exit conditions.
Nevertheless, deep recursive calls are sometimes desired. In implementations of languages requiring support of unbound active calls (where the call depth only limited by total memory), it is impossible to use the (contemporary) native call stack directly as the target language activation record like typical C++ implementations. To work around the problem, alternative ways of the construction of activation records are needed. For example, SML/NJ explicitly allocates frames on the heap and uses cactus stacks. The complicated allocation of such activation record frames is usually not as fast as the call stack frames. However, if such languages are implemented further with the guarantee of proper tail recursion, the direct stack allocation in the object language (that is, the "object" in the language does not stored as references, but native primitive values which can be one-to-one mapped to unshared C++ objects) is even more complicated with more performance penalty in general. When using C++ to implement such languages, it is difficult to estimate the performance impacts.
我不认为堆栈分配和堆分配通常可以互换。我也希望两者的性能足以满足一般使用。
我强烈推荐小物品,选择更适合分配范围的物品。对于大型项目,堆可能是必要的。
在具有多个线程的 32 位操作系统上,堆栈通常相当有限(尽管通常至少为几 MB),因为需要划分地址空间,迟早一个线程堆栈会遇到另一个线程堆栈。在单线程系统(Linux glibc 单线程)上,限制要小得多,因为堆栈可以不断增长。
在 64 位操作系统上,有足够的地址空间使线程堆栈变得相当大。
I don't think stack allocation and heap allocation are generally interchangable. I also hope that the performance of both of them is sufficient for general use.
I'd strongly recommend for small items, whichever one is more suitable to the scope of the allocation. For large items, the heap is probably necessary.
On 32-bit operating systems that have multiple threads, stack is often rather limited (albeit typically to at least a few mb), because the address space needs to be carved up and sooner or later one thread stack will run into another. On single threaded systems (Linux glibc single threaded anyway) the limitation is much less because the stack can just grow and grow.
On 64-bit operating systems there is enough address space to make thread stacks quite large.
通常堆栈分配只是从堆栈指针寄存器中减去。这比搜索堆快得多。
有时堆栈分配需要添加虚拟内存页面。添加新的归零内存页面不需要从磁盘读取页面,因此通常这仍然比搜索堆快得多(特别是如果堆的一部分也被调出)。在极少数情况下,您可以构建这样一个示例,RAM 中已有的堆部分恰好有足够的空间可用,但为堆栈分配新页面必须等待其他页面写出到磁盘。在这种罕见的情况下,堆速度更快。
Usually stack allocation just consists of subtracting from the stack pointer register. This is tons faster than searching a heap.
Sometimes stack allocation requires adding a page(s) of virtual memory. Adding a new page of zeroed memory doesn't require reading a page from disk, so usually this is still going to be tons faster than searching a heap (especially if part of the heap was paged out too). In a rare situation, and you could construct such an example, enough space just happens to be available in part of the heap which is already in RAM, but allocating a new page for the stack has to wait for some other page to get written out to disk. In that rare situation, the heap is faster.
除了比堆分配具有数量级的性能优势之外,堆栈分配更适合长时间运行的服务器应用程序。即使是最好的托管堆最终也会变得碎片化,从而导致应用程序性能下降。
Aside from the orders-of-magnitude performance advantage over heap allocation, stack allocation is preferable for long running server applications. Even the best managed heaps eventually get so fragmented that application performance degrades.
堆分配与堆栈分配的最大问题可能是,一般情况下堆分配是无界操作,因此在计时有问题的情况下不能使用它。
对于其他时间不是问题的应用程序,它可能并不那么重要,但如果堆分配很多,这将影响执行速度。始终尝试将堆栈用于短期内存和经常分配的内存(例如在循环中),并尽可能长时间使用 - 在应用程序启动期间进行堆分配。
Probably the biggest problem of heap allocation versus stack allocation, is that heap allocation in the general case is an unbounded operation, and thus you can't use it where timing is an issue.
For other applications where timing isn't an issue, it may not matter as much, but if you heap allocate a lot, this will affect the execution speed. Always try to use the stack for short lived and often allocated memory (for instance in loops), and as long as possible - do heap allocation during application startup.
栈的容量有限,而堆则不然。进程或线程的典型堆栈约为 8K。大小一旦分配就无法更改。
堆栈变量遵循作用域规则,而堆变量则不然。如果指令指针超出函数范围,则与该函数关联的所有新变量都会消失。
最重要的是,您无法提前预测整个函数调用链。因此,仅仅分配 200 字节就可能会引发堆栈溢出。如果您正在编写库而不是应用程序,这一点尤其重要。
A stack has a limited capacity, while a heap is not. The typical stack for a process or thread is around 8K. You cannot change the size once it's allocated.
A stack variable follows the scoping rules, while a heap one doesn't. If your instruction pointer goes beyond a function, all the new variables associated with the function go away.
Most important of all, you can't predict the overall function call chain in advance. So a mere 200 bytes allocation on your part may raise a stack overflow. This is especially important if you're writing a library, not an application.
这不仅仅是堆栈分配更快。使用堆栈变量也能带来很多好处。他们有更好的参考局部性。最后,解除分配也便宜得多。
It's not jsut stack allocation that's faster. You also win a lot on using stack variables. They have better locality of reference. And finally, deallocation is a lot cheaper too.
正如其他人所说,堆栈分配通常要快得多。
但是,如果复制对象的成本很高,那么如果不小心的话,在堆栈上分配可能会导致稍后使用对象时性能受到巨大影响。
例如,如果您在堆栈上分配某些内容,然后将其放入容器中,那么最好在堆上分配并将指针存储在容器中(例如使用 std::shared_ptr<>)。如果您按值传递或返回对象以及其他类似的场景,情况也是如此。
关键是,尽管在许多情况下堆栈分配通常比堆分配更好,但有时如果您在不最适合计算模型时不遗余力地进行堆栈分配,则可能会导致比它解决的问题更多的问题。
As others have said, stack allocation is generally much faster.
However, if your objects are expensive to copy, allocating on the stack may lead to an huge performance hit later when you use the objects if you aren't careful.
For example, if you allocate something on the stack, and then put it into a container, it would have been better to allocate on the heap and store the pointer in the container (e.g. with a std::shared_ptr<>). The same thing is true if you are passing or returning objects by value, and other similar scenarios.
The point is that although stack allocation is usually better than heap allocation in many cases, sometimes if you go out of your way to stack allocate when it doesn't best fit the model of computation, it can cause more problems than it solves.
堆栈分配是几条指令,而据我所知最快的 rtos 堆分配器 (TLSF) 平均使用大约 150 条指令。此外,堆栈分配不需要锁,因为它们使用线程本地存储,这是另一个巨大的性能优势。因此,堆栈分配速度可能会快 2-3 个数量级,具体取决于您的环境的多线程程度。
一般来说,如果您关心性能,堆分配是您的最后手段。一个可行的中间选项可以是固定池分配器,它也只有几条指令,并且每次分配的开销非常小,因此它非常适合小型固定大小的对象。缺点是它只适用于固定大小的对象,本质上不是线程安全的,并且存在块碎片问题。
Stack allocation is a couple instructions whereas the fastest rtos heap allocator known to me (TLSF) uses on average on the order of 150 instructions. Also stack allocations don't require a lock because they use thread local storage which is another huge performance win. So stack allocations can be 2-3 orders of magnitude faster depending on how heavily multithreaded your environment is.
In general heap allocation is your last resort if you care about performance. A viable in-between option can be a fixed pool allocator which is also only a couple instructions and has very little per-allocation overhead so it's great for small fixed size objects. On the downside it only works with fixed size objects, is not inherently thread safe and has block fragmentation problems.
我认为生命周期很重要,分配的东西是否必须以复杂的方式构建。例如,在事务驱动的建模中,您通常必须填写一个包含一堆字段的事务结构并将其传递给操作函数。查看 OSCI SystemC TLM-2.0 标准的示例。
将它们分配在靠近操作调用的堆栈上往往会导致巨大的开销,因为构造成本很高。最好的方法是在堆上分配并通过池化或诸如“此模块只需要一个事务对象”之类的简单策略来重用事务对象。
这比在每个操作调用上分配对象要快很多倍。
原因很简单,该物体具有昂贵的结构和相当长的使用寿命。
我想说:尝试两者,看看哪种最适合您的情况,因为它实际上取决于您的代码的行为。
I think the lifetime is crucial, and whether the thing being allocated has to be constructed in a complex way. For example, in transaction-driven modeling, you usually have to fill in and pass in a transaction structure with a bunch of fields to operation functions. Look at the OSCI SystemC TLM-2.0 standard for an example.
Allocating these on the stack close to the call to the operation tends to cause enormous overhead, as the construction is expensive. The good way there is to allocate on the heap and reuse the transaction objects either by pooling or a simple policy like "this module only needs one transaction object ever".
This is many times faster than allocating the object on each operation call.
The reason is simply that the object has an expensive construction and a fairly long useful lifetime.
I would say: try both and see what works best in your case, because it can really depend on the behavior of your code.
关于此类优化有一个普遍的观点。
您获得的优化与程序计数器实际在该代码中的时间量成正比。
如果您对程序计数器进行采样,您会发现它在哪里花费了时间,而这通常是在代码的一小部分,并且通常在您无法控制的库例程中。
只有当您发现它在对象的堆分配上花费了大量时间时,堆栈分配它们才会明显更快。
There's a general point to be made about such optimizations.
The optimization you get is proportional to the amount of time the program counter is actually in that code.
If you sample the program counter, you will find out where it spends its time, and that is usually in a tiny part of the code, and often in library routines you have no control over.
Only if you find it spending much time in the heap-allocation of your objects will it be noticeably faster to stack-allocate them.
堆栈分配几乎总是与堆分配一样快或更快,尽管堆分配器当然可以简单地使用基于堆栈的分配技术。
然而,在处理堆栈与基于堆的分配(或者稍微好一点的术语,本地分配与外部分配)的整体性能时,存在更大的问题。通常,堆(外部)分配很慢,因为它要处理许多不同类型的分配和分配模式。减少您正在使用的分配器的范围(使其成为算法/代码的本地分配器)往往会在不进行任何重大更改的情况下提高性能。为分配模式添加更好的结构(例如,对分配和释放对强制执行 LIFO 排序)还可以通过以更简单、更结构化的方式使用分配器来提高分配器的性能。或者,您可以使用或编写一个针对您的特定分配模式进行调整的分配器;大多数程序经常分配一些离散的大小,因此基于一些固定(最好是已知)大小的后备缓冲区的堆将表现得非常好。正是出于这个原因,Windows 使用其低碎片堆。
另一方面,如果线程太多,32 位内存范围上基于堆栈的分配也会充满危险。堆栈需要连续的内存范围,因此拥有的线程越多,它们需要的虚拟地址空间就越多,才能在不发生堆栈溢出的情况下运行。对于 64 位来说,这不会是一个问题(目前),但它肯定会对具有大量线程的长时间运行的程序造成严重破坏。由于碎片而耗尽虚拟地址空间始终是一个令人头疼的问题。
Stack allocation will almost always be as fast or faster than heap allocation, although it is certainly possible for a heap allocator to simply use a stack based allocation technique.
However, there are larger issues when dealing with the overall performance of stack vs. heap based allocation (or in slightly better terms, local vs. external allocation). Usually, heap (external) allocation is slow because it is dealing with many different kinds of allocations and allocation patterns. Reducing the scope of the allocator you are using (making it local to the algorithm/code) will tend to increase performance without any major changes. Adding better structure to your allocation patterns, for example, forcing a LIFO ordering on allocation and deallocation pairs can also improve your allocator's performance by using the allocator in a simpler and more structured way. Or, you can use or write an allocator tuned for your particular allocation pattern; most programs allocate a few discrete sizes frequently, so a heap that is based on a lookaside buffer of a few fixed (preferably known) sizes will perform extremely well. Windows uses its low-fragmentation-heap for this very reason.
On the other hand, stack-based allocation on a 32-bit memory range is also fraught with peril if you have too many threads. Stacks need a contiguous memory range, so the more threads you have, the more virtual address space you will need for them to run without a stack overflow. This won't be a problem (for now) with 64-bit, but it can certainly wreak havoc in long running programs with lots of threads. Running out of virtual address space due to fragmentation is always a pain to deal with.
请注意,在选择堆栈与堆分配时,考虑的因素通常不是速度和性能。堆栈的作用类似于堆栈,这意味着它非常适合推入块并再次弹出它们,后进先出。程序的执行也是类似堆栈的,最后进入的程序最先退出。在大多数编程语言中,过程中所需的所有变量仅在过程执行期间可见,因此它们在进入过程时被压入,并在退出或返回时从堆栈中弹出。
现在举一个不能使用堆栈的例子:
如果您在过程S中分配了一些内存并将其放入堆栈中,然后退出S,则分配的数据将从堆栈中弹出。但是 P 中的变量 x 也指向该数据,因此 x 现在指向堆栈指针下方的某个位置(假设堆栈向下增长),内容未知。如果堆栈指针只是向上移动而没有清除其下面的数据,则内容可能仍然存在,但如果您开始在堆栈上分配新数据,则指针 x 实际上可能指向该新数据。
Remark that the considerations are typically not about speed and performance when choosing stack versus heap allocation. The stack acts like a stack, which means it is well suited for pushing blocks and popping them again, last in, first out. Execution of procedures is also stack-like, last procedure entered is first to be exited. In most programming languages, all the variables needed in a procedure will only be visible during the procedure's execution, thus they are pushed upon entering a procedure and popped off the stack upon exit or return.
Now for an example where the stack cannot be used:
If you allocate some memory in procedure S and put it on the stack and then exit S, the allocated data will be popped off the stack. But the variable x in P also pointed to that data, so x is now pointing to some place underneath the stack pointer (assume stack grows downwards) with an unknown content. The content might still be there if the stack pointer is just moved up without clearing the data beneath it, but if you start allocating new data on the stack, the pointer x might actually point to that new data instead.
在asm中就会是这样的。当您处于
func
中时,f1
和指针f2
已在堆栈上分配(自动存储)。顺便说一句,Foof1(a1)
对堆栈指针(esp
)没有指令影响,它已经被分配了,如果func
想要获取成员f1
,它的指令是这样的:lea ecx [ebp+f1], call Foo::SomeFunc()
。堆栈分配的另一件事可能会让某人认为内存类似于FIFO
,FIFO
只是在您进入某个函数时发生,如果您在该函数中并分配了一些东西就像int i = 0
一样,没有发生推送。It would be like this in asm. When you're in
func
, thef1
and pointerf2
has been allocated on stack (automated storage). And by the way, Foof1(a1)
has no instruction effects on stack pointer (esp
),It has been allocated, iffunc
wants get the memberf1
, it's instruction is something like this:lea ecx [ebp+f1], call Foo::SomeFunc()
. Another thing the stack allocate may make someone think the memory is something likeFIFO
, theFIFO
just happened when you go into some function, if you are in the function and allocate something likeint i = 0
, there no push happened.之前已经提到过,堆栈分配只是简单地移动堆栈指针,即大多数体系结构上的单个指令。将其与堆分配情况下通常发生的情况进行比较。
操作系统将空闲内存部分维护为链表,其中有效负载数据由指向空闲部分起始地址的指针和空闲部分的大小组成。为了分配 X 字节的内存,遍历链接列表并按顺序访问每个注释,检查其大小是否至少为 X。当找到大小 P >= X 的部分时,将 P 分成两部分尺寸为 X 和 PX。链表被更新并返回指向第一部分的指针。
正如您所看到的,堆分配取决于许多因素,例如您请求的内存量、内存的碎片程度等等。
It has been mentioned before that stack allocation is simply moving the stack pointer, that is, a single instruction on most architectures. Compare that to what generally happens in the case of heap allocation.
The operating system maintains portions of free memory as a linked list with the payload data consisting of the pointer to the starting address of the free portion and the size of the free portion. To allocate X bytes of memory, the link list is traversed and each note is visited in sequence, checking to see if its size is at least X. When a portion with size P >= X is found, P is split into two parts with sizes X and P-X. The linked list is updated and the pointer to the first part is returned.
As you can see, heap allocation depends on may factors like how much memory you are requesting, how fragmented the memory is and so on.
一般来说,正如上面几乎每个答案都提到的那样,堆栈分配比堆分配更快。堆栈压入或弹出的时间复杂度为 O(1),而从堆中分配或释放可能需要遍历先前的分配。然而,您通常不应该在紧密的性能密集型循环中进行分配,因此选择通常取决于其他因素。
做出这种区分可能会很好:您可以在堆上使用“堆栈分配器”。严格来说,我认为堆栈分配是指实际的分配方法而不是分配的位置。如果您在实际程序堆栈上分配大量内容,那么由于多种原因,这可能会很糟糕。另一方面,如果可能的话,使用堆栈方法在堆上分配是分配方法的最佳选择。
既然你提到了 Metrowerks 和 PPC,我猜你指的是 Wii。在这种情况下,内存非常宝贵,并且尽可能使用堆栈分配方法可以保证您不会在片段上浪费内存。当然,这样做比“正常”堆分配方法需要更加小心。评估每种情况的权衡是明智的。
In general, stack allocation is faster than heap allocation as mentioned by almost every answer above. A stack push or pop is O(1), whereas allocating or freeing from a heap could require a walk of previous allocations. However you shouldn't usually be allocating in tight, performance-intensive loops, so the choice will usually come down to other factors.
It might be good to make this distinction: you can use a "stack allocator" on the heap. Strictly speaking, I take stack allocation to mean the actual method of allocation rather than the location of the allocation. If you're allocating a lot of stuff on the actual program stack, that could be bad for a variety of reasons. On the other hand, using a stack method to allocate on the heap when possible is the best choice you can make for an allocation method.
Since you mentioned Metrowerks and PPC, I'm guessing you mean Wii. In this case memory is at a premium, and using a stack allocation method wherever possible guarantees that you don't waste memory on fragments. Of course, doing this requires a lot more care than "normal" heap allocation methods. It's wise to evaluate the tradeoffs for each situation.
当然,堆栈分配速度更快。对于堆分配,分配器必须在某处找到可用内存。通过堆栈分配,编译器只需为您的函数提供一个更大的堆栈框架即可为您完成此操作,这意味着分配根本不花费时间。 (我假设您没有使用alloca或任何东西来分配动态数量的堆栈空间,但即使这样它也非常快。)
但是,您必须警惕隐藏的动态分配。例如:
您可能认为这会在堆栈上分配 4 KiB,但您错了。它在堆栈上分配
vector
实例,但该vector
实例又在堆上分配其 4 KiB,因为vector
总是 在堆上分配其内部数组(至少除非您指定自定义分配器,我不会在这里讨论)。如果您想使用类似 STL 的容器在堆栈上分配,您可能需要std: :array
,或者可能boost::static_vector
(由外部Boost 图书馆)。Naturally, stack allocation is faster. With heap allocation, the allocator has to find the free memory somewhere. With stack allocation, the compiler does it for you by simply giving your function a larger stack frame, which means the allocation costs no time at all. (I'm assuming you're not using
alloca
or anything to allocate a dynamic amount of stack space, but even then it's very fast.)However, you do have to be wary of hidden dynamic allocation. For example:
You might think this allocates 4 KiB on the stack, but you'd be wrong. It allocates the
vector
instance on the stack, but thatvector
instance in turn allocates its 4 KiB on the heap, becausevector
always allocates its internal array on the heap (at least unless you specify a custom allocator, which I won't get into here). If you want to allocate on the stack using an STL-like container, you probably wantstd::array
, or possiblyboost::static_vector
(provided by the external Boost library).切勿过早假设,因为其他应用程序代码和使用情况可能会影响您的功能。所以光看功能是否隔离是没有用的。
如果您认真对待应用程序,请对其进行 VTune 或使用任何类似的分析工具并查看热点。
克坦
Never do premature assumption as other application code and usage can impact your function. So looking at function is isolation is of no use.
If you are serious with application then VTune it or use any similar profiling tool and look at hotspots.
Ketan
我想说的是,实际上由 GCC 生成的代码(我也记得 VS)没有进行堆栈分配的开销。
说一下以下函数:
以下是生成的代码:
因此,无论您有多少局部变量(即使在 if 或 switch 内部),只有 3880 会更改为另一个值。除非你没有局部变量,否则这条指令只需要执行即可。所以分配局部变量没有开销。
I'd like to say actually code generate by GCC (I remember VS also) doesn't have overhead to do stack allocation.
Say for following function:
Following is the code generate:
So whatevery how much local variable you have (even inside if or switch), just the 3880 will change to another value. Unless you didn't have local variable, this instruction just need to execute. So allocate local variable doesn't have overhead.