Optimizing for space instead of speed in C++
When you say "optimization", people tend to think "speed". But what about embedded systems where speed isn't all that critical, but memory is a major constraint? What are some guidelines, techniques, and tricks that can be used for shaving off those extra kilobytes in ROM and RAM? How does one "profile" code to see where the memory bloat is?
P.S. One could argue that "prematurely" optimizing for space in embedded systems isn't all that evil, because you leave yourself more room for data storage and feature creep. It also allows you to cut hardware production costs because your code can run on smaller ROM/RAM.
P.P.S. References to articles and books are welcome too!
P.P.P.S. These questions are closely related: 404615, 1561629
16 Answers
My experience from an extremely constrained embedded memory environment:
There are many things you can do to reduce your memory footprint; I'm sure people have written books on the subject, but a few of the major ones are:
Compiler options to reduce code size (including -Os and packing/alignment options)
Linker options to strip dead code
If you're loading from flash (or ROM) to ram to execute (rather than executing from flash), then use a compressed flash image, and decompress it with your bootloader.
Use static allocation: a heap is an inefficient way to allocate limited memory, and it might fail due to fragmentation if memory is constrained.
Tools to find the stack high-watermark (typically they fill the stack with a pattern, execute the program, then see where the pattern remains), so you can set the stack size(s) optimally; a sketch of this technique follows the list
And of course, optimising the algorithms you use for memory footprint (often at the expense of speed)
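As an illustration of the stack high-watermark bullet above, here is a minimal sketch. It assumes a bare-metal target whose linker script exports symbols bounding the stack region; the symbol names (__stack_limit, __stack_top) and the paint value are invented for the example.

    #include <stdint.h>
    #include <stddef.h>

    // Hypothetical linker-script symbols bounding the stack region.
    extern "C" uint32_t __stack_limit[];   // lowest stack address
    extern "C" uint32_t __stack_top[];     // highest stack address

    static const uint32_t kStackPaint = 0xDEADBEEFu;

    // Call very early, while almost none of the stack is in use:
    // fill the unused part of the stack with a known pattern.
    extern "C" void paint_stack(void)
    {
        uint32_t marker;                   // roughly the current stack pointer
        for (uint32_t *p = __stack_limit; p < &marker; ++p)
            *p = kStackPaint;
    }

    // Call periodically or at shutdown: everything still holding the
    // pattern was never touched, so the rest is the high-watermark.
    extern "C" size_t stack_high_watermark(void)
    {
        const uint32_t *p = __stack_limit;
        while (p < __stack_top && *p == kStackPaint)
            ++p;
        return (size_t)((const uint8_t *)__stack_top - (const uint8_t *)p);
    }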
A few obvious ones
Declare constant data tables const. This will avoid the data being copied from flash to RAM.
Folding knowledge into data
One of the rules of Unix philosophy can help make code more compact - the Rule of Representation: "Fold knowledge into data, so program logic can be stupid and robust."
I can't count how many times I've seen elaborate branching logic, spanning many pages, that could've been folded into a nice compact table of rules, constants, and function pointers. State machines can often be represented this way (State Pattern). The Command Pattern also applies. It's all about the declarative vs imperative styles of programming.
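A sketch of what that folding can look like in practice; the events and handlers here are invented for illustration.

    #include <stddef.h>

    enum Event { EV_BUTTON, EV_TIMER, EV_OVERTEMP };

    typedef void (*Handler)(void);

    static void on_button(void)   { /* ... */ }
    static void on_timer(void)    { /* ... */ }
    static void on_overtemp(void) { /* ... */ }

    // One small const (flash-resident) table replaces pages of branching.
    struct Rule { Event event; Handler handler; };

    static const Rule kRules[] = {
        { EV_BUTTON,   on_button   },
        { EV_TIMER,    on_timer    },
        { EV_OVERTEMP, on_overtemp },
    };

    void dispatch(Event e)
    {
        for (size_t i = 0; i < sizeof kRules / sizeof kRules[0]; ++i)
            if (kRules[i].event == e) { kRules[i].handler(); return; }
    }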
Log codes + binary data instead of text
Instead of logging plain text, log event codes and binary data. Then use a "phrasebook" to reconstitute the event messages. The messages in the phrasebook can even contain printf-style format specifiers, so that the event data values are displayed neatly within the text.
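A rough sketch of the scheme, with made-up event codes; the phrasebook mapping codes to format strings lives in a host-side tool, so the text never occupies target ROM, and write_to_log_buffer() is only a placeholder for whatever transport the application uses.

    #include <stdint.h>

    // Event codes (invented for the example).
    static const uint16_t EVT_BOOT        = 1;   // phrasebook: "System booted, build %u"
    static const uint16_t EVT_VOLTAGE_LOW = 2;   // phrasebook: "Supply voltage low: %u mV"

    struct LogRecord {
        uint16_t code;
        uint16_t value;
    };

    // On the target, only 4 bytes per event are stored or transmitted.
    void log_event(uint16_t code, uint16_t value)
    {
        LogRecord rec = { code, value };
        // write_to_log_buffer(&rec, sizeof rec);  // transport is application-specific
        (void)rec;
    }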
Minimize the number of threads
Each thread needs its own memory block for a stack and TSS. Where you don't need preemption, consider making your tasks execute co-operatively within the same thread (cooperative multi-tasking).
Use memory pools instead of hoarding
To avoid heap fragmentation, I've often seen separate modules hoard large static memory buffers for their own use, even when the memory is only occasionally required. A memory pool could be used instead so that the memory is only used "on demand". However, this approach may require careful analysis and instrumentation to make sure pools are not depleted at runtime.
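A minimal fixed-block pool sketch, with arbitrary block size and count; a real one would add alignment guarantees and the usage counters needed for the instrumentation mentioned above.

    #include <stddef.h>

    // Fixed-size-block pool: O(1) acquire/release, no fragmentation.
    // BlockSize must be at least sizeof(void*).
    template <size_t BlockSize, size_t BlockCount>
    class Pool {
    public:
        Pool() : free_list_(0) {
            for (size_t i = 0; i < BlockCount; ++i)
                release(storage_ + i * BlockSize);
        }
        void *acquire() {                    // returns 0 when the pool is exhausted
            Node *n = free_list_;
            if (n) free_list_ = n->next;
            return n;
        }
        void release(void *p) {              // return a block obtained from acquire()
            Node *n = static_cast<Node *>(p);
            n->next = free_list_;
            free_list_ = n;
        }
    private:
        struct Node { Node *next; };
        Node *free_list_;
        unsigned char storage_[BlockSize * BlockCount];
    };

    // e.g. 16 communication buffers of 128 bytes, shared "on demand".
    static Pool<128, 16> g_packet_pool;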
Dynamic allocation only at initialization
In embedded systems where only one application runs indefinitely, you can use dynamic allocation in a sensible way that doesn't lead to fragmentation: Just dynamically allocate once in your various initialization routines, and never free the memory.
reserve() your containers to the correct capacity and don't let them auto-grow. If you need to frequently allocate/free buffers of data (say, for communication packets), then use memory pools. I once even extended the C/C++ runtime so that it would abort my program if anything tried to dynamically allocate memory after the initialization sequence.
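Putting the last two points together, a small sketch (the container and sizes are arbitrary for illustration):

    #include <vector>

    struct Sample { unsigned short raw; };

    class Acquisition {
    public:
        void init() {
            samples_.reserve(512);            // one allocation, done at initialization
        }
        void push(const Sample &s) {
            if (samples_.size() < samples_.capacity())
                samples_.push_back(s);        // never grows past the reserved capacity
        }
    private:
        std::vector<Sample> samples_;
    };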
As with all optimization, first optimize algorithms, second optimize the code and data, finally optimize the compiler.
I don't know what your program does, so I can't advise on algorithms. Many others have written about the compiler. So, here's some advice on code and data:
Generate a map file from your linker. It will show how the memory is allocated. This is a good start when optimizing for memory usage. It also will show all the functions and how the code-space is laid out.
Here's a book on the subject: Small Memory Software: Patterns for Systems with Limited Memory.
Compile in VS with /Os. Often this is even faster than optimizing for speed anyway, because smaller code size == less paging.
Comdat folding should be enabled in the linker (it is by default in release builds)
Be careful about data structure packing; often this results in the compiler generating more code (== more memory) to access unaligned memory. Using 1 bit for a boolean flag is a classic example (see the sketch below).
Also, be careful when choosing a memory efficient algorithm over an algorithm with a better runtime. This is where premature optimizations come in.
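To illustrate that packing caveat with a made-up example: the bit-field version saves RAM, but every access to a flag compiles to extra mask/shift instructions, so code size (and time) can grow.

    #include <stdint.h>

    // 1 byte of data, but each read/write needs load + mask/shift + store.
    struct FlagsPacked {
        uint8_t ready   : 1;
        uint8_t error   : 1;
        uint8_t overrun : 1;
    };

    // 3 bytes of data, but each flag is a plain byte access.
    struct FlagsPlain {
        uint8_t ready;
        uint8_t error;
        uint8_t overrun;
    };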
OK, most were mentioned already, but here is my list anyway:
statically allocated buffer (pool or maximum instance sized static buffer).
rather Super-C and C++ is used where it counts: in high level logic, GUI, etc.
Last but not least - while hunting for smallest possible code size - don't overdo it. Watch out also for performance and maintainability. Over-optimized code tends to decay very quickly.
Firstly, tell your compiler to optimize for code size. GCC has the -Os flag for this. Everything else is at the algorithmic level - use similar tools that you would for finding memory leaks, but instead look for allocs and frees that you could avoid.
Also take a look at commonly used data structure packing - if you can shave a byte or two off them, you can cut down memory use substantially.
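As a sketch of the kind of saving meant here (exact sizes depend on the ABI; these numbers assume 4-byte alignment for uint32_t):

    #include <stdint.h>

    // Typically 12 bytes: padding after 'type' and after 'channel'.
    struct RecordUnordered {
        uint8_t  type;
        uint32_t timestamp;
        uint8_t  channel;
        uint16_t value;
    };

    // Typically 8 bytes: members ordered largest-first, padding eliminated.
    struct RecordOrdered {
        uint32_t timestamp;
        uint16_t value;
        uint8_t  type;
        uint8_t  channel;
    };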
If you're looking for a good way to profile your application's heap usage, check out valgrind's massif tool. It will let you take snapshots of your app's memory usage profile over time, and you can then use that information to better see where the "low hanging fruit" is, and aim your optimizations accordingly.
Profiling code or data bloat can be done via map files: for gcc see here, for VS see here.
I have yet to see a useful tool for size profiling though (and don't have time to fix my VS AddIn hack).
On top of what others suggest:
Limit use of C++ features; write as in ANSI C with minor extensions. Standard (std::) templates use a large system of dynamic allocation. If you can, avoid templates altogether. While not inherently harmful, they make it way too easy to generate lots and lots of machine code from just a couple of simple, clean, elegant high-level instructions. This encourages writing in a way that - despite all the "clean code" advantages - is very memory hungry.
If you must use templates, write your own or use ones designed for embedded use, pass fixed sizes as template parameters, and write a test program so you can test your template AND check your -S output to ensure the compiler is not generating horrible assembly code to instantiate it (a sketch follows this list).
Align your structures by hand, or use #pragma pack
For the same reason, use a centralized global data storage structure instead of scattered local static variables.
Intelligently balance usage of malloc()/new and static structures.
If you need a subset of functionality of given library, consider writing your own.
Unroll short loops: for a loop of only two or three iterations, the counter, compare and branch can take more code than simply writing the statements out (an illustration follows this list). Don't do that for longer ones.
Pack multiple files together to let the compiler inline short functions and perform various optimizations the linker can't.
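For the template point above, a sketch of passing a fixed size as a template parameter; the ring buffer is a made-up example.

    #include <stddef.h>

    // Capacity is a compile-time constant: no heap, storage lives in the object.
    template <typename T, size_t N>
    class RingBuffer {
    public:
        RingBuffer() : head_(0), count_(0) {}
        bool put(const T &v) {
            if (count_ == N) return false;
            buf_[(head_ + count_) % N] = v;
            ++count_;
            return true;
        }
        bool get(T &v) {
            if (count_ == 0) return false;
            v = buf_[head_];
            head_ = (head_ + 1) % N;
            --count_;
            return true;
        }
    private:
        T      buf_[N];
        size_t head_, count_;
    };

    static RingBuffer<unsigned char, 64> g_uart_rx;   // exactly 64 bytes of payload storage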
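And for the loop-unrolling point, an illustration under my own assumptions (transform() is a stand-in for the real per-element work):

    void transform(int &x) { x *= 2; }       // stand-in for the real per-element work

    void loop_version(int (&v)[3])
    {
        for (int i = 0; i < 3; ++i)          // counter, compare and branch in the codegen
            transform(v[i]);
    }

    void unrolled_version(int (&v)[3])
    {
        transform(v[0]);                     // for a trip count this small, often smaller
        transform(v[1]);
        transform(v[2]);
    }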
Don't be afraid to write 'little languages' inside your program. Sometimes a table of strings and an interpreter can get a LOT done. For instance, in a system I've worked on, we have a lot of internal tables, which have to be accessed in various ways (loop through, whatever). We've got an internal system of commands for referencing the tables that forms a sort of half-way language that's quite compact for what it gets done.
But, BE CAREFUL! Know that you are writing such things (I wrote one accidentally, myself), and DOCUMENT what you are doing. The original developers do NOT seem to have been conscious of what they were doing, so it's much harder to manage than it should be.
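A toy sketch of the "table of strings plus an interpreter" idea; the one-letter command set is invented for the example.

    #include <stddef.h>
    #include <stdio.h>

    // A tiny command "language": one letter per operation, optional operand after it.
    static const char *const kScript[] = {
        "S 10",     // set accumulator to 10
        "A 5",      // add 5
        "P",        // print the accumulator
    };

    void run_script(void)
    {
        int acc = 0;
        for (size_t i = 0; i < sizeof kScript / sizeof kScript[0]; ++i) {
            const char *line = kScript[i];
            int arg = 0;
            sscanf(line + 1, "%d", &arg);        // harmlessly leaves arg == 0 if absent
            switch (line[0]) {
            case 'S': acc = arg;           break;
            case 'A': acc += arg;          break;
            case 'P': printf("%d\n", acc); break;
            }
        }
    }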
Optimizing is a popular term but often technically incorrect. It literally means to make optimal. Such a condition is never actually achieved for either speed or size. We can simply take measures to move toward optimization.
Many (but not all) of the techniques used to move toward minimum time to a computing result sacrifice memory, and many (but not all) of the techniques used to move toward minimum memory requirements lengthen the time to result.
Reduction of memory requirements amounts to a fixed number of general techniques. It is difficult to find a specific technique that does not neatly fit into one or more of these. If you did all of them, you'd have something very close to the minimal space requirement for the program if not the absolute minimum possible. For a real application, it could take a team of experienced programmers a thousand years to do it.
This is a computer science view of the topic, not a developer's one.
For instance, packing a data structure is an effort that combines (3) and (9) above. Compressing data is a way to at least partly achieve (1) above. Reducing overhead of higher level programming constructs is a way to achieve some progress in (7) and (8). Dynamic allocation is an attempt to exploit a multitasking environment to employ (3). Compilation warnings, if turned on, can help with (5). Destructors attempt to assist with (6). Sockets, streams, and pipes can be used to accomplish (2). Simplifying a polynomial is a technique to gain ground in (8).
Understanding the meaning of the nine and the various ways to achieve them is the result of years of learning and checking the memory maps resulting from compilation. Embedded programmers often learn them more quickly because of the limited memory available.
Using the -Os option on a gnu compiler makes a request to the compiler to attempt to find patterns that can be transformed to accomplish these, but -Os is an aggregate flag that turns on a number of optimization features, each of which attempts to perform transformations to accomplish one of the nine tasks above.
Compiler directives can produce results without programmer effort, but automated processes in the compiler rarely correct problems created by lack of awareness in the writers of the code.
Bear in mind the implementation cost of some C++ features, such as virtual function tables and overloaded operators that create temporary objects.
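A small illustration of the second point (not from the original answer): a binary operator+ has to produce a temporary object, while the compound form updates in place.

    struct Vec3 {
        float x, y, z;
        Vec3 operator+(const Vec3 &o) const {      // returns a temporary Vec3
            Vec3 t = { x + o.x, y + o.y, z + o.z };
            return t;
        }
        Vec3 &operator+=(const Vec3 &o) {          // updates in place, no temporary
            x += o.x; y += o.y; z += o.z;
            return *this;
        }
    };

    void accumulate(Vec3 &total, const Vec3 &v)
    {
        // total = total + v;   // creates and then copies a temporary Vec3
        total += v;             // same result without the temporary
    }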
Along with what everyone else said, I'd just like to add: don't use virtual functions, because with virtual functions a vtable must be created, which can take up who knows how much space.
Also watch out for exceptions. With gcc, I don't believe there is a growing size for each try-catch block (except for two function calls per try-catch), but there is a fixed-size function which must be linked in, which could be wasting precious bytes.
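If vtable overhead matters, one alternative (my own sketch, not from the answer) is explicit dispatch on a small tag, which keeps objects free of hidden vtable pointers at the cost of a switch:

    enum SensorKind { KIND_TEMP, KIND_PRESSURE };

    struct Sensor {
        unsigned char kind;    // one byte of tag instead of a vtable pointer
        int           raw;
    };

    static int read_temp(const Sensor &s)     { return s.raw / 10; }
    static int read_pressure(const Sensor &s) { return s.raw * 2;  }

    int read_sensor(const Sensor &s)
    {
        switch (s.kind) {
        case KIND_TEMP:     return read_temp(s);
        case KIND_PRESSURE: return read_pressure(s);
        default:            return 0;
        }
    }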