What are the steps/strategies to analyze and improve the performance of an embedded system?
I will break this question down into sub-questions. I am confused about whether I should ask them separately or in one question, so I will just stick to one SO question.
What are generally the steps to analyze and improve performance of C applications?
Do these steps change if I am developing for an embedded system?
What tools are out there which can help me?
Recently I have been given a task to improve the performance of our product on ARM11 platform. I am relatively new to this field of embedded systems and need gurus here on SO to help me out.
6 Answers
Simply changing compilers can improve your C performance for the same source code by many times over. GCC has not necessarily gotten better for performance over the years; for some programs gcc 3.x produces much tighter code than 4.x. Back when I had access to the tools, ARM's compiler produced significantly better code than gcc, as much as 3 or 4 times faster. LLVM has caught up to GCC 4.x and I suspect will pass gcc in terms of performance and overall use for cross-compiling embedded code. Try different versions of gcc, 3.x and 4.x, if you are using gcc. Metaware's compiler and ARM's ADT ran circles around gcc 3.x; gcc 3.x will give gcc 4.x a run for its money with ARM code; for Thumb code gcc 4.x is better, and for Thumb2 (which doesn't apply to you) gcc 4.x is also better. Remember I have not said a word about changing a single line of code (yet).
LLVM is capable of full-program optimization, in addition to having infinitely more tuning knobs than gcc. Despite that, the code it generates (version 27) is only just catching up to the current gcc 4.x in terms of performance for the few programs I tried. And I didn't try the factorial number of optimization combinations (optimize on the compile step, different options for each file, or combine two files or three files or all files and optimize those bundles; my theory is to do no optimization on the C-to-bc steps, link all the bc together, then do a single optimization pass on the whole program, then allow the default optimization when llc takes it to the target).
By the same token, simply knowing your compiler and its optimizations can greatly improve the performance of the code without having to change any of it. You have an ARM11; are you compiling for the ARM11 or for generic ARM? You can gain a few to a dozen percent by telling the compiler specifically which architecture/family you have (armv6, for example) over the generic armv4 (ARM7) that is often chosen as the default. Know to use -O2, or -O3 if you are brave.
It is often not the case, but switching to Thumb mode can improve performance on specific platforms. It doesn't apply to you, but the Game Boy Advance is a perfect example, loaded with non-zero-wait-state 16-bit busses. Thumb has a handful of percent overhead because it takes more instructions to do the same thing, but by improving the fetch times and taking advantage of some of the sequential-read features of the GBA, Thumb code can run significantly faster than ARM code for the same source code.
Having an ARM11, you probably have an L1 and maybe an L2 cache. Are they on? Are they configured? Do you have an MMU, and is your heavily used memory cached? Or are you running zero-wait-state memory, so you don't need a cache and should turn it off? In addition to not realizing that you can take the same source code and make it run many times faster by changing compilers or options, folks often don't realize that when you use a cache, simply adding a single nop up to a few nops in your startup code (as a trick to adjust where code lands in memory by one, two, or a few words) can change your code's execution speed by as much as 10 to 20 percent. Where those cache-line reads hit in heavily used functions/loops makes a big difference. Even saving one cache-line read by adjusting where the code lands is noticeable (cutting it from 3 to 2 or from 2 to 1, for example).
Knowing your architecture, both the processor and your memory environment, is where the tuning, if any, would start. Most C libraries, if you are high-level enough to use one (I often don't use a C library, as I run without an operating system and with very limited resources), tune their C code and sometimes add some assembler to make bottleneck routines like memcpy much faster. If your programs operate on aligned 32-bit or, even better, 64-bit addresses, and you adjust for that, even if it means using a handful of bytes more memory for every structure/array/memcpy to be an integral multiple of 32 or 64 bits, you will see noticeable improvements (if your code uses structs or copies data in other ways). In addition to getting your structures (if you use them; I certainly don't with embedded code) size-aligned, even if you waste memory, and getting elements aligned, consider using 32-bit integers for every element instead of bytes or halfwords. Depending on your memory system this can help (it can hurt too, by the way). As with the GBA example above, for specific functions that, either by profiling or by intuition, you know are not implemented in a manner that takes advantage of your processor, platform, or libraries, you may want to turn to assembler, either from scratch or by compiling from C initially, then disassembling and hand-tuning. memcpy is a good example: you may know your system's memory performance and may choose to create your own memcpy specifically for aligned data, copying 64 or 128 or more bits per instruction.
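A minimal sketch of the kind of aligned, word-at-a-time copy described above (the name copy32 is illustrative, not a real library routine; it assumes both buffers are 32-bit aligned and the length is given in words):

```c
#include <stdint.h>
#include <stddef.h>

/* Copy nwords 32-bit words from src to dst. Both pointers must be
   32-bit aligned; each iteration moves a whole word instead of a byte,
   so the bus sees one quarter as many accesses as a byte-wise loop. */
static void copy32(uint32_t *dst, const uint32_t *src, size_t nwords)
{
    while (nwords--)
        *dst++ = *src++;
}
```

On a real target you would typically widen this further (LDM/STM or 64-bit loads on ARM) once profiling shows memcpy is a bottleneck.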
Likewise, mixing global and local variables can make a noticeable performance difference. Traditionally folks are told never to use globals, but in embedded this isn't necessarily true; it depends on how deeply embedded you are and how much tuning, speed, and the other factors you are interested in. This is a touchy subject and I may get flamed for it, so I will leave it at that.
The compiler has to burn and evict registers in order to make function calls, plus if you use local variables a stack frame may be required, so function calls are expensive. But at the same time, depending on the code within a function that has now grown in size by avoiding function calls, you may create the very problem you were trying to avoid: evicting registers to re-use them. Even a single line of C code can make the difference between all the variables in a function fitting in registers and having to start evicting a bunch of registers. For functions or segments of code where you know you need some performance gain, compile and disassemble (and look at register usage and how often it fetches memory or writes to memory). You can and will find places where you need to take a heavily used loop and make it its own function, even though the function call has a penalty, because by doing that the compiler can better optimize the loop and not evict/reuse registers, and you get an overall net gain. Even a single extra instruction in a loop that goes around hundreds of times is a measurable performance hit.
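A sketch of the loop-extraction idea above (the function and its name are hypothetical): the hot loop lives in a small function of its own, so only a few variables are live and the compiler can keep them all in registers instead of spilling.

```c
#include <stddef.h>

/* The extracted hot loop: with only buf, len, i, and sum live, every
   variable fits in a register, so the loop body runs with no spills
   even if the caller is a large function under register pressure. */
static unsigned checksum(const unsigned char *buf, size_t len)
{
    unsigned sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += buf[i];
    return sum;
}
```

Whether this wins over inlining the loop depends on the caller; as the answer says, compile, disassemble, and time both versions.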
Hopefully you already know absolutely not to compile for debug: turn all of the compile-for-debug options off. You may already know that code compiled for debug that runs without bugs doesn't mean it is debugged; compiling for debug and using debuggers hides bugs, leaving them as time bombs in your code for your final compile for release. Learn to always compile for release and test with the release version, both for performance and for finding bugs in your code.
Most instruction sets do not have a divide function. Avoid using divides or modulo in your code as much as humanly possible; they are performance killers. Naturally this is not the case for powers of two: to save the compiler, and to mentally avoid divides and modulos, try to use shifts and ANDs. Multiplies are easier and more often found in instruction sets, but are still costly. This is a good case for writing assembler to do your multiplies instead of letting the C compiler do it. The ARM multiply is 32 bit * 32 bit = 32 bit, so to do accurate math without overflowing there has to be extra C code wrapped around the multiply; if you already know you won't overflow, burn the registers for a function call and do the multiply in assembler (for the ARM).
Likewise, most instruction sets do not have a floating-point unit. With yours, you might have one, but even so, avoid float if at all possible. If you have to use float, that is a whole other Pandora's box of performance issues. Most folks don't see the performance problems with code as simple as this:
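The original snippet is not preserved in this copy of the answer; judging from the discussion of constants that follows, a representative example of the kind of innocent-looking float code meant here would be:

```c
/* Hypothetical reconstruction: 7.0 is a double constant, so b is
   promoted to double, multiplied in double precision (a library call
   on soft-float targets), then truncated back to float. Writing 7.0f
   keeps the whole operation in single precision. */
float scale(float b)
{
    return b * 7.0;   /* double math; b * 7.0f would stay float */
}
```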
The rest of the problem is not understanding floating-point accuracy, and how good or bad the C libraries are at just trying to get your constants into floating-point form. Again, float is a whole other long discussion on performance problems.
I am a product of Michael Abrash (I actually have a print copy of Zen of Assembly Language), and the bottom line is: time your code. Come up with an accurate way to time the code. You may think you know where the bottlenecks are, and you may think you know your architecture, but by trying different things, even if you think they are wrong, and timing them, you may find, and eventually have to figure out, the error in your thinking. Adding nops to start.S as a final tuning step is a good example of this: all the other work you have done for performance can be instantly erased by not having a good alignment with the cache. This also means re-arranging functions within your source code so that they land in different places in the binary image. I have seen 10 to 20 percent swings of speed increase and decrease as a result of cache-line alignments.
Code Review:
What are good code review techniques?
Static and dynamic analysis of the code.
Tools for static analysis: Sparrow, Prevent, Klocwork
Tools for dynamic analysis: Valgrind, Purify
Gprof allows you to learn where your program spent its time and which functions called which other functions while it was executing.
The steps are the same.
Apart from what is listed in point 1, there are tools like memcheck, etc.
There is a big list here, based on platform.
Phew!! Quite a big question!
As well as other static code analysers mentioned here there is a fairly cheap version called PC-Lint which has been around for ages. Sometimes throws up lots of errors and warnings for one error but by the end of it you'll be happy and know waaaaay more about C/C++ because of it.
With all code analysers some of the issues may be more structural to the code so best to start analysing it from day 1 of coding; running analysis on old software may swamp you with issues which may take a while to untangle, best to keep it clean from the beginning.
But code analysers will not catch all logical errors, i.e. when the code doesn't do what you want it to do! These are best caught by code reviews first, then testing. Performance is often improved by trying to keep the algorithms as simple as possible, keeping instructions in loops tight, possibly unrolling loops (your compiler's optimisations may do this), and using fast caches when accessing data which is slow to get.
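A hand-unrolled loop of the kind mentioned above might look like this sketch (it assumes the length is a multiple of 4; at -O2/-O3 many compilers perform this transformation themselves):

```c
#include <stddef.h>

/* 4x-unrolled sum: four independent accumulators cut the loop-control
   overhead per element and let the CPU overlap the additions.
   Precondition: len is a multiple of 4. */
static int sum4(const int *a, size_t len)
{
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (size_t i = 0; i < len; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return s0 + s1 + s2 + s3;
}
```

As always, time the unrolled version against the plain loop on your target before keeping it; unrolling also grows code size, which can hurt in a small instruction cache.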
Code reviews can raise a lot of issues from lots of other peoples eyes looking at it. Don't get too many people, try to get 3 other people if possible, sometimes junior developers ask the most insightful questions like, "why are we doing this?".
Testing can be roughly split into two sections, automated and manual. Automated testing requires effort producing test handlers for functions/units, but once written they can be run again and again very quickly. Manual testing requires planning, self-discipline to perform them all as required, imagination to think up scenarios that may impair performance, and you have to be observant (you may have passed the test, but the 'scope trace has a bit of an anomaly before/after the test).
Performance analysis can be different on embedded systems than on application systems; with the very broad brush that "embedded" now covers, it depends how hardware-centric you are. It can be done using profilers; if you want a cheaper and more cheerful method, then use test output pins to measure sections of code, or measure them with breakpoints on simulators that come with the development environment.
Make sure that not just a typical length of task is measured but also a maximum, as that is where one task may start impeding on other tasks and your scheduled tasks are not completed in time.
Simulators on the IDEs, static analysis tools, dynamic analysis tools, but most of all you and other humans getting the requirements right, decent reviewing (of code and testing) and thorough testing (automated and manual).
Good luck!
My experiences:

#pragma GCC optimize("O3")

Mark the sections to optimize with this pragma, or compile them separately.
This is a difficult question to answer briefly, since various techniques have been proposed (such as flowcharts and state diagrams), so you can take a look at some titles:
ARM System-on-Chip Architecture, 2nd Edition -- Steve Furber
ARM System Developer's Guide - Designing and Optimizing System Software -- Andrew N. Sloss, Dominic Symes, Chris Wright & John Rayfield
The Definitive Guide to the ARM Cortex-M3 -- Joseph Yiu
C Programming for Embedded Systems -- Kirk Zurell
Embedded C -- Michael J. Pont
Programming Embedded Systems in C and C++ -- Michael Barr
An Embedded Software Primer -- David E. Simon
Embedded Microprocessor Systems, 3rd Edition -- Stuart Ball
Global Specification and Validation of Embedded Systems - Integrating Heterogeneous Components -- G. Nicolescu & A. A. Jerraya
Embedded Systems: Modeling, Technology and Applications -- Gunter Hommel & Sheng Huanye
Embedded Systems and Computer Architecture -- Graham Wilson
Designing Embedded Hardware -- John Catsoulis
You have to use a profiler. It will help you identify your application's bottleneck(s). Then focus on improving the functions you spend the most time in and the ones you call the most. Repeat this procedure until you're satisfied with your application performance.
No, they don't.
Depending on the platform you're developing on:
Windows: AMD Code Analyst, VTune, Sleepy
Linux: valgrind / callgrind / cachegrind
Mac: the Xcode profiler is quite good.
Try to find a profiler for the architecture you actually work on.