TI DSP programming - is C fast enough, or do I need assembler?
I am going to write some image processing programs for the Texas Instruments DaVinci platform. There are tools for programming it in C, but I wonder whether it is really possible to take full advantage of the DSP processor without resorting to assembly language. Do you know of any speed comparisons between programs written in C and in assembler on this DSP platform?
Comments (8)
I've used some other TI DSPs, and C was usually fine. The usual approach is to write everything in C first, then profile the code to see whether anything needs to be hand-optimised.
You can often do the optimisation in C too, by adjusting the C code until you get the assembly output you want. It's important to know how the DSP works and which approaches are faster or slower on it.
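One adjustment of this kind, offered here as an assumption rather than something the answer spells out, is to promise the compiler that buffers do not overlap by using C99 `restrict`. That can free it to reorder and overlap loads and stores across loop iterations. The function name `add_rows` and its parameters are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: saturating add of two rows of 8-bit pixels.
 * `restrict` tells the compiler that dst, a and b never alias, which
 * removes a memory dependency that can otherwise block loop pipelining. */
void add_rows(uint8_t *restrict dst,
              const uint8_t *restrict a,
              const uint8_t *restrict b,
              size_t n)
{
    assert(dst != NULL && a != NULL && b != NULL);

    for (size_t i = 0; i < n; ++i) {
        /* Widen before adding so the sum cannot wrap, then clamp to 8 bits. */
        const unsigned sum = (unsigned)a[i] + (unsigned)b[i];
        dst[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
}
```

Whether this actually changes the generated loop depends on the compiler and options, so it is worth comparing the assembly output before and after.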
The TI compiler for the C64x/C64x+ DSP on the OMAP3 supports what TI calls "intrinsic" function calls. They're not really function calls; they are just a way to tell the compiler which assembly opcode to use for an operation that might not be directly expressible in C. They are especially useful for leveraging the SIMD opcodes of the C64x/C64x+ DSP from C.
An example might be:
A = _add2(B, C);
This SIMD instruction adds the low and high 16 bits of B and C separately and stores the results in the low and high 16 bits of A. You can't express this directly in regular C, but you can do it with the intrinsics.
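To make the lane behaviour concrete, here is a portable C sketch of what `_add2` computes. The functions `add2_emulated` and `add2_demo` are hypothetical stand-ins written for illustration; on the TI compiler you would call the `_add2` intrinsic itself. The point to notice is that a carry out of the low half never reaches the high half, so a plain 32-bit `B + C` is not equivalent.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical portable model of the packed 16-bit add described above:
 * the two 16-bit halves are added independently, each wrapping modulo 2^16. */
uint32_t add2_emulated(uint32_t b, uint32_t c)
{
    const uint32_t lo = (b + c) & 0x0000FFFFu;                  /* low halves, carry dropped */
    const uint32_t hi = ((b >> 16) + (c >> 16)) & 0x0000FFFFu;  /* high halves */
    return (hi << 16) | lo;
}

/* Usage: compares the packed add with an ordinary 32-bit add. */
void add2_demo(void)
{
    /* No carry involved: both forms agree. */
    assert(add2_emulated(0x00010002u, 0x00030004u) == 0x00040006u);

    /* Low half overflows: the packed add drops the carry ... */
    assert(add2_emulated(0x0001FFFFu, 0x00010001u) == 0x00020000u);
    /* ... while a plain 32-bit add lets it spill into the high half. */
    assert(0x0001FFFFu + 0x00010001u == 0x00030000u);
}
```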
I have used C with intrinsics to get very close to what you could do with full-blown assembly language (within 5-10%). It is especially useful for video functions like filtering and motion compensation (_dotpsu4!).
I usually compile with the -al switch and look at the pipeline to identify which functional units are overloaded, then look at my intrinsics to see whether I can rebalance the loop (if I'm using too many S units, I might see whether I can change the opcode to use an M unit).
Also, it's helpful to remember that the C64x DSP has 64 registers, so load up the local variables and never assign the output of an instruction back into the same variable; it will hurt the compiler's ability to pipeline the loop properly.
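As a rough illustration of the filtering case, the sketch below models a signed-by-unsigned packed 8-bit dot product in portable C and uses it in a 4-tap filter. Everything here is an assumption made for illustration: `dotpsu4_emulated` and `fir4_row` are hypothetical names, the tap layout (tap 0 in the lowest byte) is arbitrary, and on the DSP the emulation would be replaced by the `_dotpsu4` intrinsic. Following the advice above, each intermediate value gets its own local variable instead of being written back into an existing one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical portable model: four signed bytes packed in `coef` times
 * four unsigned bytes packed in `pix`, with the four products summed. */
int32_t dotpsu4_emulated(uint32_t coef, uint32_t pix)
{
    int32_t acc = 0;
    for (int lane = 0; lane < 4; ++lane) {
        const int32_t raw = (int32_t)((coef >> (8 * lane)) & 0xFFu);
        const int32_t c   = (raw ^ 0x80) - 0x80;   /* sign-extend 8 bits */
        const int32_t p   = (int32_t)((pix >> (8 * lane)) & 0xFFu);
        acc += c * p;
    }
    return acc;
}

/* Hypothetical 4-tap filter: out[i] = sum over k of tap[k] * in[i + k].
 * `in` must hold at least n_out + 3 pixels. */
void fir4_row(int32_t *restrict out, const uint8_t *restrict in,
              uint32_t taps, size_t n_out)
{
    assert(out != NULL && in != NULL);

    for (size_t i = 0; i < n_out; ++i) {
        /* One local per value; nothing is assigned back into itself. */
        const uint32_t p0 = in[i];
        const uint32_t p1 = in[i + 1];
        const uint32_t p2 = in[i + 2];
        const uint32_t p3 = in[i + 3];
        const uint32_t packed = p0 | (p1 << 8) | (p2 << 16) | (p3 << 24);
        const int32_t  sum    = dotpsu4_emulated(taps, packed);
        out[i] = sum;
    }
}
```

For example, with taps 1, 2, -1, 3 packed as `0x03FF0201` and input pixels 10, 20, 30, 40, 50, the first two outputs are 140 and 190.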
Usually C is a good place to start. You can get the overall framework and algorithms shaken out quickly, and write most of the plumbing that moves the data around between the real math. Once that's in place and you're happy that your data structures are correct, you can look at it in a profiler and figure out which routines need to be squeezed by hand.
The C compiler (as far as I have tested) does not take full advantage of the architecture.
But you can get away with it, because the DSP might be fast enough for the operations you need to do.
So it comes down to testing and profiling your code to find the parts that must be sped up to get the system to work.
Depends on the C compiler and your definition of "fast enough". Standard C compilers often struggle to make efficient use of special DSP hardware, such as:
accessed in parallel
A simple speed comparison means nothing. Definitely use C if it is more convenient than assembler. You must measure the time cost in your own system: if the C code satisfies your speed requirements, you don't have to use assembler. If it is not fast enough, profile your code, find the most time-consuming parts (such as loops), and then optimise them!
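When no profiler is at hand, a crude measurement can be sketched with the standard C `clock()` function. This is a minimal sketch under assumptions: `seconds_per_call` and `count_call` are hypothetical names, and on the real target you would more likely rely on the toolchain's profiler or a hardware cycle counter.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical helper: average CPU seconds spent per call of fn(arg),
 * measured over `reps` back-to-back calls. */
double seconds_per_call(void (*fn)(void *), void *arg, unsigned reps)
{
    assert(fn != NULL && reps > 0u);

    const clock_t start = clock();
    for (unsigned r = 0; r < reps; ++r)
        fn(arg);
    const clock_t stop = clock();

    return (double)(stop - start) / (double)CLOCKS_PER_SEC / (double)reps;
}

/* Hypothetical placeholder workload: counts how often it was called.
 * Replace it with the routine you actually want to measure. */
void count_call(void *arg)
{
    unsigned *calls = arg;
    ++*calls;
}
```

Calling `seconds_per_call(count_call, &calls, 1000u)` runs the workload 1000 times and returns the average time per call; comparing such numbers between routines shows where optimisation effort is worth spending.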
I would stick to C until I know there is a hotspot that could benefit from assembly coding. This is the "profiling" method I use. You may be surprised to find that the way to speed up the code is often not a hotspot at all, but intermediate function calls that can be removed.
Compile with -O3 optimisation. It is very powerful.
If that is not good enough, you can further optimise the generated assembly code to your liking instead of writing everything in ASM from scratch.