How can I use the DSP to accelerate code on an OMAP?
I'm working on a video codec for the OMAP3430. I already have code written in C++, and I'm trying to modify/port certain parts of it to take advantage of the DSP (the SDK I have, the OMAP ZOOM3430 SDK, includes an additional DSP).

I tried to port a small for loop that runs over a very small amount of data (~250 bytes), but about 2M times on different data. However, the overhead of the communication between the CPU and the DSP is much greater than the gain (if there is any).

I assume this task is much like optimizing code for the GPU in a desktop computer. My question is: what kinds of code are worth porting? How do GPU programmers handle such tasks?
Edit:
- GPP application allocates a buffer of size 0x1000 bytes.
- GPP application invokes DSPProcessor_ReserveMemory to reserve a DSP virtual address space for each allocated buffer using a size that is 4K greater than the allocated buffer to account for automatic page alignment. The total reservation size must also be aligned along a 4K page boundary.
- GPP application invokes DSPProcessor_Map to map each allocated buffer to the DSP virtual address spaces reserved in the previous step.
- GPP application prepares a message to notify the DSP execute phase of the base address of the virtual address space that has been mapped to a buffer allocated on the GPP. The GPP application uses DSPNode_PutMessage to send the message to the DSP.
- GPP invokes memcpy to copy the data to be processed into the shared memory.
- GPP application invokes DSPProcessor_FlushMemory to ensure that the data cache has been flushed.
- GPP application prepares a message to notify the DSP execute phase that it has finished writing to the buffer and the DSP may now access the buffer. The message also contains the amount of data written to the buffer so that the DSP will know just how much data to copy. The GPP uses DSPNode_PutMessage to send the message to the DSP and then invokes DSPNode_GetMessage to wait to hear a message back from the DSP.
After these steps the DSP program starts executing, and the DSP notifies the GPP with a message when it finishes processing. Just as a test, I put no processing at all inside the DSP program; I simply send a "processing finished" message back to the GPP. Even this consumes a lot of time. Could that be because of the internal/external memory usage, or is it merely communication overhead?
3 Answers
The OMAP3430 does not have an onboard DSP; it has an IVA2+ video/audio decode engine hooked to the system bus, and the Cortex core has DSP-like SIMD instructions. The GPU on the OMAP3430 is a PowerVR SGX-based unit. While it does have programmable shaders, I don't believe there is any support for general-purpose programming à la CUDA or OpenCL. I could be wrong, but I've never heard of such support.

If you're using the onboard IVA2+ encode/decode engine, you need to use the proper libraries for that unit, and as far as I know it only supports specific codecs. Are you trying to write your own library for this module?

If you're using the Cortex's built-in DSP-ish SIMD instructions, post some code.
If your dev board has some extra DSP on it, what is the DSP and how is it connected to the OMAP?
As to the desktop GPU question: in the case of video decode, you use vendor-supplied function libraries to make calls to the hardware. There are several: VDPAU for Nvidia on Linux, and similar libraries on Windows (PureVideo HD, I think it's called). ATI also has both Linux and Windows libraries for its onboard decode engines; I don't know the names.
I don't know what time base you're transferring data in, but I know the TMS32064x listed on the spec sheet for the SDK has a very powerful DMA engine. (I'm assuming it's the original ZOOM OMAP34X MDK; it says it has a 64xx.) I would hope the OMAP has something similar; use it to its fullest advantage. I would recommend setting up "ping-pong" buffers in the internal RAM of the 64xx and using the SDRAM as shared memory, with the transfers handled by DMA. External RAM is going to be a bottleneck on any of the 6xxx-series parts, so keep whatever you can locked into internal memory to improve performance. Typically these parts can bus eight 32-bit words to the processor core once the data is in internal memory, but that varies from part to part based on what level of cache it allows you to map as direct-access RAM. Cost-sensitive parts from TI move the "mappable memory" farther away than some of the other chips. Also, all the manuals for these parts are available from TI as free PDF downloads. They even gave me free hard copies of the TMS320C6000 CPU and Instruction Set manual and many other books.
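A minimal sketch of the ping-pong scheme, using a synchronous `memcpy` as a stand-in for a real (asynchronous) DMA transfer; `CHUNK`, `dma_copy`, and `process` are all hypothetical:

```c
#include <string.h>

#define CHUNK 256  /* bytes per transfer (illustrative) */

/* Stand-in for kicking off a DMA transfer; a real implementation would
 * program the DMA controller, return immediately, and wait for
 * completion before touching the destination buffer. */
static void dma_copy(unsigned char *dst, const unsigned char *src, int n)
{
    memcpy(dst, src, n);
}

/* Work on one chunk held in internal RAM; here it just sums the bytes. */
static unsigned process(const unsigned char *buf, int n)
{
    unsigned sum = 0;
    for (int i = 0; i < n; i++)
        sum += buf[i];
    return sum;
}

/* Ping-pong loop: while chunk k is processed out of one internal
 * buffer, chunk k+1 is transferred into the other. */
unsigned pingpong_sum(const unsigned char *ext, int nchunks)
{
    static unsigned char ping[CHUNK], pong[CHUNK];
    unsigned char *bufs[2] = { ping, pong };
    unsigned total = 0;

    dma_copy(bufs[0], ext, CHUNK);              /* prime first buffer */
    for (int k = 0; k < nchunks; k++) {
        if (k + 1 < nchunks)                    /* prefetch next chunk */
            dma_copy(bufs[(k + 1) & 1], ext + (k + 1) * CHUNK, CHUNK);
        total += process(bufs[k & 1], CHUNK);   /* work on current one */
    }
    return total;
}
```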
As far as programming is concerned, you may need to use some of the "processor intrinsics" or inline assembly to optimize any math you are doing. For the 64xx, favor integer operations where possible, because it doesn't have a built-in floating-point core (those are in the 67xx series). If you look at the execution units and map your calculations so that the different units target different operations in a way that can occur in a single cycle, you will be able to achieve the best performance out of these parts. The instruction set manual lists the types of ops performed by each execution unit. If you can break your calculation up into dual data-flow sets and unroll the loops a bit, the compiler will be "nicer" to you when full optimization is on. This is because the processor is split into a left and a right side, with nearly identical execution units on either side.
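The dual-data-flow suggestion can be illustrated in plain C (no 64xx intrinsics): unrolling by two with independent accumulators gives the compiler two dependency chains it can schedule onto the left- and right-side execution units.

```c
/* Naive single-accumulator sum: one long dependency chain. */
int sum_simple(const short *x, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Unrolled by two with independent accumulators: the two chains are
 * independent and can issue to the two sides of the core in parallel. */
int sum_unrolled(const short *x, int n)
{
    int s0 = 0, s1 = 0;
    int i;
    for (i = 0; i + 1 < n; i += 2) {
        s0 += x[i];
        s1 += x[i + 1];
    }
    if (i < n)              /* odd tail element */
        s0 += x[i];
    return s0 + s1;
}
```

Both functions compute the same result; only the schedule the compiler can extract differs.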
Hope this helps.
From the measurements I did, one messaging cycle between the CPU and the DSP takes about 160 µs. I don't know whether this is because of the kernel I use or the bridge driver, but this is a very long time for a simple back-and-forth message.

It seems it is only reasonable to port an algorithm to the DSP if its total computational load is at least comparable to the time required for messaging, and if the algorithm is suitable for simultaneous computation on the CPU and DSP.