Can an OpenCL kernel run on the CPU and GPU simultaneously?
Let's assume I have a computer with a multicore processor and a GPU. I would like to write an OpenCL program that runs on all cores of the platform. Is this possible, or do I need to choose a single device on which to run the kernel?
4 Answers
In theory, yes, you can: the CL API allows it. But the platform/implementation must support it, and I don't think most CL implementations do.
To do it, get the cl_device_id of the CPU device and of the GPU device, and create a context containing both devices using clCreateContext.
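A minimal sketch of that setup, assuming the first platform happens to expose both a CPU and a GPU device (error checking omitted):

```c
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, NULL);

    /* One CPU device and one GPU device from the same platform. */
    cl_device_id devices[2];
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &devices[0], NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &devices[1], NULL);

    /* A single context spanning both devices. */
    cl_context ctx = clCreateContext(NULL, 2, devices, NULL, NULL, NULL);

    /* ... create one command queue per device, build programs, etc. ... */

    clReleaseContext(ctx);
    return 0;
}
```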
No, you can't automagically span a kernel across both the CPU and GPU; it's either one or the other.
You could do it, but it will involve manually creating and managing two command queues (one for each device), as sketched below.
See this thread:
http://devforums.amd.com/devforum/messageview.cfm?catid=390&threadid=124591&messid=1072238&parentid=0&FTVAR_FORUMVIEWTMP=Single
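For what it's worth, a sketch of that two-queue setup, reusing the `ctx` and `devices` from the context example above (clCreateCommandQueue is the OpenCL 1.x call, deprecated in 2.0 in favour of clCreateCommandQueueWithProperties):

```c
/* One in-order queue per device; the host decides what goes where. */
cl_command_queue cpu_queue = clCreateCommandQueue(ctx, devices[0], 0, NULL);
cl_command_queue gpu_queue = clCreateCommandQueue(ctx, devices[1], 0, NULL);

/* Work is then enqueued and synchronized on each queue independently. */
```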
One context can only be for one platform. If your multi-device code needs to work across platforms (for example, an Intel-platform CPU OpenCL device and an NVidia GPU), then you need separate contexts.
However, if the GPU and CPU happen to be in the same platform, then yes, you could use one context.
If you are using multiple devices on the same platform (two identical GPUs, or two GPUs from the same manufacturer), then you can share the context, as long as they both come from a single clGetDeviceIDs call.
EDIT:
I should add that a GPU+CPU context doesn't mean any automatically managed CPU+GPU execution. Typically, it is best practice to let the driver allocate a memory buffer that can be DMA'd by the GPU for maximum performance. When the CPU and GPU are in the same context, you can share those buffers between the two devices.
You still have to split the workload up yourself. My favorite load-balancing technique is using events. Every n work items, attach an event object to a command (or enqueue a marker), and wait for the event you set n work items ago (the prior one). If you didn't have to wait, you should increase n on that device; if you did have to wait, you should decrease n. This will limit the queue depth; n will hover around the perfect depth to keep the device busy. You need to do it anyway to avoid GUI render starvation. Just keep n commands in each command queue (where the CPU and GPU have separate values of n) and it will divide perfectly.
You cannot span a kernel to multiple devices. But if the code you are running is not dependent on other results (i.e. processing blocks of 16 kB of data that each need heavy processing), you can launch the same kernel on the GPU and the CPU, and put some blocks on the GPU and some on the CPU.
That way it should boost performance.
You can do that by creating a context (cl_context) shared by the CPU and GPU, and two command queues.
This is not applicable to all kernels. Sometimes the kernel code applies to all of the input data and cannot be separated into parts or chunks.
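A sketch of that block-splitting approach, assuming the shared context and the two queues from the earlier answers; `split` (how many work items go to the GPU) is a hypothetical tuning knob you choose yourself:

```c
#include <CL/cl.h>

void launch_split(cl_command_queue cpu_queue, cl_command_queue gpu_queue,
                  cl_kernel kernel, size_t total, size_t split)
{
    size_t gpu_offset = 0,     gpu_size = split;
    size_t cpu_offset = split, cpu_size = total - split;

    /* Same kernel, same arguments: only the global offset and range
       differ, so get_global_id() picks out each device's blocks. */
    clEnqueueNDRangeKernel(gpu_queue, kernel, 1, &gpu_offset, &gpu_size,
                           NULL, 0, NULL, NULL);
    clEnqueueNDRangeKernel(cpu_queue, kernel, 1, &cpu_offset, &cpu_size,
                           NULL, 0, NULL, NULL);

    clFinish(gpu_queue);   /* wait for both halves to complete */
    clFinish(cpu_queue);
}
```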