Does AMD's OpenCL provide functionality similar to CUDA's GPUDirect?

NVIDIA offers GPUDirect to reduce memory transfer overheads. I'm wondering if there is a similar concept for AMD/ATI? Specifically:

  1. Do AMD GPUs avoid the second memory transfer when interfacing with network cards, as described here? In case the graphic is lost at some point, here is a description of the impact of GPUDirect on getting data from a GPU on one machine to be transferred across a network interface: With GPUDirect, GPU memory goes to Host memory then straight to the network interface card. Without GPUDirect, GPU memory goes to Host memory in one address space, then the CPU has to do a copy to get the memory into another Host memory address space, and only then can it go out to the network card.

  2. Do AMD GPUs allow P2P memory transfers when two GPUs are shared on the same PCIe bus, as described here? In case the graphic is lost at some point, here is a description of the impact of GPUDirect on transferring data between GPUs on the same PCIe bus: With GPUDirect, data can move directly between GPUs on the same PCIe bus, without touching host memory. Without GPUDirect, data always has to go back to the host before it can get to another GPU, regardless of where that GPU is located.

Edit: BTW, I'm not entirely sure how much of GPUDirect is vaporware and how much of it is actually useful. I've never actually heard of a GPU programmer using it for something real. Thoughts on this are welcome too.

3 Answers

┈┾☆殇 2025-01-12 20:59:44

Although this question is pretty old, I would like to add my answer as I believe the current information here is incomplete.

As stated in the answer by @Ani, you can allocate host memory using CL_MEM_ALLOC_HOST_PTR, and you will most likely get pinned host memory that avoids the second copy, depending on the implementation. For instance, the NVidia OpenCL Best Practices Guide states:

OpenCL applications do not have direct control over whether memory objects are
allocated in pinned memory or not, but they can create objects using the
CL_MEM_ALLOC_HOST_PTR flag and such objects are likely to be allocated in
pinned memory by the driver for best performance
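
To make the pattern concrete, here is a minimal sketch using only standard OpenCL calls, assuming an existing context, queue, and device buffer (the function and variable names are mine, not from any SDK). Whether the staging buffer actually ends up pinned remains up to the driver, as the quote says:

    /* Minimal sketch: request (likely) pinned host memory in OpenCL and
       use it as a staging area for a DMA-friendly transfer. Error
       handling is abbreviated for brevity. */
    #include <CL/cl.h>
    #include <string.h>

    void upload_via_pinned(cl_context ctx, cl_command_queue q,
                           cl_mem device_buf, const float *src, size_t n)
    {
        cl_int err;
        size_t bytes = n * sizeof(float);

        /* CL_MEM_ALLOC_HOST_PTR hints that the driver should place this
           buffer in pinned host memory. */
        cl_mem staging = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR,
                                        bytes, NULL, &err);

        /* Map the buffer to get a host pointer into the (hopefully
           pinned) region and fill it. */
        float *p = (float *)clEnqueueMapBuffer(q, staging, CL_TRUE,
                                               CL_MAP_WRITE, 0, bytes,
                                               0, NULL, NULL, &err);
        memcpy(p, src, bytes);

        /* A transfer from pinned memory can be performed by the DMA
           engine directly, avoiding the extra CPU copy the question
           describes. */
        clEnqueueWriteBuffer(q, device_buf, CL_FALSE, 0, bytes, p,
                             0, NULL, NULL);
        clFinish(q);

        clEnqueueUnmapMemObject(q, staging, p, 0, NULL, NULL);
        clReleaseMemObject(staging);
    }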

The thing I find missing from the previous answers is the fact that AMD offers DirectGMA technology. This technology enables you to transfer data directly between the GPU and any other peripheral on the PCI bus (including other GPUs) without having to go through system memory. It is more similar to NVidia's GPUDirect RDMA (which is not available on all platforms).

In order to use this technology you must:

  • Have a compatible AMD GPU (not all of them support DirectGMA). You can use the OpenCL, DirectX, or OpenGL extensions provided by AMD.

  • Have the peripheral driver (network card, video capture card, etc.) either expose a physical address that the GPU DMA engine can read from and write to, or be able to program the peripheral's DMA engine to transfer data to/from the memory exposed by the GPU.

I used this technology to transfer data directly from video capture devices to the GPU memory and from the GPU memory to a proprietary FPGA. Both cases were very efficient and did not involve any extra copying.

Interfacing OpenCL with PCIe devices
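
On the OpenCL side, DirectGMA is exposed through the cl_amd_bus_addressable_memory extension. The following is a hedged sketch of the GPU half of such a setup; the flag, struct, and function names follow the published extension spec, but this is a reconstruction from documentation rather than tested code, so verify against the cl_ext.h shipped with your driver:

    /* Sketch of exposing a GPU buffer on the PCIe bus via the
       cl_amd_bus_addressable_memory extension (the OpenCL face of
       DirectGMA). Names follow the extension spec; verify locally. */
    #include <CL/cl.h>
    #include <CL/cl_ext.h>  /* CL_MEM_BUS_ADDRESSABLE_AMD, cl_bus_address_amd */

    typedef cl_int (CL_API_CALL *pfnMakeBuffersResidentAMD)(
        cl_command_queue, cl_uint, cl_mem *, cl_bool,
        cl_bus_address_amd *, cl_uint, const cl_event *, cl_event *);

    cl_ulong expose_buffer_on_bus(cl_platform_id platform, cl_context ctx,
                                  cl_command_queue q, size_t bytes)
    {
        cl_int err;

        /* Allocate GPU memory that can be made visible on the bus. */
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_BUS_ADDRESSABLE_AMD,
                                    bytes, NULL, &err);

        /* Extension entry points are resolved at run time. */
        pfnMakeBuffersResidentAMD makeResident =
            (pfnMakeBuffersResidentAMD)
            clGetExtensionFunctionAddressForPlatform(
                platform, "clEnqueueMakeBuffersResidentAMD");

        /* Pin the buffer and obtain its physical bus address; a
           peripheral's DMA engine can then write to it directly,
           bypassing system memory. */
        cl_bus_address_amd addr;
        makeResident(q, 1, &buf, CL_TRUE, &addr, 0, NULL, NULL);

        return addr.surface_bus_address;
    }

The reverse direction (the GPU writing into a peripheral's memory) uses, per the same extension spec, CL_MEM_EXTERNAL_PHYSICAL_AMD with a bus address supplied by the peripheral's driver.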

樱花落人离去 2025-01-12 20:59:44

I think you may be looking for the CL_MEM_ALLOC_HOST_PTR flag in clCreateBuffer. While the OpenCL specification states that this flag "specifies that the application wants the OpenCL implementation to allocate memory from host accessible memory", it is uncertain what AMD's implementation (or other implementations) might do with it.

Here's an informative thread on the topic: http://www.khronos.org/message_boards/viewtopic.php?f=28&t=2440

Hope this helps.

Edit: I do know that nVidia's OpenCL SDK implements this as allocation in pinned/page-locked memory. I am fairly certain this is what AMD's OpenCL SDK does when running on the GPU.

说好的呢 2025-01-12 20:59:44

As pointed out by @ananthonline and @harrism, many of the features of GPUDirect have no direct equivalent in OpenCL. However, if you are trying to reduce memory transfer overhead, as mentioned in the first sentence of your question, zero copy memory might help. Normally, when an application creates a buffer on the GPU, the contents of the buffer are copied from CPU memory to GPU memory en masse. With zero copy memory, there is no upfront copy; instead, data is copied over as it is accessed by the GPU kernel.

Zero copy does not make sense for all applications. Here is advice from the AMD APP OpenCL Programming Guide on when to use it:

Zero copy host resident memory objects can boost performance when host
memory is accessed by the device in a sparse manner or when a large
host memory buffer is shared between multiple devices and the copies
are too expensive. When choosing this, the cost of the transfer must
be greater than the extra cost of the slower accesses.

Table 4.3 of the Programming Guide describes which flags to pass to clCreateBuffer to take advantage of zero copy (either CL_MEM_ALLOC_HOST_PTR or CL_MEM_USE_PERSISTENT_MEM_AMD, depending on whether you want device-accessible host memory or host-accessible device memory). Note that zero copy support is dependent on both the OS and the hardware; it appears to not be supported under Linux or older versions of Windows.
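
As an illustration of the host-resident flavor, here is a minimal sketch, assuming a valid context, queue, and a kernel whose first argument is a global float pointer (all names are mine); the device-resident CL_MEM_USE_PERSISTENT_MEM_AMD variant is analogous but exposes device memory to the host instead:

    /* Sketch of the zero-copy pattern from Table 4.3: a host-resident
       buffer the device reads in place over PCIe, with no bulk upload.
       Error handling is omitted for brevity. */
    #include <CL/cl.h>

    void run_zero_copy(cl_context ctx, cl_command_queue q,
                       cl_kernel kernel, size_t n)
    {
        cl_int err;
        size_t bytes = n * sizeof(float);

        /* Host-resident, device-accessible allocation. */
        cl_mem zbuf = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR |
                                     CL_MEM_READ_ONLY, bytes, NULL, &err);

        /* The host writes through a mapping; for a zero-copy buffer the
           map/unmap is cheap because no bulk transfer is scheduled. */
        float *p = (float *)clEnqueueMapBuffer(q, zbuf, CL_TRUE,
                                               CL_MAP_WRITE, 0, bytes,
                                               0, NULL, NULL, &err);
        for (size_t i = 0; i < n; ++i)
            p[i] = (float)i;
        clEnqueueUnmapMemObject(q, zbuf, p, 0, NULL, NULL);

        /* The kernel reads the data in place; each access crosses the
           PCIe bus, which is why the guide recommends this only for
           sparse access patterns. */
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &zbuf);
        clEnqueueNDRangeKernel(q, kernel, 1, NULL, &n, NULL,
                               0, NULL, NULL);
        clFinish(q);

        clReleaseMemObject(zbuf);
    }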

AMD APP OpenCL Programming Guide: http://developer.amd.com/sdks/AMDAPPSDK/assets/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf
