减少 CPU 到 GPU 数据传输延迟的技术

Published 2024-11-17 11:16:22 · 535 words · 5 views · 0 comments

I've been looking into ways to reduce the latency caused by transferring data back and forth between the CPU and GPU. When I first started using CUDA I did notice that data transfer between the CPU and GPU took a few seconds, but I didn't really care, because this isn't a real concern for the small programs I've been writing. In fact, the latency probably isn't much of a problem for the vast majority of programs that utilize GPUs, video games included, because they're still a lot faster than if they had run on the CPU.

However, I'm a bit of an HPC enthusiast, and when I saw the massive discrepancy between Tianhe-I's theoretical peak FLOPS and its actual measured LINPACK performance, I became concerned about the direction of my studies and began to wonder whether I'm on the right career path.

Using pinned (page-locked) memory via the cudaHostAlloc() function is one method of reducing latency (quite effective), but are there other techniques I'm not aware of? To be clear, I'm talking about optimizing the code, not the hardware itself (that's NVIDIA's and AMD's job).
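As a minimal sketch of the pinned-memory approach mentioned above (buffer sizes and names here are illustrative, and error handling is abbreviated):

```cuda
#include <cuda_runtime.h>

int main() {
    const size_t N = 1 << 24;                 // illustrative size (~16M floats)
    float *h_pinned = nullptr, *d_buf = nullptr;

    // Page-locked host buffer: the driver can DMA directly from it,
    // avoiding the staging copy that pageable memory requires.
    cudaHostAlloc(&h_pinned, N * sizeof(float), cudaHostAllocDefault);
    cudaMalloc(&d_buf, N * sizeof(float));

    for (size_t i = 0; i < N; ++i) h_pinned[i] = 1.0f;

    // Transfers from pinned memory reach full bus bandwidth.
    cudaMemcpy(d_buf, h_pinned, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(h_pinned, d_buf, N * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_buf);
    cudaFreeHost(h_pinned);                   // pinned memory has its own free
    return 0;
}
```

Note that cudaFreeHost(), not free(), must be used to release memory obtained from cudaHostAlloc().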

Just as a side question: I'm aware that Dell and HP sell Tesla servers. I'm curious how well a GPU can be leveraged in a database application, where you need constant reads from the drive (HDD or SSD), an operation only the CPU can perform.

Comments (3)

普罗旺斯的薰衣草 2024-11-24 11:16:22

There are a few ways to address CPU-GPU communication overhead - I hope that's what you mean by latency, and not the latency of the transfer itself. Note that I deliberately used the term "address" instead of "reduce", as you do not necessarily need to reduce the latency if you can hide it. Also note that I am much more familiar with CUDA, so below I refer only to CUDA, but some of these features are also available in OpenCL.

As you mentioned, page-locked memory has the very purpose of increasing transfer bandwidth. Additionally, one can map page-locked host memory into the GPU's address space, a mechanism that enables GPU kernels to access host-allocated data directly, without an extra data transfer. This is called zero-copy transfer, and it is useful when the data is read/written only once, accompanied by a substantial amount of computation, and for GPUs with no separate memory (mobile/integrated parts). However, if the kernel accessing the zero-copied data is not strongly compute-bound, so the latency of the data accesses cannot be hidden, page-locked but unmapped memory will be more efficient. Additionally, zero-copy still works when the data does not fit into GPU memory.
Note that an excessive amount of page-locked memory can cause serious slowdown on the CPU side.
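A minimal sketch of the zero-copy mechanism described above (the kernel and sizes are illustrative):

```cuda
#include <cuda_runtime.h>

// Each access to data goes over the bus, so this only pays off when the
// data is touched once and the kernel does substantial computation.
__global__ void scale(float *data, float factor, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const size_t N = 1 << 20;
    float *h_mapped = nullptr, *d_alias = nullptr;

    cudaSetDeviceFlags(cudaDeviceMapHost);            // allow mapping host memory
    cudaHostAlloc(&h_mapped, N * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer(&d_alias, h_mapped, 0);  // device view of the same buffer

    for (size_t i = 0; i < N; ++i) h_mapped[i] = 1.0f;

    // No cudaMemcpy: the kernel reads and writes host memory directly.
    scale<<<(N + 255) / 256, 256>>>(d_alias, 2.0f, N);
    cudaDeviceSynchronize();                          // results now visible in h_mapped

    cudaFreeHost(h_mapped);
    return 0;
}
```

On devices with unified addressing, the host pointer can often be passed to the kernel directly, but cudaHostGetDevicePointer() is the portable route.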

Approaching the problem from a different angle, as tkerwin mentioned, asynchronous transfer (with respect to the CPU thread talking to the GPU) is the key to hiding CPU-GPU transfer latency by overlapping computation on the CPU with the transfer. This can be achieved with cudaMemcpyAsync(), as well as by using zero-copy with asynchronous kernel execution.
One can take this even further by using multiple streams to overlap transfers with kernel execution. Note that stream scheduling might need special attention for good overlap; Tesla and Quadro cards have dual DMA engines, which enable simultaneous data transfer to and from the GPU.
Additionally, with CUDA 4.0 it became easier to use a single GPU from multiple CPU threads, so in multi-threaded CPU code each thread can send its own data to the GPU and launch kernels more easily.
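The multi-stream overlap described above can be sketched as follows (chunk sizes and the kernel are illustrative; pinned host memory is required for the copies to be truly asynchronous):

```cuda
#include <cuda_runtime.h>

__global__ void process(float *d, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

int main() {
    const int nStreams = 2;
    const size_t chunk = 1 << 20;
    float *h = nullptr, *d = nullptr;
    cudaHostAlloc(&h, nStreams * chunk * sizeof(float), cudaHostAllocDefault);
    cudaMalloc(&d, nStreams * chunk * sizeof(float));

    cudaStream_t s[nStreams];
    for (int i = 0; i < nStreams; ++i) cudaStreamCreate(&s[i]);

    for (int i = 0; i < nStreams; ++i) {
        size_t off = i * chunk;
        // Copy and kernel for chunk i are queued on stream i, so the
        // hardware can overlap stream 0's kernel with stream 1's copy.
        cudaMemcpyAsync(d + off, h + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, s[i]);
        process<<<(chunk + 255) / 256, 256, 0, s[i]>>>(d + off, chunk);
        cudaMemcpyAsync(h + off, d + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, s[i]);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < nStreams; ++i) cudaStreamDestroy(s[i]);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```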

Finally, GMAC implements an asymmetric shared memory model for CUDA. One of its most interesting features is the coherency models it provides, in particular lazy and rolling updates, which enable transferring only the data modified on the CPU, in a blocked fashion.
For more details see the paper: Gelado et al., "An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems".

简单爱 2024-11-24 11:16:22

You can use cudaMemcpyAsync() to overlap work you're doing on the CPU with a memory transfer. This won't lower the latency of the transfer itself, but it can improve the overall performance of an algorithm. There is some information about it in the CUDA C Best Practices Guide.
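A minimal sketch of that overlap (cpu_work and the buffer sizes are illustrative stand-ins; the host buffer must be pinned for the copy to be truly asynchronous):

```cuda
#include <cuda_runtime.h>

static double cpu_work() {               // stand-in for independent CPU work
    double acc = 0.0;
    for (int i = 0; i < 1000000; ++i) acc += i * 0.5;
    return acc;
}

int main() {
    const size_t N = 1 << 22;
    float *h = nullptr, *d = nullptr;
    cudaHostAlloc(&h, N * sizeof(float), cudaHostAllocDefault);
    cudaMalloc(&d, N * sizeof(float));

    // Returns immediately; the DMA engine performs the copy in the background.
    cudaMemcpyAsync(d, h, N * sizeof(float), cudaMemcpyHostToDevice, 0);
    double r = cpu_work();               // runs while the copy is in flight
    cudaStreamSynchronize(0);            // block only once the data is needed
    (void)r;

    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```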

兰花执着 2024-11-24 11:16:22

If latency is an issue, it might be worth looking into the trade-offs you can make with the AMD Fusion architecture. The latency you get is drastically reduced and can in some cases be faster than CPU transfers from RAM. However, you do take a performance hit from using the slimmed-down, non-discrete GPU.
