CUDA: cudaMemcpy and cudaMalloc

Posted on 2024-11-05 03:34:18

I always read that allocating and transferring data from the CPU to the GPU is slow. Is this because cudaMalloc is slow? Is it because cudaMemcpy is slow? Or is it because both of them are slow?

Comments (2)

我是有多爱你 2024-11-12 03:34:18

It is mostly tied to two things. The first is the speed of the PCI Express bus between the card and the CPU. The other is the way these functions operate. I think the new CUDA 4 has better support for memory allocation (standard or pinned) and a way to access memory transparently across the bus.

Let's face it: at some point you'll need to get data from point A to point B to compute something. The best way to handle this is either to have a really large computation going on, so that the transfer cost is amortized, or to use CUDA streams to overlap transfer and computation on the GPU.
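
To make the streams idea concrete, here is a minimal sketch of overlapping transfers with computation. It assumes a trivial kernel (`scale`, made up for illustration) and splits the data into chunks, one stream per chunk; the function name `run_overlapped` and the chunking scheme are my own assumptions, not anything from the question. The host buffer is allocated with cudaMallocHost because cudaMemcpyAsync only overlaps with kernels when the host memory is pinned.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

// Hypothetical kernel for illustration: doubles every element in place.
__global__ void scale(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Fills a pinned buffer with 1.0f, processes it in `chunks` pieces (one stream
// each), and returns the number of wrong results, or -1 on a CUDA error.
int run_overlapped(int n, int chunks) {
    float* h = nullptr;
    float* d = nullptr;
    if (cudaMallocHost(&h, n * sizeof(float)) != cudaSuccess) return -1;  // pinned host memory
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) { cudaFreeHost(h); return -1; }
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    std::vector<cudaStream_t> streams(chunks);
    for (auto& s : streams) cudaStreamCreate(&s);

    int chunk = (n + chunks - 1) / chunks;
    for (int c = 0; c < chunks; ++c) {
        int offset = c * chunk;
        int count = std::min(chunk, n - offset);
        if (count <= 0) break;
        size_t bytes = count * sizeof(float);
        // Work queued on one stream runs in order; different streams may overlap.
        cudaMemcpyAsync(d + offset, h + offset, bytes, cudaMemcpyHostToDevice, streams[c]);
        scale<<<(count + 255) / 256, 256, 0, streams[c]>>>(d + offset, count);
        cudaMemcpyAsync(h + offset, d + offset, bytes, cudaMemcpyDeviceToHost, streams[c]);
    }

    int bad = 0;
    if (cudaDeviceSynchronize() != cudaSuccess) {
        bad = -1;
    } else {
        for (int i = 0; i < n; ++i) if (h[i] != 2.0f) ++bad;
    }
    for (auto& s : streams) cudaStreamDestroy(s);
    cudaFree(d);
    cudaFreeHost(h);
    return bad;
}
```

Calling `run_overlapped(1 << 20, 4)` from a `main` should return 0 if everything worked. Whether the copies really overlap with the kernels depends on the device having a free copy engine, so it is worth checking with a profiler rather than assuming it.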

暮倦 2024-11-12 03:34:18

In most applications, you should call cudaMalloc once at the beginning and then not call it again. Thus, the bottleneck is really cudaMemcpy.
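
If you want to see this on your own machine, a rough sketch like the following times a cudaMalloc/cudaFree pair per iteration against repeated cudaMemcpy calls into a buffer that is allocated once. The `Timings` struct and `measure` function are names I made up for illustration, and the wall-clock timing here is deliberately crude, so treat the numbers as indicative only.

```cuda
#include <cuda_runtime.h>
#include <chrono>
#include <vector>

// Hypothetical result type: average milliseconds per call.
struct Timings { double malloc_ms; double memcpy_ms; bool ok; };

Timings measure(size_t bytes, int iters) {
    using Clock = std::chrono::steady_clock;
    using Ms = std::chrono::duration<double, std::milli>;
    Timings t{0.0, 0.0, true};
    std::vector<char> host(bytes, 1);  // ordinary pageable host memory

    cudaFree(nullptr);  // force context creation so it is not billed to the first cudaMalloc

    // Pattern to avoid: allocating and freeing inside the loop.
    auto t0 = Clock::now();
    for (int i = 0; i < iters; ++i) {
        void* tmp = nullptr;
        if (cudaMalloc(&tmp, bytes) != cudaSuccess) t.ok = false;
        cudaFree(tmp);
    }
    auto t1 = Clock::now();

    // Recommended pattern: allocate once, reuse the buffer for every copy.
    void* d = nullptr;
    if (cudaMalloc(&d, bytes) != cudaSuccess) { t.ok = false; return t; }
    auto t2 = Clock::now();
    for (int i = 0; i < iters; ++i) {
        if (cudaMemcpy(d, host.data(), bytes, cudaMemcpyHostToDevice) != cudaSuccess) t.ok = false;
    }
    cudaDeviceSynchronize();  // make sure every copy has finished before stopping the clock
    auto t3 = Clock::now();
    cudaFree(d);

    t.malloc_ms = Ms(t1 - t0).count() / iters;  // one cudaMalloc + cudaFree pair
    t.memcpy_ms = Ms(t3 - t2).count() / iters;  // one host-to-device copy
    return t;
}
```

For example, `measure(64u << 20, 20)` compares the two for a 64 MiB buffer. With the allocation hoisted out of the loop, only the copy cost remains per iteration, which is the point made above.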

This is due to physical limitations. A standard PCI-E 2.0 x16 link gives you 8 GB/s in theory but typically 5-6 GB/s in practice. Compare this with even a mid-range Fermi card like the GTX 460, which has 80+ GB/s of memory bandwidth on the device. You are in effect taking an order-of-magnitude hit in memory bandwidth, and your data transfer times increase accordingly.
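
As a back-of-the-envelope illustration of those figures, the helper below (host-only code, with the made-up name `transfer_ms`) converts a buffer size and a bandwidth into a transfer time. It assumes 1 GB = 1e9 bytes and ignores latency and per-call overhead, so it is a lower bound rather than a prediction.

```cuda
#include <cstdio>

// Milliseconds needed to move `bytes` at `gb_per_s` (1 GB = 1e9 bytes),
// ignoring latency and per-call overhead.
double transfer_ms(double bytes, double gb_per_s) {
    return bytes / (gb_per_s * 1e9) * 1e3;
}

// Prints the same buffer moved over PCI-E versus read from device memory,
// using the bandwidth figures quoted in the answer above.
void print_comparison(double bytes) {
    std::printf("PCI-E 2.0 x16 (6 GB/s practical): %.2f ms\n", transfer_ms(bytes, 6.0));
    std::printf("GTX 460 device memory (80 GB/s):  %.2f ms\n", transfer_ms(bytes, 80.0));
}
```

For a 600 MB buffer this works out to roughly 100 ms over the bus against roughly 7.5 ms on the device, which is the order-of-magnitude gap in question.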

GPGPUs are supposed to be supercomputers, and I believe Seymour Cray (the supercomputer guy) said, "a supercomputer turns compute-bound problems into I/O-bound problems". Thus, optimizing data transfers is everything.

In my personal experience, iterative algorithms show by far the best improvements from porting to GPGPU (2-3 orders of magnitude), because you can eliminate transfer time by keeping everything in situ on the GPU.
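
A minimal sketch of that pattern, assuming a toy update rule (x becomes 0.5 * x + 1, chosen only because it is easy to check) and the made-up names `step` and `iterate_on_device`: the data crosses the bus once on the way in and once on the way out, no matter how many iterations run in between.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical update rule for illustration: x <- 0.5 * x + 1 (fixed point at 2).
__global__ void step(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = 0.5f * x[i] + 1.0f;
}

// Runs `iters` updates entirely on the device and writes the result back into
// `host`. Returns true on success.
bool iterate_on_device(std::vector<float>& host, int iters) {
    if (host.empty()) return true;
    int n = static_cast<int>(host.size());
    size_t bytes = host.size() * sizeof(float);

    float* d = nullptr;
    if (cudaMalloc(&d, bytes) != cudaSuccess) return false;

    // One upload before the loop...
    if (cudaMemcpy(d, host.data(), bytes, cudaMemcpyHostToDevice) != cudaSuccess) {
        cudaFree(d);
        return false;
    }

    // ...then every iteration works on data that is already on the GPU.
    for (int k = 0; k < iters; ++k) {
        step<<<(n + 255) / 256, 256>>>(d, n);
    }

    // One download after the loop; this blocking copy also waits for the kernels.
    cudaError_t err = cudaMemcpy(host.data(), d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
    return err == cudaSuccess;
}
```

Starting from a vector of zeros, ten iterations should leave every element just below 2. The contrast case, copying the vector back and forth on every iteration, would pay the PCI-E cost `iters` times instead of once.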
