CUDA: sharing data between multiple devices?
In the CUDA C Programming Guide, it is said that

... by design, a host thread can execute device code on only one device at any given time. As a consequence, multiple host threads are required to execute device code on multiple devices. Also, any CUDA resources created through the runtime in one host thread cannot be used by the runtime from another host thread...

What I want to do is make two GPUs share data on the host (mapped memory), but the manual seems to say that this is not possible.
Is there any solution for this?
6 Answers
When you are allocating the host memory, you should allocate using cudaHostAlloc() and pass the cudaHostAllocPortable flag. This will allow the memory to be accessed by multiple CUDA contexts.
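For illustration, a minimal sketch of such an allocation (the buffer size and the surrounding structure are placeholders, not from the original answer):

    /* Minimal sketch: allocate portable pinned host memory so that every
       CUDA context (one per host thread in the pre-4.0 runtime) can use it.
       Error handling is reduced to a single check for brevity. */
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        float *h_buf = NULL;
        size_t bytes = 1 << 20;  /* 1 MiB; illustrative size */

        /* cudaHostAllocPortable makes the pinned allocation usable by all
           CUDA contexts, not only the one that allocated it. */
        if (cudaHostAlloc((void **)&h_buf, bytes,
                          cudaHostAllocPortable) != cudaSuccess) {
            fprintf(stderr, "cudaHostAlloc failed\n");
            return 1;
        }

        /* ... each per-GPU host thread can now cudaMemcpy to/from h_buf ... */

        cudaFreeHost(h_buf);
        return 0;
    }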
The solution is to manage this common data manually, even with SLI.
http://forums.nvidia.com/index.php?showtopic=30740
You may want to look at GMAC. It's a library built on top of CUDA that gives the illusion of shared memory. What it actually does is to allocate memory at the same virtual address on the host and GPU devices, and use page protection to transfer data on demand. Be aware that it is somewhat experimental, maybe in the beta testing stage.
http://code.google.com/p/adsm/
Maybe think about using something like MPI along with CUDA?
http://forums.nvidia.com/index.php?showtopic=30741
http://www.ncsa.illinois.edu/UserInfo/Training/Workshops/CUDA/presentations/tutorial-CUDA.html
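A rough sketch of that idea, assuming one MPI rank per GPU and staging through pinned host buffers (the rank-to-device mapping and the element count are assumptions for illustration):

    /* Sketch: one MPI rank per GPU; data moves GPU -> host -> MPI -> host -> GPU.
       Error checks omitted to keep the flow visible. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        cudaSetDevice(rank);                  /* assumes rank i drives GPU i */

        const int n = 1024;                   /* illustrative element count */
        float *d_data, *h_data;
        cudaMalloc((void **)&d_data, n * sizeof(float));
        cudaMallocHost((void **)&h_data, n * sizeof(float)); /* pinned staging */

        if (rank == 0) {
            /* ... a kernel on GPU 0 fills d_data ... */
            cudaMemcpy(h_data, d_data, n * sizeof(float),
                       cudaMemcpyDeviceToHost);
            MPI_Send(h_data, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(h_data, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            cudaMemcpy(d_data, h_data, n * sizeof(float),
                       cudaMemcpyHostToDevice);
        }

        cudaFreeHost(h_data);
        cudaFree(d_data);
        MPI_Finalize();
        return 0;
    }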
You want to allocate your pinned memory as portable by passing cudaHostAllocPortable to cudaHostAlloc(). You can exchange data outside the kernel between devices from the same pinned memory for sure, as I've done this before. As for mapped memory, I'm not quite as sure, but I don't see why you wouldn't be able to. Try using cudaHostGetDevicePointer() to get the device pointer to use for the current device (the one you've associated with the same CPU thread). There's more info in section 3.2.5.3 of the CUDA Programming Guide (v3.2).
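A hedged sketch of the mapped-and-portable combination this answer suggests (the flag combination and per-thread structure follow the guide's description; this is an untested illustration, not the answerer's code):

    /* Sketch: per-device setup each host thread would run so that one
       mapped, portable host buffer is visible from every GPU's context. */
    #include <cuda_runtime.h>

    float *h_buf;                  /* allocated once, shared by all threads */

    /* In the allocating thread, after enabling mapping there (see below):
       cudaHostAlloc((void **)&h_buf, bytes,
                     cudaHostAllocPortable | cudaHostAllocMapped); */

    void per_device_setup(int device)  /* called by one host thread per GPU */
    {
        cudaSetDevice(device);
        /* Must be set before this thread's context is created. */
        cudaSetDeviceFlags(cudaDeviceMapHost);

        float *d_ptr;
        /* Device-side alias of the mapped host buffer for this context. */
        cudaHostGetDevicePointer((void **)&d_ptr, h_buf, 0);
        /* ... launch kernels that read and write d_ptr ... */
    }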
I have specifically asked a similar question on the NVIDIA forums regarding how to transfer data between two GPUs and have received responses saying that if you want to use two GPUs simultaneously and transfer data between them, you must have two threads (as the manual suggests). The manual says that "CUDA resources" cannot be shared, but the host memory they are copied from can be shared (using OpenMP or MPI). Thus, if you transfer your memory back to the host from each device, you can access memory between devices.
Keep in mind that this will be very slow, as memory transfers to and from the devices are very slow.
So no, you can't access gpu1's memory from gpu2 (even with SLI; I've been yelled at for bringing that up, since it's not related to CUDA at all). However, you can take gpu1, write to a region on the host, then take gpu2 and write to another region, and let the threads managing each device write the necessary data back to the correct GPU.
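That host-staging pattern could look roughly like the following pthreads sketch, assuming the pre-CUDA-4.0 one-context-per-thread model (the region size, the barrier placement, and the omitted kernel are placeholders):

    /* Sketch: each host thread owns one GPU and publishes results to its own
       host region; the other GPU's data is pulled in only via the host. */
    #include <cuda_runtime.h>
    #include <pthread.h>

    #define N 1024                     /* illustrative region size */
    static float h_region[2][N];       /* one staging region per GPU */
    static pthread_barrier_t barrier;

    static void *worker(void *arg)
    {
        int dev = (int)(size_t)arg;
        cudaSetDevice(dev);            /* binds this thread's context to GPU dev */

        float *d_buf;
        cudaMalloc((void **)&d_buf, N * sizeof(float));

        /* ... a kernel fills d_buf ... then publish to this GPU's region */
        cudaMemcpy(h_region[dev], d_buf, N * sizeof(float),
                   cudaMemcpyDeviceToHost);

        /* Wait until both threads have published their regions. */
        pthread_barrier_wait(&barrier);

        /* Pull the other GPU's region back onto this device. */
        cudaMemcpy(d_buf, h_region[1 - dev], N * sizeof(float),
                   cudaMemcpyHostToDevice);

        cudaFree(d_buf);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[2];
        pthread_barrier_init(&barrier, NULL, 2);
        for (int i = 0; i < 2; ++i)
            pthread_create(&t[i], NULL, worker, (void *)(size_t)i);
        for (int i = 0; i < 2; ++i)
            pthread_join(t[i], NULL);
        pthread_barrier_destroy(&barrier);
        return 0;
    }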