Prefetching in Nvidia CUDA
I'm working on data prefetching in Nvidia CUDA. I have read some documents on prefetching on the device itself, i.e. prefetching from shared memory to cache.
But I'm interested in data prefetching between the CPU and the GPU. Can anyone point me to documents or other material on this topic? Any help would be appreciated.
Comments (3)
Answer based on your comment:
CUDA streams were introduced to enable exactly this approach.
If your computation is rather intensive, then yes, it can greatly speed up your performance. On the other hand, if data transfers take, say, 90% of your time, you will only save on the computation time, that is, 10% at most...
The details on how to use streams, including examples, are provided in the CUDA Programming Guide.
For version 4.0, that is section "3.2.5.5 Streams", and in particular "3.2.5.5.5 Overlapping Behavior", where they launch another, asynchronous memory copy while a kernel is still running.
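A minimal sketch of that pattern (the kernel name and buffer sizes here are made up for illustration): a kernel runs in one stream while an asynchronous copy proceeds in another, which requires the host buffer to be page-locked:

```cuda
#include <cuda_runtime.h>

// Placeholder kernel; any sufficiently long-running kernel works here.
__global__ void myKernel(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int N = 1 << 20;
    float *h_a, *d_a, *d_b;
    cudaMallocHost((void **)&h_a, N * sizeof(float)); // page-locked host buffer
    cudaMalloc((void **)&d_a, N * sizeof(float));
    cudaMalloc((void **)&d_b, N * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // The kernel in s0 and the copy in s1 are independent, so the
    // hardware is free to overlap them.
    myKernel<<<(N + 255) / 256, 256, 0, s0>>>(d_a, N);
    cudaMemcpyAsync(d_b, h_a, N * sizeof(float), cudaMemcpyHostToDevice, s1);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s0); cudaStreamDestroy(s1);
    cudaFreeHost(h_a); cudaFree(d_a); cudaFree(d_b);
    return 0;
}
```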
Perhaps you would be interested in the asynchronous host/device memory transfer capabilities of CUDA 4.0? You can overlap host/device memory transfers and kernels by using page-locked host memory. You could use this to...
So you could be streaming data in and out of the GPU and computing on it all at once (!). Please refer to the CUDA 4.0 Programming Guide and CUDA 4.0 Best Practices Guide for more detailed information. Good luck!
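For example, a short sketch of the page-locked allocation itself (sizes here are arbitrary); the cudaHostAlloc'd buffer is what lets cudaMemcpyAsync return to the host immediately and overlap with kernels in other streams:

```cuda
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 24;
    float *h_pinned, *d_buf;
    // cudaHostAllocDefault gives plain pinned memory; other flags
    // (e.g. cudaHostAllocMapped) enable zero-copy device access.
    cudaHostAlloc((void **)&h_pinned, bytes, cudaHostAllocDefault);
    cudaMalloc((void **)&d_buf, bytes);

    cudaStream_t s;
    cudaStreamCreate(&s);
    // Returns immediately; the transfer runs in the background.
    cudaMemcpyAsync(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice, s);
    // ... launch kernels in other streams here to overlap with the copy ...
    cudaStreamSynchronize(s);

    cudaStreamDestroy(s);
    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    return 0;
}
```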
CUDA 6 will eliminate the need to copy, i.e. the copying will be automatic.
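That is the managed ("unified") memory model: a single allocation is visible from both host and device, so explicit copies disappear. A minimal sketch (the kernel here is a placeholder):

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int N = 1 << 20;
    float *data;
    cudaMallocManaged((void **)&data, N * sizeof(float)); // one pointer for CPU and GPU

    for (int i = 0; i < N; ++i) data[i] = 1.0f; // host writes directly
    scale<<<(N + 255) / 256, 256>>>(data, N);   // device uses the same pointer
    cudaDeviceSynchronize();                    // needed before the host reads again

    cudaFree(data);
    return 0;
}
```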
However, you may still benefit from prefetching.
In a nutshell, you want the data for the "next" computation to be transferring while you complete the current computation. To achieve that you need at least two threads on the CPU and some kind of signalling scheme (to know when to send the next data); see the sketch below. Chunking will of course play a big role and affect performance.
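A minimal sketch of that double-buffering idea, with two CUDA streams standing in for the two-thread/signalling scheme (stream ordering provides the synchronisation; the chunk count, sizes, and kernel are placeholders): while the kernel for chunk k runs in one stream, the copy for chunk k+1 proceeds in the other:

```cuda
#include <cuda_runtime.h>

__global__ void process(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

int main() {
    const int CHUNK = 1 << 20, NCHUNKS = 8;
    float *h;
    cudaMallocHost((void **)&h, (size_t)CHUNK * NCHUNKS * sizeof(float)); // pinned

    float *d[2];                               // two device buffers, ping-pong
    cudaMalloc((void **)&d[0], CHUNK * sizeof(float));
    cudaMalloc((void **)&d[1], CHUNK * sizeof(float));

    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    for (int k = 0; k < NCHUNKS; ++k) {
        int b = k & 1;                         // alternate buffer/stream each chunk
        // Within stream s[b] the copy precedes the kernel; across streams,
        // this copy overlaps the kernel still running on the other buffer.
        cudaMemcpyAsync(d[b], h + (size_t)k * CHUNK,
                        CHUNK * sizeof(float), cudaMemcpyHostToDevice, s[b]);
        process<<<(CHUNK + 255) / 256, 256, 0, s[b]>>>(d[b], CHUNK);
    }
    cudaDeviceSynchronize();

    cudaStreamDestroy(s[0]); cudaStreamDestroy(s[1]);
    cudaFree(d[0]); cudaFree(d[1]); cudaFreeHost(h);
    return 0;
}
```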
The same scheme may be easier on an APU (CPU+GPU on the same die), because both processors can access the same memory, which eliminates the need to copy at all.
If you want to find papers on GPU prefetching, just use Google Scholar.