CUDA - better occupancy vs. fewer global memory accesses?
My CUDA code must work with (reduce to mean/std, calculate a histogram of) 4 arrays, each 2048 floats long and already stored in device memory by previous kernels.
It is generally advised to launch at least as many blocks as I have multiprocessors. In this case however, I can load each of these arrays into the shared memory of a single block and therefore only launch 4 blocks.
This is far from 'keeping the GPU busy', but if I use more blocks I will need to do more inter-block communication via global memory, and I anticipate that any extra utilisation of the multiprocessors will be in vain due to the extra time spent transferring data in and out of global memory.
What is the best way to parallelise in this kind of situation?
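To make the approach concrete, here is a minimal sketch of the one-block-per-array idea described above: each block loads its 2048-float array and reduces it to a sum and a sum of squares, from which mean and std follow. The kernel and variable names (reduceStats, d_data, d_mean, d_std) are illustrative assumptions, not code from this post.

```
#include <cuda_runtime.h>
#include <math.h>

#define N 2048          // elements per array
#define THREADS 256     // threads per block (power of two)

__global__ void reduceStats(const float *d_data, float *d_mean, float *d_std)
{
    __shared__ float s_sum[THREADS];
    __shared__ float s_sq[THREADS];

    const float *array = d_data + blockIdx.x * N;   // each block owns one array
    float sum = 0.0f, sq = 0.0f;

    // Each thread accumulates a strided portion of the 2048 elements.
    for (int i = threadIdx.x; i < N; i += blockDim.x) {
        float v = array[i];
        sum += v;
        sq  += v * v;
    }
    s_sum[threadIdx.x] = sum;
    s_sq[threadIdx.x]  = sq;
    __syncthreads();

    // Standard shared-memory tree reduction.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) {
            s_sum[threadIdx.x] += s_sum[threadIdx.x + stride];
            s_sq[threadIdx.x]  += s_sq[threadIdx.x + stride];
        }
        __syncthreads();
    }

    if (threadIdx.x == 0) {
        float mean = s_sum[0] / N;
        d_mean[blockIdx.x] = mean;
        d_std[blockIdx.x]  = sqrtf(s_sq[0] / N - mean * mean);
    }
}

// Launched with one block per array:
//   reduceStats<<<4, THREADS>>>(d_data, d_mean, d_std);
```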
4 Answers
Inter-block communication is not recommended in CUDA. Also, Fermi will support concurrent kernel execution, so higher occupancy will become less important in the future. So I would recommend just leaving it with lower occupancy for now, unless the performance is unacceptably low.
This example shows how to compute all the "summary statistics" in a single reduction with Thrust. Another example shows how to compute a histogram using thrust::sort.
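The linked examples are not reproduced here, but a rough sketch of both ideas might look like the following; the functor and function names (sum_and_sq, summarize, histogram) and the bin-edge representation are assumptions for illustration, not the exact code from those examples.

```
#include <thrust/device_vector.h>
#include <thrust/transform_reduce.h>
#include <thrust/sort.h>
#include <thrust/binary_search.h>
#include <thrust/adjacent_difference.h>
#include <cmath>

// Map each value to (x, x*x) so one reduction yields sum and sum of squares.
struct sum_and_sq
{
    __host__ __device__
    float2 operator()(float x) const { return make_float2(x, x * x); }
};
struct add_float2
{
    __host__ __device__
    float2 operator()(float2 a, float2 b) const
    {
        return make_float2(a.x + b.x, a.y + b.y);
    }
};

// Mean and std in a single pass over the device data.
void summarize(const thrust::device_vector<float> &d_array, float &mean, float &stddev)
{
    const float n = static_cast<float>(d_array.size());
    float2 r = thrust::transform_reduce(d_array.begin(), d_array.end(),
                                        sum_and_sq(), make_float2(0.f, 0.f),
                                        add_float2());
    mean   = r.x / n;
    stddev = std::sqrt(r.y / n - mean * mean);
}

// Histogram: sort the data, then binary-search each bin's upper edge.
// Note that sorting modifies d_array; d_counts must already have one
// entry per bin edge.
void histogram(thrust::device_vector<float> &d_array,
               const thrust::device_vector<float> &d_bin_edges,
               thrust::device_vector<int> &d_counts)
{
    thrust::sort(d_array.begin(), d_array.end());
    thrust::upper_bound(d_array.begin(), d_array.end(),
                        d_bin_edges.begin(), d_bin_edges.end(),
                        d_counts.begin());                      // cumulative counts
    thrust::adjacent_difference(d_counts.begin(), d_counts.end(),
                                d_counts.begin());              // per-bin counts
}
```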
The amount of work you are doing is relatively small, so you should probably stick with four blocks. There is still an advantage of keeping the data local to the GPU for previous/subsequent kernels.
Fermi will allow concurrent kernels, and it is exactly this case that stands to benefit the most, since you can start the next kernel to occupy the remaining SMs while this kernel is executing. However, this does assume that there are no dependencies between the two kernels - naturally you will not be able to start a new kernel that is dependent on the result of the previous kernel before the previous kernel has finished.
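As a minimal sketch of how that would look, assuming the two kernels really are independent, each one is launched into its own stream so the hardware may overlap them; the kernel names and launch configurations below are illustrative.

```
#include <cuda_runtime.h>

__global__ void statsKernel(const float *d_data) { /* the small 4-block reduction */ }
__global__ void nextKernel(float *d_other)       { /* independent follow-on work  */ }

void launchConcurrently(const float *d_data, float *d_other)
{
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Kernels in different streams may run concurrently on hardware that
    // supports concurrent kernel execution, provided they are independent.
    statsKernel<<<4, 256, 0, s1>>>(d_data);
    nextKernel <<<64, 256, 0, s2>>>(d_other);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}
```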
I don't think you need all 2048 floats at once; if you have to reduce, you can split the arrays into different parts and then merge the results at the end of the block execution. Can you show some sample code?
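A rough sketch of that splitting idea, for a plain sum over one 2048-float array: several blocks each reduce a chunk to a partial sum, and a second tiny launch merges the partials. The names (partialSum, mergePartials, d_partial) and the two-pass structure are illustrative assumptions, not code from this thread.

```
#include <cuda_runtime.h>

#define THREADS 256

// Pass 1: each block reduces its chunk of the array into one partial sum.
__global__ void partialSum(const float *d_data, float *d_partial, int n)
{
    __shared__ float s[THREADS];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    s[threadIdx.x] = (i < n) ? d_data[i] : 0.0f;
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            s[threadIdx.x] += s[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        d_partial[blockIdx.x] = s[0];
}

// Pass 2: a single block merges the per-block partial sums.
__global__ void mergePartials(const float *d_partial, float *d_sum, int numPartials)
{
    __shared__ float s[THREADS];
    float acc = 0.0f;
    for (int i = threadIdx.x; i < numPartials; i += blockDim.x)
        acc += d_partial[i];
    s[threadIdx.x] = acc;
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            s[threadIdx.x] += s[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        *d_sum = s[0];
}

// e.g. 8 blocks over a 2048-element array, then one merge block:
//   partialSum<<<8, THREADS>>>(d_data, d_partial, 2048);
//   mergePartials<<<1, THREADS>>>(d_partial, d_sum, 8);
```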