Thread hierarchy design in a CUDA kernel
Assume a block is limited to 512 threads and my kernel needs more than 512 threads to execute. How should I design the thread hierarchy for optimal performance?
Case 1:
1st block: 512 threads
2nd block: the remaining threads
Case 2: distribute an equal number of threads across the blocks.
Comments (2)
I don't think it really matters. It is more important to group the thread blocks logically, so that you can use other CUDA optimizations (such as memory coalescing).
This link provides some insight into how CUDA will (likely) organize your threads.
A quote from the summary:
It is preferable to divide the threads equally between two blocks, in order to maximize the overlap between computation and memory access. When you have, for instance, 256 threads in a block, they do not all compute at the same time; they are scheduled on the SM in warps of 32 threads. When one warp is waiting for data from global memory, another warp is scheduled. If you have a small block of threads, there are fewer warps to switch to, so your global memory accesses are penalized much more.
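To make the warp argument concrete, here is a minimal host-side C++ sketch (it also compiles inside a .cu file) that counts the warps available in each block under the two layouts from the question. The helper names `warps_in_block` and `even_block_size` are hypothetical, chosen only for this illustration; the only hardware fact assumed is the warp size of 32 threads mentioned above.

```cpp
#include <cassert>

// Threads are scheduled in warps of 32, as described in the answer above.
constexpr int kWarpSize = 32;

// Number of warps needed for a block of `threads_in_block` threads.
// A partially filled warp still counts as one warp.
int warps_in_block(int threads_in_block) {
    assert(threads_in_block > 0);
    return (threads_in_block + kWarpSize - 1) / kWarpSize;
}

// Case 2 from the question: split `total_threads` evenly over `num_blocks`,
// rounding up so that every thread is covered.
int even_block_size(int total_threads, int num_blocks) {
    assert(total_threads > 0 && num_blocks > 0);
    return (total_threads + num_blocks - 1) / num_blocks;
}
```

As an illustration, take a made-up total of 600 threads. Case 1 leaves 88 threads for the second block, so `warps_in_block(88)` gives only 3 warps to hide memory latency there, against 16 in the first block. Case 2 gives `even_block_size(600, 2)` = 300 threads per block, which is 10 warps in each block.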
Furthermore, in your example you underuse your GPU. Remember that a GPU has dozens of multiprocessors (e.g. 30 on the Tesla C1060), and a block is mapped to a single multiprocessor. In your case, you will use at most 2 multiprocessors.
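The utilization point can be checked with a little arithmetic. The following host-side C++ sketch assumes only what is stated above: a block runs on a single multiprocessor, so the number of multiprocessors that can receive work is bounded by the number of blocks. The function names `blocks_needed` and `max_busy_multiprocessors` are hypothetical, and the result is an upper bound rather than a prediction of what the hardware scheduler actually does.

```cpp
#include <algorithm>
#include <cassert>

// Blocks needed to cover `total_threads` when each block holds
// `threads_per_block` threads (ceiling division).
int blocks_needed(int total_threads, int threads_per_block) {
    assert(total_threads > 0 && threads_per_block > 0);
    return (total_threads + threads_per_block - 1) / threads_per_block;
}

// Upper bound on the multiprocessors that can receive at least one block.
// A block never spans multiprocessors, so the bound is
// min(number of blocks, number of multiprocessors).
int max_busy_multiprocessors(int num_blocks, int num_multiprocessors) {
    assert(num_blocks > 0 && num_multiprocessors > 0);
    return std::min(num_blocks, num_multiprocessors);
}
```

With an illustrative total of 600 threads and 30 multiprocessors, 512 threads per block gives `blocks_needed(600, 512)` = 2 blocks, so at most 2 multiprocessors are busy. Dropping to 64 threads per block gives 10 blocks and therefore up to 10 busy multiprocessors, at the cost of fewer warps per block, so the two answers pull in opposite directions for small problems.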