Minimum effective number of GPU threads
I'm going to parallelize a local search algorithm for an optimization problem on CUDA. The problem is very hard, so the problem sizes that are practically solvable are quite small.
My concern is that the number of threads planned to run in one kernel is insufficient to obtain any speedup on the GPU (even assuming all memory accesses are coalesced and free of bank conflicts, the threads are non-branching, etc.).
Let's say a kernel is launched with 100 threads. Is it reasonable to expect any benefit from using the GPU? What if the number of threads is 1000? What additional information is needed to analyze the case?
1 Answer
100 threads is not really enough. Ideally you want a size that can be divided into at least as many thread blocks as there are multiprocessors (SMs) on the GPU, otherwise you will be leaving processors idle. Each thread block should have no fewer than 32 threads, for the same reason. Ideally, you should have a small multiple of 32 threads per block (say 96-512 threads) and, if possible, several such blocks per SM.
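For illustration, here is a minimal sketch of choosing a launch configuration along these lines. The kernel name search_kernel and the factor of four blocks per SM are placeholders for this example, not prescriptions from the answer:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Placeholder for the local-search kernel under discussion.
    __global__ void search_kernel(int n) { /* ... */ }

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        int blockSize = 256;                          // a small multiple of 32
        int numBlocks = 4 * prop.multiProcessorCount; // several blocks per SM

        search_kernel<<<numBlocks, blockSize>>>(numBlocks * blockSize);
        cudaDeviceSynchronize();

        printf("%d SMs -> launched %d blocks of %d threads\n",
               prop.multiProcessorCount, numBlocks, blockSize);
        return 0;
    }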
At a minimum, you should try to have enough threads to cover the arithmetic latency of the SMs, which means that on a Compute Capability 2.0 GPU, you need about 10-16 warps (groups of 32 threads) per SM. They don't all need to come from the same thread block, though. So that means, for example, on a Tesla M2050 GPU with 14 SMs, you would need at least 4480 threads, divided into at least 14 blocks.
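That arithmetic can be read straight off the device properties. A small helper sketch, assuming the 10-warps-per-SM lower bound quoted above:

    #include <cuda_runtime.h>

    // Rough lower bound on the thread count needed to cover arithmetic
    // latency, assuming ~10 warps per SM (the low end of the 10-16 range
    // for Compute Capability 2.0 devices mentioned above).
    int minThreadsToHideLatency(int device, int warpsPerSM = 10)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, device);
        // Tesla M2050: 14 SMs * 10 warps * 32 threads/warp = 4480 threads.
        return prop.multiProcessorCount * warpsPerSM * prop.warpSize;
    }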
That said, fewer threads than this could also provide a speedup -- it depends on many factors. If the computation is bandwidth-bound, for example, and you can keep the data in device memory, then you could get a speedup because GPU device memory bandwidth is higher than CPU memory bandwidth. Or, if it is compute-bound and there is a lot of instruction-level parallelism (independent instructions from the same thread), then you won't need as many threads to hide latency. This latter point is described very well in Vasily Volkov's "Better Performance at Lower Occupancy" talk from GTC 2010.
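To make the instruction-level-parallelism point concrete, here is a hypothetical kernel in the spirit of Volkov's talk: each thread keeps four independent accumulators, so the SM can overlap their latencies and needs fewer threads in flight.

    // Sum of squares with four independent accumulator chains per thread.
    // The four multiply-adds in each iteration do not depend on one
    // another, so the scheduler can issue them back to back, hiding
    // latency with ILP rather than with extra threads. The partial array
    // holds one result per thread.
    __global__ void sum_squares_ilp(const float* x, float* partial, int n)
    {
        int tid    = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = gridDim.x * blockDim.x;

        float a0 = 0.f, a1 = 0.f, a2 = 0.f, a3 = 0.f;
        for (int i = tid; i + 3 * stride < n; i += 4 * stride) {
            a0 += x[i]              * x[i];
            a1 += x[i +     stride] * x[i +     stride];
            a2 += x[i + 2 * stride] * x[i + 2 * stride];
            a3 += x[i + 3 * stride] * x[i + 3 * stride];
        }
        // Tail elements and the final cross-thread reduction are omitted
        // for brevity.
        partial[tid] = a0 + a1 + a2 + a3;
    }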
The main thing is to make sure you use all of the SMs: without doing so, you aren't using all of the computational performance or bandwidth the GPU can provide.