CUDA block and grid size efficiencies
What is the advised way of dealing with dynamically-sized datasets in CUDA?
Is it a case of 'set the block and grid sizes based on the problem set' or is it worthwhile to assign block dimensions as factors of 2 and have some in-kernel logic to deal with the over-spill?
I can see how this probably matters a lot for the block dimensions, but how much does it matter for the grid dimensions? As I understand it, the actual hardware constraints stop at the block level (i.e. blocks are assigned to SMs that have a set number of SPs, and so can handle a particular warp size).
I've perused Kirk's 'Programming Massively Parallel Processors' but it doesn't really touch on this area.
Comments (4)
It's usually a case of setting the block size for optimal performance, and the grid size according to the total amount of work. Most kernels have a "sweet spot" number of warps per MP where they work best, and you should do some benchmarking/profiling to see where that is. You probably still need over-spill logic in the kernel because problem sizes are rarely round multiples of block sizes.
EDIT:
To give a concrete example of how this might be done for a simple kernel (in this case a custom BLAS level 1 dscal type operation done as part of a Cholesky factorization of packed symmetric band matrices):
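A minimal sketch of such a kernel is shown below; the name dscal_kernel and its plain, contiguous-array signature are illustrative assumptions rather than the answer's original code, but the over-spill guard is the key point:

```cuda
// Illustrative dscal-style kernel: scale n doubles by alpha.
// The bounds check handles the over-spill threads in the final block,
// so the grid can simply be rounded up to cover any problem size.
__global__ void dscal_kernel(int n, double alpha, double *x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] *= alpha;
}
```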
To launch this kernel, the execution parameters are calculated as follows:
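As a sketch of that calculation (the block size of 256 is an assumed profiled value, not the answer's tuned figure):

```cuda
// Derive execution parameters from the problem size n: a fixed, profiled
// block size and a grid size rounded up so that gridSize * blockSize >= n.
void launch_params(int n, int &blockSize, int &gridSize)
{
    blockSize = 256;                              // assumed "sweet spot" from profiling
    gridSize  = (n + blockSize - 1) / blockSize;  // ceiling division
}
```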
The resulting wrapper function containing the execution parameter calculations and the kernel launch looks like this:
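A sketch of such a wrapper, reusing the hypothetical dscal_kernel defined above (the function name and signature are again illustrative):

```cuda
#include <cuda_runtime.h>

// Host-side wrapper (illustrative): computes the execution parameters from
// the input size and launches the kernel on a device array x.
void dscal(int n, double alpha, double *x)
{
    const int blockSize = 256;                              // assumed profiled value
    const int gridSize  = (n + blockSize - 1) / blockSize;  // round up to cover n

    dscal_kernel<<<gridSize, blockSize>>>(n, alpha, x);
}
```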
Perhaps this gives some hints about how to design a "universal" scheme for setting execution parameters against input data size.
Ok I guess we are dealing with two questions here.
1) Good way to assign block sizes (i.e. the number of threads)
This usually depends on the kind of data you are dealing with. Are you dealing with vectors? Are you dealing with matrices? The suggested way is to keep the number of threads a multiple of 32. So when dealing with vectors, launching 256 x 1 or 512 x 1 blocks may be fine, and similarly when dealing with matrices, 32 x 8 or 32 x 16.
2) Good way to assign grid sizes (i.e. the number of blocks)
It gets a bit tricky here. Just launching 10,000 blocks because we can is not normally the best way to do things. Switching blocks in and out of hardware is costly. Two things to consider are the shared memory used per block and the total number of SPs available, and then solve for the optimal number.
You can find a really good implementation of how to do that in Thrust. It may take a while to figure out what's happening inside the code, though.
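Thrust's launch-configuration code is not reproduced here, but as a rough sketch in the same spirit, the CUDA runtime's occupancy API can solve for a reasonable block size (the kernel below is a placeholder used only for illustration):

```cuda
#include <cuda_runtime.h>

// Placeholder kernel, used only to give the occupancy calculator something to query.
__global__ void my_kernel(int n, const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * in[i];
}

void launch(int n, const float *in, float *out)
{
    int minGridSize = 0, blockSize = 0;
    // Ask the runtime for a block size that maximizes occupancy for this kernel,
    // taking its register and shared-memory usage into account.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, my_kernel, 0, 0);

    int gridSize = (n + blockSize - 1) / blockSize;  // enough blocks to cover n
    my_kernel<<<gridSize, blockSize>>>(n, in, out);
}
```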
I think it's usually best to set the block and grid sizes based on the problem set, especially for optimization purposes. Having extra threads that do nothing doesn't really make sense and can worsen the performance of your programs.
If you have dynamically sized data sets then you will likely run into some issues with latency while some threads and blocks wait for others to complete.
This site has some great heuristics. Some general highlights:
Choosing Blocks Per Grid
- The more you use __syncthreads() in your kernels, the more blocks you want (so that one block can run while another waits on the sync).

Choosing Threads Per Block
- Use threads in multiples of the warp size (i.e. generally 32).
- It is generally good to choose the number of threads per block such that the hardware's maximum number of resident threads per multiprocessor is a multiple of it. E.g. with a maximum of 768 resident threads, using 256 threads per block tends to be better than 512, because three 256-thread blocks can be resident at once (3 x 256 = 768), whereas only one 512-thread block fits.
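To make the last heuristic concrete, a small host-side sketch of the arithmetic (the 768-thread limit is simply the figure from the example above, not queried from a real device):

```cuda
#include <cstdio>

// Whole blocks of a given size that fit within the per-multiprocessor thread
// limit (ignoring register and shared-memory limits for simplicity).
static int residentBlocks(int maxThreadsPerSM, int threadsPerBlock)
{
    return maxThreadsPerSM / threadsPerBlock;
}

int main()
{
    const int maxThreads = 768;          // the example figure used above
    const int candidates[] = {256, 512};
    for (int tpb : candidates) {
        int blocks = residentBlocks(maxThreads, tpb);
        // 256 -> 3 blocks (768 threads resident); 512 -> 1 block (512 resident)
        printf("%d threads/block: %d resident blocks, %d threads in flight\n",
               tpb, blocks, blocks * tpb);
    }
    return 0;
}
```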