CUDA block and grid size efficiency

What is the advised way of dealing with dynamically-sized datasets in CUDA?

Is it a case of 'set the block and grid sizes based on the problem set' or is it worthwhile to assign block dimensions as factors of 2 and have some in-kernel logic to deal with the over-spill?

I can see how this probably matters a lot for the block dimensions, but how much does it matter for the grid dimensions? As I understand it, the actual hardware constraints stop at the block level (i.e. blocks are assigned to SMs that have a set number of SPs, and so can handle a particular warp size).

I've perused Kirk's 'Programming Massively Parallel Processors' but it doesn't really touch on this area.

4 Answers

当梦初醒 2024-11-10 20:52:21

It's usually a case of setting the block size for optimal performance, and the grid size according to the total amount of work. Most kernels have a "sweet spot" number of warps per MP where they work best, and you should do some benchmarking/profiling to see where that is. You probably still need over-spill logic in the kernel because problem sizes are rarely exact multiples of block sizes.

EDIT:
To give a concrete example of how this might be done for a simple kernel (in this case a custom BLAS level 1 dscal type operation done as part of a Cholesky factorization of packed symmetric band matrices):

// Fused square root and dscal operation
__global__ 
void cdivkernel(const int n, double *a)
{
    // Reciprocal square root of the diagonal element, computed once per block
    // by thread 0 and broadcast to the whole block via shared memory
    __shared__ double oneondiagv;

    int imin = threadIdx.x + blockDim.x * blockIdx.x;
    int istride = blockDim.x * gridDim.x;

    if (threadIdx.x == 0) {
        oneondiagv = rsqrt( a[0] );
    }
    __syncthreads();

    // Grid-stride loop: covers all n elements even when the grid is
    // smaller than the problem size
    for(int i=imin; i<n; i+=istride) {
        a[i] *= oneondiagv;
    }
}

To launch this kernel, the execution parameters are calculated as follows:

  1. We allow up to 4 warps per block (so 128 threads). Normally you would fix this at an optimal number, but in this case the kernel is often called on very small vectors, so having a variable block size made some sense.
  2. We then compute the block count according to the total amount of work, up to 112 total blocks, which is the equivalent of 8 blocks per MP on a 14 MP Fermi Tesla. The kernel will iterate via its grid-stride loop if the amount of work exceeds the grid size.

The resulting wrapper function containing the execution parameter calculations and kernel launch looks like this:

// Fused the diagonal element root and dscal operation into
// a single "cdiv" operation
void fusedDscal(const int n, double *a)
{
    // The semibandwidth (column length) determines
    // how many warps are required per column of the 
    // matrix.
    const int warpSize = 32;
    const int maxGridSize = 112; // this is 8 blocks per MP for a Tesla C2050

    int warpCount = (n / warpSize) + (((n % warpSize) == 0) ? 0 : 1);
    int warpPerBlock = max(1, min(4, warpCount));

    // For the cdiv kernel, the block size is allowed to grow to
    // four warps per block, and the block count becomes the warp count over four
    // or the GPU "fill" whichever is smaller
    int threadCount = warpSize * warpPerBlock;
    int blockCount = min( maxGridSize, max(1, warpCount/warpPerBlock) );
    dim3 BlockDim = dim3(threadCount, 1, 1);
    dim3 GridDim  = dim3(blockCount, 1, 1);

    cdivkernel<<< GridDim,BlockDim >>>(n,a);
    errchk( cudaPeekAtLastError() );
}

Perhaps this gives some hints about how to design a "universal" scheme for setting execution parameters against input data size.
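To make that hint a little more concrete, here is a minimal sketch of the same pattern applied to a generic element-wise operation (not from the answer above; the kernel name, helper, and tuning constants threadsPerBlock and maxResidentBlocks are illustrative assumptions): fix the block size at a tuned value, cap the grid at a chosen device "fill", and let a grid-stride loop absorb any remaining work.

#include <algorithm>

// Sketch only: generic grid-stride kernel plus a launch helper following the
// same sizing pattern as fusedDscal above.
__global__
void scaleKernel(const int n, const double alpha, double *a)
{
    int imin    = threadIdx.x + blockDim.x * blockIdx.x;
    int istride = blockDim.x * gridDim.x;

    // Grid-stride loop: correct for any n, even if the grid is smaller than n
    for (int i = imin; i < n; i += istride) {
        a[i] *= alpha;
    }
}

void launchScale(const int n, const double alpha, double *a)
{
    const int threadsPerBlock   = 128; // tuned "sweet spot"; device dependent
    const int maxResidentBlocks = 112; // e.g. 8 blocks per MP on a 14 MP device

    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // ceiling division
    blocks = std::min(std::max(blocks, 1), maxResidentBlocks); // cap at the GPU "fill"

    scaleKernel<<<blocks, threadsPerBlock>>>(n, alpha, a);
}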

我ぃ本無心為│何有愛 2024-11-10 20:52:21

Ok I guess we are dealing with two questions here.

1) Good way to assign block sizes (i.e. the number of threads)
This usually depends on the kind of data you are dealing with. Are you dealing with vectors? Are you dealing with matrices? The suggested way is to keep the number of threads in multiples of 32. So when dealing with vectors, launching 256 x 1 or 512 x 1 blocks may be fine, and similarly when dealing with matrices, 32 x 8 or 32 x 16 (see the sketches after this answer).

2) Good way to assign grid sizes (i.e. the number of blocks)
It gets a bit trickier here. Just launching 10,000 blocks because we can is not normally the best way to do things. Switching blocks in and out of hardware is costly. Two things to consider are the shared memory used per block and the total number of SPs available; solve for the optimal number from those (a sketch using the occupancy API follows the Thrust note below).

You can find a really good implementation of how to do that in Thrust. It may take a while to figure out what's happening inside the code, though.
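As a rough illustration of point 1 (the kernel and names here are mine, not the answerer's): a 32 x 8 block for a row-major m x n matrix, with the grid obtained by ceiling division in each dimension and a bounds check to absorb the over-spill at the edges.

// Sketch only: one way to apply the "multiples of 32" advice to a matrix.
__global__
void scaleMatrix(const int m, const int n, const double alpha, double *a)
{
    int col = threadIdx.x + blockDim.x * blockIdx.x; // x runs along contiguous columns
    int row = threadIdx.y + blockDim.y * blockIdx.y;

    if (row < m && col < n) {        // guard the partial blocks at the edges
        a[row * n + col] *= alpha;
    }
}

void launchScaleMatrix(const int m, const int n, const double alpha, double *a)
{
    dim3 block(32, 8);                      // x dimension is a multiple of the warp size
    dim3 grid((n + block.x - 1) / block.x,  // ceiling division per dimension
              (m + block.y - 1) / block.y);

    scaleMatrix<<<grid, block>>>(m, n, alpha, a);
}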
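For point 2, a minimal sketch of one way to "solve for the optimal number" automatically: the runtime's occupancy API (added in later CUDA releases, so not something this answer refers to) accounts for the kernel's register and shared-memory usage. myKernel and blockSize are assumptions for illustration.

#include <cuda_runtime.h>

// Sketch only: size the grid to one full wave of resident blocks.
__global__ void myKernel(const int n, float *x)
{
    // Grid-stride loop so a fixed-size grid covers any n
    for (int i = threadIdx.x + blockDim.x * blockIdx.x; i < n; i += blockDim.x * gridDim.x)
        x[i] *= 2.0f;   // placeholder work
}

void launchWithOccupancy(const int n, float *x)
{
    const int blockSize = 256;   // assumed tuned block size
    int device = 0, numSM = 0, blocksPerSM = 0;

    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&numSM, cudaDevAttrMultiProcessorCount, device);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel,
                                                  blockSize, 0 /* dynamic smem */);

    int gridSize = blocksPerSM * numSM;  // enough blocks to fill the device once
    myKernel<<<gridSize, blockSize>>>(n, x);
}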

农村范ル 2024-11-10 20:52:21

I think it's usually best to set the block and grid sizes based on the problem set, especially for optimization purposes. Having extra threads that do nothing doesn't really make sense and can worsen the performance of your programs.
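As a minimal sketch of that approach (the kernel and names below are illustrative, not from this answer): size the grid exactly to the problem with a ceiling division, and guard the handful of surplus threads in the final block so they do nothing.

// Sketch only: exact-fit grid with an over-spill guard in the last block.
__global__
void saxpyKernel(const int n, const float a, const float *x, float *y)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) {                 // surplus threads in the final block exit here
        y[i] = a * x[i] + y[i];
    }
}

void launchSaxpy(const int n, const float a, const float *x, float *y)
{
    const int threadsPerBlock = 256;                          // multiple of the warp size
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock; // just enough blocks to cover n
    saxpyKernel<<<blocks, threadsPerBlock>>>(n, a, x, y);
}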

夜司空 2024-11-10 20:52:21

If you have dynamically sized data sets, then you will likely run into some latency issues while some threads and blocks wait for others to complete.

This site has some great heuristics. Some general highlights (a small sketch pulling them together follows the lists):

Choosing Blocks Per Grid

  • Blocks per grid should be >= number of multiprocessors.
  • The more you use __syncthreads() in your kernels, the more blocks you want (so that one block can run while another waits on the sync).

Choosing Threads Per Block

  • Threads in multiples of warp size (i.e. generally 32)

  • Generally good to choose the number of threads such that the maximum number of threads per block (based on hardware) is a multiple of the number of threads. E.g. with a hardware maximum of 768 resident threads, using 256 threads per block will tend to be better than 512, because three 256-thread blocks fill that capacity exactly while a single 512-thread block leaves 256 thread slots unused.
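Here is a small, hedged sketch pulling these heuristics together (the device query calls are from the CUDA runtime; the block-size choice and the kernel itself are assumptions): query the device, pick a warp-multiple block size, and make sure the grid has at least one block per multiprocessor.

#include <cuda_runtime.h>
#include <algorithm>

// Sketch only: choose launch parameters from the device properties.
__global__ void work(const int n, float *x)
{
    // Grid-stride loop so extra blocks simply find nothing to do
    for (int i = threadIdx.x + blockDim.x * blockIdx.x; i < n; i += blockDim.x * gridDim.x)
        x[i] += 1.0f;   // placeholder work
}

void launchWork(const int n, float *x)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const int threads = 8 * prop.warpSize;               // a multiple of the warp size (256 today)
    int blocks = (n + threads - 1) / threads;            // enough blocks to cover n
    blocks = std::max(blocks, prop.multiProcessorCount); // at least one block per multiprocessor

    work<<<blocks, threads>>>(n, x);
}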
