CUDA block and grid size efficiency

What is the advised way of dealing with dynamically-sized datasets in CUDA?

Is it a case of 'set the block and grid sizes based on the problem set' or is it worthwhile to assign block dimensions as factors of 2 and have some in-kernel logic to deal with the over-spill?

I can see how this probably matters a lot for the block dimensions, but how much does it matter for the grid dimensions? As I understand it, the actual hardware constraints stop at the block level (i.e. blocks are assigned to SMs that have a set number of SPs, and so can handle a particular warp size).

I've perused Kirk's 'Programming Massively Parallel Processors' but it doesn't really touch on this area.

4 Answers

当梦初醒 2024-11-10 20:52:21

It's usually a case of setting the block size for optimal performance, and the grid size according to the total amount of work. Most kernels have a "sweet spot" number of warps per MP where they work best, and you should do some benchmarking/profiling to see where that is. You probably still need over-spill logic in the kernel because problem sizes are rarely exact multiples of block sizes.

EDIT:
To give a concrete example of how this might be done for a simple kernel (in this case a custom BLAS level 1 dscal type operation done as part of a Cholesky factorization of packed symmetric band matrices):

// Fused square root and dscal operation
__global__ 
void cdivkernel(const int n, double *a)
{
    // Reciprocal square root of the diagonal element, computed once per block
    // by thread 0 and broadcast to the whole block via shared memory
    __shared__ double oneondiagv;

    int imin = threadIdx.x + blockDim.x * blockIdx.x;
    int istride = blockDim.x * gridDim.x;

    if (threadIdx.x == 0) {
        oneondiagv = rsqrt( a[0] );
    }
    __syncthreads();

    // Grid-stride loop: covers all n elements even when the grid is
    // smaller than the problem size
    for(int i=imin; i<n; i+=istride) {
        a[i] *= oneondiagv;
    }
}

To launch this kernel, the execution parameters are calculated as follows:

  1. We allow up to 4 warps per block (so 128 threads). Normally you would fix this at an optimal number, but in this case the kernel is often called on very small vectors, so having a variable block size made some sense.
  2. We then compute the block count according to the total amount of work, up to 112 total blocks, which is the equivalent of 8 blocks per MP on a 14 MP Fermi Tesla. The kernel will iterate via its grid-stride loop if the amount of work exceeds the grid size.

The resulting wrapper function containing the execution parameter calculations and kernel launch looks like this:

// Fused the diagonal element root and dscal operation into
// a single "cdiv" operation
void fusedDscal(const int n, double *a)
{
    // The semibandwidth (column length) determines
    // how many warps are required per column of the 
    // matrix.
    const int warpSize = 32;
    const int maxGridSize = 112; // this is 8 blocks per MP for a Tesla C2050

    int warpCount = (n / warpSize) + (((n % warpSize) == 0) ? 0 : 1);
    int warpPerBlock = max(1, min(4, warpCount));

    // For the cdiv kernel, the block size is allowed to grow to
    // four warps per block, and the block count becomes the warp count over four
    // or the GPU "fill" whichever is smaller
    int threadCount = warpSize * warpPerBlock;
    int blockCount = min( maxGridSize, max(1, warpCount/warpPerBlock) );
    dim3 BlockDim = dim3(threadCount, 1, 1);
    dim3 GridDim  = dim3(blockCount, 1, 1);

    cdivkernel<<< GridDim,BlockDim >>>(n,a);
    errchk( cudaPeekAtLastError() );
}

Perhaps this gives some hints about how to design a "universal" scheme for setting execution parameters against input data size.
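To make that hint a little more concrete, here is a minimal sketch of the same pattern applied to a generic element-wise operation (not from the answer above; the kernel name, helper, and tuning constants threadsPerBlock and maxResidentBlocks are illustrative assumptions): fix the block size at a tuned value, cap the grid at a chosen device "fill", and let a grid-stride loop absorb any remaining work.

#include <algorithm>

// Sketch only: generic grid-stride kernel plus a launch helper following the
// same sizing pattern as fusedDscal above.
__global__
void scaleKernel(const int n, const double alpha, double *a)
{
    int imin    = threadIdx.x + blockDim.x * blockIdx.x;
    int istride = blockDim.x * gridDim.x;

    // Grid-stride loop: correct for any n, even if the grid is smaller than n
    for (int i = imin; i < n; i += istride) {
        a[i] *= alpha;
    }
}

void launchScale(const int n, const double alpha, double *a)
{
    const int threadsPerBlock   = 128; // tuned "sweet spot"; device dependent
    const int maxResidentBlocks = 112; // e.g. 8 blocks per MP on a 14 MP device

    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // ceiling division
    blocks = std::min(std::max(blocks, 1), maxResidentBlocks); // cap at the GPU "fill"

    scaleKernel<<<blocks, threadsPerBlock>>>(n, alpha, a);
}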

我ぃ本無心為│何有愛 2024-11-10 20:52:21

Ok I guess we are dealing with two questions here.

1) Good way to assign block sizes (i.e. the number of threads)
This usually depends on the kind of data you are dealing with. Are you dealing with vectors? Are you dealing with matrices? The suggested way is to keep the number of threads in multiples of 32. So when dealing with vectors, launching 256 x 1 or 512 x 1 blocks may be fine, and similarly when dealing with matrices, 32 x 8 or 32 x 16 (see the sketches after this answer).

2) Good way to assign grid sizes (i.e. the number of blocks)
It gets a bit trickier here. Just launching 10,000 blocks because we can is not normally the best way to do things. Switching blocks in and out of hardware is costly. Two things to consider are the shared memory used per block and the total number of SPs available; solve for the optimal number from those (a sketch using the occupancy API follows the Thrust note below).

You can find a really good implementation of how to do that in Thrust. It may take a while to figure out what's happening inside the code, though.
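As a rough illustration of point 1 (the kernel and names here are mine, not the answerer's): a 32 x 8 block for a row-major m x n matrix, with the grid obtained by ceiling division in each dimension and a bounds check to absorb the over-spill at the edges.

// Sketch only: one way to apply the "multiples of 32" advice to a matrix.
__global__
void scaleMatrix(const int m, const int n, const double alpha, double *a)
{
    int col = threadIdx.x + blockDim.x * blockIdx.x; // x runs along contiguous columns
    int row = threadIdx.y + blockDim.y * blockIdx.y;

    if (row < m && col < n) {        // guard the partial blocks at the edges
        a[row * n + col] *= alpha;
    }
}

void launchScaleMatrix(const int m, const int n, const double alpha, double *a)
{
    dim3 block(32, 8);                      // x dimension is a multiple of the warp size
    dim3 grid((n + block.x - 1) / block.x,  // ceiling division per dimension
              (m + block.y - 1) / block.y);

    scaleMatrix<<<grid, block>>>(m, n, alpha, a);
}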
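For point 2, a minimal sketch of one way to "solve for the optimal number" automatically: the runtime's occupancy API (added in later CUDA releases, so not something this answer refers to) accounts for the kernel's register and shared-memory usage. myKernel and blockSize are assumptions for illustration.

#include <cuda_runtime.h>

// Sketch only: size the grid to one full wave of resident blocks.
__global__ void myKernel(const int n, float *x)
{
    // Grid-stride loop so a fixed-size grid covers any n
    for (int i = threadIdx.x + blockDim.x * blockIdx.x; i < n; i += blockDim.x * gridDim.x)
        x[i] *= 2.0f;   // placeholder work
}

void launchWithOccupancy(const int n, float *x)
{
    const int blockSize = 256;   // assumed tuned block size
    int device = 0, numSM = 0, blocksPerSM = 0;

    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&numSM, cudaDevAttrMultiProcessorCount, device);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel,
                                                  blockSize, 0 /* dynamic smem */);

    int gridSize = blocksPerSM * numSM;  // enough blocks to fill the device once
    myKernel<<<gridSize, blockSize>>>(n, x);
}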

农村范ル 2024-11-10 20:52:21

I think it's usually best to set the block and grid sizes based on the problem set, especially for optimization purposes. Having extra threads that do nothing doesn't really make sense and can worsen the performance of your programs.
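As a minimal sketch of that approach (the kernel and names below are illustrative, not from this answer): size the grid exactly to the problem with a ceiling division, and guard the handful of surplus threads in the final block so they do nothing.

// Sketch only: exact-fit grid with an over-spill guard in the last block.
__global__
void saxpyKernel(const int n, const float a, const float *x, float *y)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) {                 // surplus threads in the final block exit here
        y[i] = a * x[i] + y[i];
    }
}

void launchSaxpy(const int n, const float a, const float *x, float *y)
{
    const int threadsPerBlock = 256;                          // multiple of the warp size
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock; // just enough blocks to cover n
    saxpyKernel<<<blocks, threadsPerBlock>>>(n, a, x, y);
}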

夜司空 2024-11-10 20:52:21

If you have dynamically sized data sets, then you will likely run into some latency issues while some threads and blocks wait for others to complete.

This site has some great heuristics. Some general highlights (a small sketch pulling them together follows the lists):

Choosing Blocks Per Grid

  • Blocks per grid should be >= number of multiprocessors.
  • The more you use __syncthreads() in your kernels, the more blocks you want (so that one block can run while another waits on the sync).

Choosing Threads Per Block

  • Threads in multiples of warp size (i.e. generally 32)

  • Generally good to choose the number of threads such that the maximum number of threads per block (based on hardware) is a multiple of the number of threads. E.g. with a hardware maximum of 768 resident threads, using 256 threads per block will tend to be better than 512, because three 256-thread blocks fill that capacity exactly while a single 512-thread block leaves 256 thread slots unused.
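Here is a small, hedged sketch pulling these heuristics together (the device query calls are from the CUDA runtime; the block-size choice and the kernel itself are assumptions): query the device, pick a warp-multiple block size, and make sure the grid has at least one block per multiprocessor.

#include <cuda_runtime.h>
#include <algorithm>

// Sketch only: choose launch parameters from the device properties.
__global__ void work(const int n, float *x)
{
    // Grid-stride loop so extra blocks simply find nothing to do
    for (int i = threadIdx.x + blockDim.x * blockIdx.x; i < n; i += blockDim.x * gridDim.x)
        x[i] += 1.0f;   // placeholder work
}

void launchWork(const int n, float *x)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const int threads = 8 * prop.warpSize;               // a multiple of the warp size (256 today)
    int blocks = (n + threads - 1) / threads;            // enough blocks to cover n
    blocks = std::max(blocks, prop.multiProcessorCount); // at least one block per multiprocessor

    work<<<blocks, threads>>>(n, x);
}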
