CUDA: how to get grid, block, and thread sizes and parallelize a non-square matrix calculation

Posted on 2024-11-01 06:10:43

I am new to CUDA and need help understanding some things. I need help parallelizing these two for loops. Specifically, how do I set up dimBlock and dimGrid to make this run faster? I know this looks like the vector add example in the SDK, but that example is only for square matrices, and when I try to modify that code for my 128 x 1024 matrix it doesn't work properly.

__global__ void mAdd(float* A, float* B, float* C)
{
    for(int i = 0; i < 128; i++)
    {
        for(int j = 0; j < 1024; j++)
        {
            C[i * 1024 + j] = A[i * 1024 + j] + B[i * 1024 + j];
        }
    }
}

This code is part of a larger loop and is the simplest portion of the code, so I decided to try to parallelize this and learn CUDA at the same time. I have read the guides but still do not understand how to choose the proper number of grids/blocks/threads and use them effectively.

我恋#小黄人 2024-11-08 06:10:43

As you have written it, that kernel is completely serial. Every thread launched to execute it is going to perform the same work.

The main idea behind CUDA (and OpenCL and other similar "single program, multiple data" type programming models) is that you take a "data parallel" operation - so one where the same, largely independent, operation must be performed many times - and write a kernel which performs that operation. A large number of (semi)autonomous threads are then launched to perform that operation across the input data set.

In your array addition example, the data parallel operation is

C[k] = A[k] + B[k];

for all k from 0 up to (but not including) 128 * 1024. Each addition operation is completely independent and has no ordering requirements, and therefore can be performed by a different thread. To express this in CUDA, one might write the kernel like this:

__global__ void mAdd(float* A, float* B, float* C, int n)
{
    int k = threadIdx.x + blockIdx.x * blockDim.x;

    if (k < n)
        C[k] = A[k] + B[k];
}

[disclaimer: code written in browser, not tested, use at own risk]

Here, the inner and outer loop from the serial code are replaced by one CUDA thread per operation, and I have added a limit check in the code so that in cases where more threads are launched than required operations, no buffer overflow can occur. If the kernel is then launched like this:

const int n = 128 * 1024;
int blocksize = 512; // value usually chosen by tuning and hardware constraints
int nblocks = n / blocksize; // value determined by block size and total work

// A, B and C must point to device memory
mAdd<<<nblocks, blocksize>>>(A, B, C, n);

Then 256 blocks, each containing 512 threads, will be launched onto the GPU hardware to perform the array addition operation in parallel. Note that if the input data size were not expressible as a nice round multiple of the block size, the number of blocks would need to be rounded up to cover the full input data set.
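To make the rounding-up concrete, here is a minimal host-side sketch. The names blocksFor and addOnDevice are hypothetical helpers made up for this illustration, and the block size of 512 is simply carried over from the launch above. It has not been tuned or benchmarked, so treat it as a starting point rather than a reference implementation.

```cuda
#include <cuda_runtime.h>

// Same element-wise kernel as in the answer: one thread per output element.
__global__ void mAdd(const float* A, const float* B, float* C, int n)
{
    int k = threadIdx.x + blockIdx.x * blockDim.x;
    if (k < n)              // guards the surplus threads in the last block
        C[k] = A[k] + B[k];
}

// Hypothetical helper: number of blocks needed to cover n elements,
// rounded up so that a partial final block is still launched.
inline int blocksFor(int n, int blocksize)
{
    return (n + blocksize - 1) / blocksize;
}

// Hypothetical host wrapper: copies hA and hB to the device, runs mAdd,
// and copies the result back into hC. Returns the first error encountered.
inline cudaError_t addOnDevice(const float* hA, const float* hB, float* hC, int n)
{
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;

    cudaError_t err = cudaMalloc(&dA, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&dB, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&dC, bytes);
    if (err == cudaSuccess) err = cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    if (err == cudaSuccess) {
        const int blocksize = 512;  // assumed value, taken from the answer
        mAdd<<<blocksFor(n, blocksize), blocksize>>>(dA, dB, dC, n);
        err = cudaGetLastError();   // reports launch configuration errors
    }
    // This copy also waits for the kernel to finish.
    if (err == cudaSuccess) err = cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dA);  // freeing a null pointer is a no-op
    cudaFree(dB);
    cudaFree(dC);
    return err;
}
```

With n = 128 * 1024 and a block size of 512, blocksFor gives the same 256 blocks as the plain division. With n = 1000 it gives 2 blocks, and the `k < n` check in the kernel stops the 24 surplus threads from touching memory.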

All of the above is a hugely simplified overview of the CUDA paradigm for a very trivial operation, but perhaps it gives enough insight for you to continue yourself. CUDA is rather mature these days and there is a lot of good, free educational material floating around the web you can probably use to further illuminate many of the aspects of the programming model I have glossed over in this answer.
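Since the question asks specifically about dimBlock and dimGrid for a 128 x 1024 matrix, the same idea can also be sketched as a 2D launch. This is only an illustration: mAdd2D, gridFor and launchMAdd2D are hypothetical names, the 32 x 8 block shape is an assumed starting point rather than a tuned value, and the matrix is assumed to be stored row-major as in the question's indexing.

```cuda
#include <cuda_runtime.h>

// 2D variant: one thread per matrix element, addressed by (row, col).
// x is mapped to columns so neighbouring threads touch neighbouring addresses.
__global__ void mAdd2D(const float* A, const float* B, float* C, int rows, int cols)
{
    int col = threadIdx.x + blockIdx.x * blockDim.x;
    int row = threadIdx.y + blockIdx.y * blockDim.y;

    if (row < rows && col < cols) {   // bounds check in both dimensions
        int k = row * cols + col;     // row-major index, as in the question
        C[k] = A[k] + B[k];
    }
}

// Hypothetical helper: grid dimensions that cover a rows x cols matrix,
// rounding up independently in each dimension.
inline dim3 gridFor(int rows, int cols, dim3 block)
{
    return dim3((cols + block.x - 1) / block.x,
                (rows + block.y - 1) / block.y);
}

// Hypothetical launcher; dA, dB and dC must already be device pointers.
inline cudaError_t launchMAdd2D(const float* dA, const float* dB, float* dC,
                                int rows, int cols)
{
    dim3 dimBlock(32, 8);             // 256 threads per block (assumed, not tuned)
    dim3 dimGrid = gridFor(rows, cols, dimBlock);
    mAdd2D<<<dimGrid, dimBlock>>>(dA, dB, dC, rows, cols);
    return cudaGetLastError();
}
```

For rows = 128 and cols = 1024 this yields a 32 x 16 grid of 32 x 8 blocks. Nothing here requires the matrix to be square; the grid is sized separately along each dimension. For a plain element-wise add, the 1D kernel in the answer does the same work with less index arithmetic.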
