Figuring out how many blocks and threads a CUDA kernel needs, and how to use them

Published 2024-10-13 16:51:58

I have been trying to figure out how to make what I thought would be a simple kernel to take the average of the values in a 2d matrix, but I am having some issues getting my thought process straight on it.

According to my deviceQuery output, my GPU has 16 MPs, 32 cores/MP, maximum block dimensions of 1024x1024x64, and a max of 1024 threads/block.

So, I am working on processing some large images. Maybe 5000px x 3500px or something like that. One of my kernels is taking an average of some values across all pixels in the image.

The existing code has the images stored as a 2D array [rows][cols]. So that kernel, in C, looks like you'd expect, with a loop over rows and a loop over cols, with the calculation in the middle.

So how do I set up the dimension calculation portion of this code in CUDA? I have looked at the reduction code in the SDK, but that is for a single-dimension array. It doesn't have any mention of how to set up the number of blocks and threads for when you have something 2D.

I am thinking I'd actually need to set it up like so, and this is where I'd like someone to chime in and help:

num_threads=1024;
blocksX = num_cols/sqrt(num_threads);
blocksY = num_rows/sqrt(num_threads);
num_blocks = (num_rows*num_cols)/(blocksX*blocksY);

dim3 dimBlock(blocksX, blocksY, 1);
dim3 dimGrid(num_blocks, 1, 1);

Does this seem to make sense for the setup?

And then in the kernel, to work on a particular row or column, I'd have to use

rowidx = (blockIdx.x*blockDim.x)+threadIdx.x
colidx = (blockIdx.y*blockDim.y)+threadIdx.y

At least I think that would work for getting a row and column.

How would I then access that particular row r and column c in the kernel? In the CUDA programming guide I found the following code:

// Host code
int width = 64, height = 64;
float* devPtr;
size_t pitch;
cudaMallocPitch(&devPtr, &pitch, width * sizeof(float), height);
MyKernel<<<100, 512>>>(devPtr, pitch, width, height);

// Device code
__global__ void MyKernel(float* devPtr, size_t pitch, int width, int height)
{
    for (int r = 0; r < height; ++r)
    {
        float* row = (float*)((char*)devPtr + r * pitch);
        for (int c = 0; c < width; ++c)
        {
            float element = row[c];
        }
    }
}

Which looks similar to how you'd use malloc in C to declare a 2D array, but it doesn't mention how to access that array from your own kernel. I guess in my code, I will use that cudaMallocPitch call, and then perform a memcpy to get my data into the 2D array on the device?
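
I'm guessing the host-side allocation and copy would look roughly like the sketch below (h_image here just stands for my existing row-major host array, so the names are placeholders and I'm not certain this is right):

// Sketch only: allocate pitched device memory and copy a row-major host image into it.
float* d_image;
size_t pitch;
cudaMallocPitch((void**)&d_image, &pitch, num_cols * sizeof(float), num_rows);
cudaMemcpy2D(d_image, pitch,                     // destination pointer and its pitch in bytes
             h_image, num_cols * sizeof(float),  // source pointer and its (tightly packed) pitch
             num_cols * sizeof(float), num_rows, // width in bytes, height in rows
             cudaMemcpyHostToDevice);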

Any tips appreciated! Thanks!

Comments (3)

宛菡 2024-10-20 16:51:58

Recently, I worked through this question in the following fashion.

// Grid and block size
const dim3 blockSize(16,16,1);
const dim3 gridSize(numRows, numCols, 1); 
// kernel call
rgba_to_greyscale<<<gridSize, blockSize>>>(d_rgbaImage, d_greyImage, numRows, numCols);

gridSize = number of blocks
blockSize = threads per block

Here is the corresponding kernel

__global__ void rgba_to_greyscale(const uchar4* const rgbaImage,
                       unsigned char* const greyImage,
                       int numRows, int numCols)
{ 
    int idx = blockIdx.x + blockIdx.y * numRows;
    uchar4 pixel     = rgbaImage[idx]; 
    float  intensity = 0.299f * pixel.x + 0.587f * pixel.y + 0.114f * pixel.z;  
    greyImage[idx]   = static_cast<unsigned char>(intensity);   
}

Good luck!!!

寒冷纷飞旳雪 2024-10-20 16:51:58

For performance applications like this you need to store 2D matrix information as a single array in memory. So if you have an M x N matrix then you can store it in a single array of length M*N.

So if you want to store the 2x2 matrix

(1 , 2)
(3 , 4)

Then you create a single array and initialize the element at row i, column j using the following.

int rows=2;
int cols=2;
float* matrix = (float*)malloc(sizeof(float)*rows*cols);
matrix[i*cols+j]=yourValue;
//element 0,0
matrix[0*cols+0]=1.0;
//element 0,1
matrix[0*cols+1]=2.0;
//element 1,0
matrix[1*cols+0]=3.0;
//element 1,1
matrix[1*cols+1]=4.0;

Taking a 2D array and storing it in a single contiguous piece of memory this way is called storing the data in row-major order. See the Wikipedia article here. Once you change the layout of your data to this kind of format, you can use the reduction that was shown in the SDK, and your code should be a lot faster as you will be able to do more coalesced reads in the GPU kernel code.
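
As a rough illustration (this is not the SDK reduction itself, just a minimal sketch assuming a block size of 256 threads and a device that supports atomicAdd on floats), summing the flattened image so the host can divide by rows*cols might look like:

// Sketch only: sum all pixels of a flattened rows*cols image; average on the host afterwards.
// Assumes blockDim.x == 256 and that *total was zeroed before the launch.
__global__ void sumPixels(const float* data, int n, float* total)
{
    __shared__ float partial[256];
    int tid = threadIdx.x;
    float sum = 0.0f;
    // Grid-stride loop: each thread accumulates a private partial sum.
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += blockDim.x * gridDim.x)
        sum += data[i];
    partial[tid] = sum;
    __syncthreads();
    // Tree reduction within the block.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) partial[tid] += partial[tid + s];
        __syncthreads();
    }
    // One atomic add per block folds the block's sum into the global total.
    if (tid == 0) atomicAdd(total, partial[0]);
}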

浮华 2024-10-20 16:51:58

Below is a short snippet with a simple kernel from my own code. The float pointers are all device pointers. Hope this is helpful.

Defines and helper functions:

#define BLOCK_SIZE 16

int iDivUp(int a, int b){
    return (a % b != 0) ? (a / b + 1) : (a / b);
}

Block size calculation:

dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGridProj(iDivUp(width,BLOCK_SIZE), iDivUp(height,BLOCK_SIZE));

Host call:

calc_residual<<<dimGridProj, dimBlock>>>(d_image1, d_proj1, d_raynorm1, d_resid1, width, height);

Kernel:

__global__ void calc_residual(float *d_imagep, float *d_projp, float *d_raysump, float *d_residualp, int width, int height)
{
    int iy = blockDim.y * blockIdx.y + threadIdx.y;
    if (iy >= height) {
        return;
    }
    int ix = blockDim.x * blockIdx.x + threadIdx.x;
    if (ix >= width) {
        return;
    }
    int idx = iy * width + ix;
    float raysumv = d_raysump[idx];
    if (raysumv > 0.001) {
        d_residualp[idx] = (d_projp[idx] - d_imagep[idx]) / raysumv;
    }
    else {
        d_residualp[idx] = 0;
    }
}