Figuring out how many blocks and threads a CUDA kernel needs, and how to use them
I have been trying to figure out how to make what I thought would be a simple kernel to take the average of the values in a 2D matrix, but I am having some issues getting my thought process straight on it.
According to my deviceQuery output, my GPU has 16 MPs, 32 cores/MP, a maximum block size of 1024x1024x64, and a maximum of 1024 threads/block.
So, I am working on processing some large images, maybe 5000px x 3500px or something like that. One of my kernels takes an average of some values across all pixels in the image.
The existing code has the images stored as a 2D array [rows][cols]. So that kernel, in C, looks like you'd expect, with a loop over rows and a loop over cols, and the calculation in the middle.
So how do I set up the dimension-calculation portion of this code in CUDA? I have looked at the reduction code in the SDK, but that is for a single-dimension array. It doesn't mention how to set up the number of blocks and threads when you have something 2D.
I am thinking I'd actually need to set it up like so, and this is where I'd like someone to chime in and help:
num_threads=1024;
blocksX = num_cols/sqrt(num_threads);
blocksY = num_rows/sqrt(num_threads);
num_blocks = (num_rows*num_cols)/(blocksX*blocksY);
dim3 dimBlock(blocksX, blocksY, 1);
dim3 dimGrid(num_blocks, 1, 1);
Does this seem to make sense for the setup?
And then in the kernel, to work on a particular row or column, I'd have to use
rowidx = (blockIdx.x*blockDim.x)+threadIdx.x
colidx = (blockIdx.y*blockDim.y)+threadIdx.y
At least I think that would work for getting a row and column.
How would I then access that particular row r and column c in the kernel? In the CUDA programming guide I found the following code:
// Host code
int width = 64, height = 64;
float* devPtr;
size_t pitch;
cudaMallocPitch(&devPtr, &pitch, width * sizeof(float), height);
MyKernel<<<100, 512>>>(devPtr, pitch, width, height);

// Device code
__global__ void MyKernel(float* devPtr, size_t pitch, int width, int height)
{
for (int r = 0; r < height; ++r)
{
float* row = (float*)((char*)devPtr + r * pitch);
for (int c = 0; c < width; ++c)
{
float element = row[c];
}
}
}
Which looks similar to how you'd use malloc in C to declare a 2D array, but it doesn't mention how to access that array in your own kernel. I guess in my code, I will use that cudaMallocPitch call, and then perform a memcpy to get my data into the 2D array on the device?
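For reference, I imagine that host-side flow would look roughly like the sketch below, with cudaMemcpy2D as the pitched counterpart of memcpy (h_img is just a placeholder name for my row-major host image):

float* devPtr;
size_t pitch;
cudaMallocPitch((void**)&devPtr, &pitch, width * sizeof(float), height);

// Copy each host row into the corresponding (possibly padded) device row.
cudaMemcpy2D(devPtr, pitch,                    // destination and its row stride in bytes
             h_img, width * sizeof(float),     // source and its row stride in bytes
             width * sizeof(float), height,    // bytes per row to copy, number of rows
             cudaMemcpyHostToDevice);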
Any tips appreciated! Thanks!
Comments (3)
Recently, I figured this question out in the following fashion.
gridsize = number of blocks
blocksize = threads per block
Here is the corresponding kernel
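A minimal sketch of what such a kernel and its launch could look like, here summing n values so the mean can be taken by dividing by n afterwards (d_data, d_sum, and the block size are illustrative choices, not necessarily the original code):

// blocksize = threads per block, gridsize = number of blocks needed to cover n elements
// int blocksize = 256;
// int gridsize  = (n + blocksize - 1) / blocksize;
// sumKernel<<<gridsize, blocksize>>>(d_data, d_sum, n);

__global__ void sumKernel(const float* data, float* sum, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (idx < n)
        atomicAdd(sum, data[idx]);  // naive accumulation (float atomicAdd needs compute capability 2.0+);
                                    // divide by n on the host to get the average
}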
Good luck!!!
For performance applications like this you need to store the 2D matrix information as a single array in memory. So if you have an M x N matrix, you can store it in a single array of length M*N.
So if you want to store a 2x2 matrix, you create a single array and initialize the element at row i, column j using the following.
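A small sketch of that indexing (the actual values are whatever your data holds):

int M = 2, N = 2;                    // rows, columns
float A[2 * 2];                      // the 2x2 matrix as one flat array

for (int i = 0; i < M; ++i)          // row index
    for (int j = 0; j < N; ++j)      // column index
        A[i * N + j] = 0.0f;         // element (i, j) lives at flat offset i*N + j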
This way of taking a 2D array and storing it in a single contiguous piece of memory is called storing the data in row-major order. See the Wikipedia article here. Once you change the layout of your data to this format, you can use the reduction shown in the SDK, and your code should be a lot faster because you will be able to do more coalesced reads in the GPU kernel code.
Below is a short snippet with a simple kernel from my own code. The float pointers are all device pointers. Hope this is helpful.
Defines and help functions:
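(A sketch of the kind of defines and helpers this refers to; BLOCK_SIZE, divUp, and CUDA_CHECK are assumed names rather than the original ones.)

#include <cuda_runtime.h>
#include <stdio.h>

#define BLOCK_SIZE 16                  // 16 x 16 = 256 threads per block

// Integer ceiling division: how many blocks are needed to cover n items.
static int divUp(int n, int block) { return (n + block - 1) / block; }

// Minimal error check around CUDA runtime calls.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess)                                       \
            fprintf(stderr, "CUDA error: %s (%s:%d)\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
    } while (0)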
Block size calculation:
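(Along these lines, for an image of rows x cols pixels; the names follow the sketch above.)

dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);                           // threads per block: x spans columns, y spans rows
dim3 dimGrid(divUp(cols, BLOCK_SIZE), divUp(rows, BLOCK_SIZE));  // enough blocks to cover the whole image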
Host call:
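(The launch itself might look like this; d_in, d_out, and scaleKernel are placeholder names.)

scaleKernel<<<dimGrid, dimBlock>>>(d_in, d_out, rows, cols, 0.5f);
CUDA_CHECK(cudaGetLastError());       // catches launch-configuration errors
CUDA_CHECK(cudaDeviceSynchronize());  // catches errors raised while the kernel runs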
Kernel:
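(A simple per-pixel kernel in the same spirit, sketched over a flat row-major image.)

__global__ void scaleKernel(const float* in, float* out, int rows, int cols, float s)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // x direction covers columns
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // y direction covers rows
    if (row < rows && col < cols)                      // guard the partial blocks at the image edges
        out[row * cols + col] = s * in[row * cols + col];
}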