我的内核仅在块 (0,0) 中工作
我正在尝试编写一个简单的矩阵乘法应用程序,使用 CUDA 将两个方阵相乘。我遇到一个问题,我的内核只能在网格的块 (0,0) 中正确计算。
这是我的调用代码:
dim3 dimBlock(4,4,1);
dim3 dimGrid(4,4,1);
//Launch the kernel;
MatrixMulKernel<<<dimGrid,dimBlock>>>(Md,Nd,Pd,Width);
这是我的内核函数
__global__ void MatrixMulKernel(int* Md, int* Nd, int* Pd, int Width)
{
const int tx = threadIdx.x;
const int ty = threadIdx.y;
const int bx = blockIdx.x;
const int by = blockIdx.y;
const int row = (by * blockDim.y + ty);
const int col = (bx * blockDim.x + tx);
//Pvalue stores the Pd element that is computed by the thread
int Pvalue = 0;
for (int k = 0; k < Width; k++)
{
Pvalue += Md[row * Width + k] * Nd[k * Width + col];
}
__syncthreads();
//Write the matrix to device memory each thread writes one element
Pd[row * Width + col] = Pvalue;
}
我认为问题可能与内存有关,但我有点迷失了。我应该怎么做才能使这段代码跨多个块工作?
I am trying to write a simple matrixMultiplication application that multiplies two square matrices using CUDA. I am having a problem where my kernel is only computing correctly in block (0,0) of the grid.
This is my invocation code:
dim3 dimBlock(4,4,1);
dim3 dimGrid(4,4,1);
//Launch the kernel;
MatrixMulKernel<<<dimGrid,dimBlock>>>(Md,Nd,Pd,Width);
This is my Kernel function
__global__ void MatrixMulKernel(int* Md, int* Nd, int* Pd, int Width)
{
const int tx = threadIdx.x;
const int ty = threadIdx.y;
const int bx = blockIdx.x;
const int by = blockIdx.y;
const int row = (by * blockDim.y + ty);
const int col = (bx * blockDim.x + tx);
//Pvalue stores the Pd element that is computed by the thread
int Pvalue = 0;
for (int k = 0; k < Width; k++)
{
Pvalue += Md[row * Width + k] * Nd[k * Width + col];
}
__syncthreads();
//Write the matrix to device memory each thread writes one element
Pd[row * Width + col] = Pvalue;
}
I think the problem may have something to do with memory but I'm a bit lost. What should I do to make this code work across several blocks?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
问题出在我的 CUDA 内核调用上。对于正在处理的矩阵来说,网格太小了。
The problem was with my CUDA kernel invocation. The grid was far too small for the matrices being processed.