我的内核仅在块 (0,0) 中工作

发布于 2024-09-05 04:13:09 字数 1032 浏览 0 评论 0原文

我正在尝试编写一个简单的矩阵乘法应用程序，使用 CUDA 将两个方阵相乘。我遇到一个问题，我的内核只能在网格的块 (0,0) 中正确计算。

这是我的调用代码：

dim3 dimBlock(4,4,1);
dim3 dimGrid(4,4,1);
//Launch the kernel;
MatrixMulKernel<<<dimGrid,dimBlock>>>(Md,Nd,Pd,Width);

这是我的内核函数

__global__ void MatrixMulKernel(int* Md, int* Nd, int* Pd, int Width)
{
        const int tx = threadIdx.x; 
        const int ty = threadIdx.y;
        const int bx = blockIdx.x;
        const int by = blockIdx.y;
        const int row = (by * blockDim.y + ty);
        const int col = (bx * blockDim.x + tx);

        //Pvalue stores the Pd element that is computed by the thread
        int Pvalue = 0;

        for (int k = 0; k < Width; k++)
        {
            Pvalue += Md[row * Width + k] * Nd[k * Width + col];
        }
        __syncthreads();
        //Write the matrix to device memory each thread writes one element
        Pd[row * Width + col] = Pvalue;

    }

我认为问题可能与内存有关，但我有点迷失了。我应该怎么做才能使这段代码跨多个块工作？

原文

I am trying to write a simple matrixMultiplication application that multiplies two square matrices using CUDA. I am having a problem where my kernel is only computing correctly in block (0,0) of the grid.

This is my invocation code:

dim3 dimBlock(4,4,1);
dim3 dimGrid(4,4,1);
//Launch the kernel;
MatrixMulKernel<<<dimGrid,dimBlock>>>(Md,Nd,Pd,Width);

This is my Kernel function

__global__ void MatrixMulKernel(int* Md, int* Nd, int* Pd, int Width)
{
        const int tx = threadIdx.x; 
        const int ty = threadIdx.y;
        const int bx = blockIdx.x;
        const int by = blockIdx.y;
        const int row = (by * blockDim.y + ty);
        const int col = (bx * blockDim.x + tx);

        //Pvalue stores the Pd element that is computed by the thread
        int Pvalue = 0;

        for (int k = 0; k < Width; k++)
        {
            Pvalue += Md[row * Width + k] * Nd[k * Width + col];
        }
        __syncthreads();
        //Write the matrix to device memory each thread writes one element
        Pd[row * Width + col] = Pvalue;

    }

I think the problem may have something to do with memory but I'm a bit lost. What should I do to make this code work across several blocks?

分享到QQ

分享到微博