Matrix Multiplication CUDA

I have been reading through several websites and even used NVIDIA's code as a guide, but I am still getting the wrong answer. The main program asks the user for a size, displays A and B, and then displays the resulting matrix C. However, say I run a 2x2 matrix for both A and B; this is my sample output:

Matrix A
0.000000 8.000000
2.000000 2.000000


Matrix B
3.000000 1.000000
5.000000 7.000000


Matrix C (Results)
0.000000 9.000000
7.000000 4.000000

But that's incorrect. It should be:

40.000 56.000
16.000 16.000
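
(You can check these by hand from the row-by-column rule: C[0][0] = 0*3 + 8*5 = 40, C[0][1] = 0*1 + 8*7 = 56, C[1][0] = 2*3 + 2*5 = 16, and C[1][1] = 2*1 + 2*7 = 16.)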

I changed the values from decimals to whole numbers so that it would be easier to check, and I found that the result is incorrect. I do not understand why it would be wrong, especially since I took the kernel straight from their code sample.

#ifndef _MATRIXMUL_KERNEL_H_
#define _MATRIXMUL_KERNEL_H_

#include <stdio.h>

// Thread block size
#define BLOCK_SIZE 16
#define TILE_SIZE  16



// CUDA Kernel
__global__ void matrixMul(float* C, float* A, float* B, int wA, int wB)
{
    // Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;

    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Index of the first sub-matrix of A processed by the block
    int aBegin = wA * BLOCK_SIZE * by;

    // Index of the last sub-matrix of A processed by the block
    int aEnd = aBegin + wA - 1;

    // Step size used to iterate through the sub-matrices of A
    int aStep = BLOCK_SIZE;

    // Index of the first sub-matrix of B processed by the block
    int bBegin = BLOCK_SIZE * bx;

    // Step size used to iterate through the sub-matrices of B
    int bStep = BLOCK_SIZE * wB;

    // Accumulator for the element of C computed by this thread
    float Csub = 0;

    // Loop over all the sub-matrices of A and B
    // required to compute the block sub-matrix
    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep)
    {
        // Shared memory arrays used to store the sub-matrices of A and B
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        // Load the matrices from global memory to shared memory;
        // each thread loads one element of each matrix
        As[ty][tx] = A[a + wA * ty + tx];
        Bs[ty][tx] = B[b + wB * ty + tx];

        // Synchronize to make sure the matrices are loaded
        __syncthreads();

        // Multiply the two matrices together;
        // each thread computes one element of the block sub-matrix
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += As[ty][k] * Bs[k][tx];

        // Synchronize to make sure that the preceding computation is done
        // before loading two new sub-matrices of A and B in the next iteration
        __syncthreads();
    }

    // Write the block sub-matrix to device memory;
    // each thread writes one element
    int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
    C[c + wB * ty + tx] = Csub;
}

#endif // #ifndef _MATRIXMUL_KERNEL_H_

Host code:

    // perform the calculation
    // set up execution parameters
    dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
    dim3 grid(c.colSize / threads.x, c.rowSize / threads.y);

    // execute the kernel
    matrixMul<<< grid, threads >>>(deviceMatrixC, deviceMatrixA, deviceMatrixB, a.colSize, b.colSize);

Thanks for your help,
Dan


Comments (1)

怎言笑 2025-01-01 19:54:18

The code you are using implicitly requires that the sizes of the matrices be round multiples of the block size (16x16 in this case). The inner-product calculation processes a tile width at a time without checking for out-of-bounds memory accesses. For this reason, 2x2 matrices will not work.
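
To make that concrete: every 16x16 block unconditionally executes As[ty][tx] = A[a + wA * ty + tx] in all 256 of its threads, so with a 2x2 A (wA = 2) the indices reach A[2*15 + 15] = A[45] even though A holds only 4 elements. There is a second problem in your launch parameters: grid(c.colSize / threads.x, c.rowSize / threads.y) uses integer division, so a 2x2 C yields a 0x0 grid, which is an invalid launch configuration. The kernel never runs at all, and C shows whatever happened to be in that device memory.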

If you try running the kernel with a 16x16 input (for example, by zero-padding your 2x2 case to 16x16), you should be able to confirm the result.
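
If padding is awkward, the other standard fix is a kernel that guards every access against the real matrix dimensions. Below is a minimal sketch of that approach, not code from the question; matrixMulBounded and the hA/wA/wB parameters are illustrative names, and row-major float matrices are assumed:

// Hypothetical bounds-checked (non-tiled) kernel:
// C (hA x wB) = A (hA x wA) * B (wA x wB), all row-major.
__global__ void matrixMulBounded(const float* A, const float* B, float* C,
                                 int hA, int wA, int wB)
{
    // One thread per element of C
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    // Threads that fall outside C do nothing
    if (row >= hA || col >= wB)
        return;

    float sum = 0.0f;
    for (int k = 0; k < wA; ++k)
        sum += A[row * wA + k] * B[k * wB + col];

    C[row * wB + col] = sum;
}

On the host side, round the grid size up instead of truncating down, so small matrices still get at least one block:

    dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
    dim3 grid((wB + threads.x - 1) / threads.x,
              (hA + threads.y - 1) / threads.y);
    matrixMulBounded<<< grid, threads >>>(deviceMatrixA, deviceMatrixB, deviceMatrixC, hA, wA, wB);

This drops the shared-memory tiling, so it is slower for large matrices, but it is correct for any size and is the easier version to verify against your 2x2 case.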
