Matrix multiplication CUDA
I have been reading through several websites and even used NVIDIA's code as a guide, but I am still getting the wrong answer. main asks the user for a size, displays A and B, and then displays the resulting matrix C. For example, if I run with 2x2 matrices for both A and B, this is my sample output:
Matrix A
0.000000 8.000000
2.000000 2.000000
Matrix B
3.000000 1.000000
5.000000 7.000000
Matrix C (Results)
0.000000 9.000000
7.000000 4.000000
But that's incorrect. It should be:
40.000 56.000
16.000 16.000
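(By hand: row 0 of C is 0*3 + 8*5 = 40 and 0*1 + 8*7 = 56; row 1 is 2*3 + 2*5 = 16 and 2*1 + 2*7 = 16.)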
I changed the values from decimals to whole numbers to make the results easier to check, and that is how I found they are incorrect. I do not understand why it would be wrong, especially since I took the kernel straight from their code sample.
#ifndef _MATRIXMUL_KERNEL_H_
#define _MATRIXMUL_KERNEL_H_

#include <stdio.h>

// Thread block size
#define BLOCK_SIZE 16
#define TILE_SIZE  16

// CUDA kernel
__global__ void matrixMul(float* C, float* A, float* B, int wA, int wB)
{
    // Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;

    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Index of the first sub-matrix of A processed by the block
    int aBegin = wA * BLOCK_SIZE * by;

    // Index of the last sub-matrix of A processed by the block
    int aEnd = aBegin + wA - 1;

    // Step size used to iterate through the sub-matrices of A
    int aStep = BLOCK_SIZE;

    // Index of the first sub-matrix of B processed by the block
    int bBegin = BLOCK_SIZE * bx;

    // Step size used to iterate through the sub-matrices of B
    int bStep = BLOCK_SIZE * wB;

    float Csub = 0;

    // Loop over all the sub-matrices of A and B
    // required to compute the block sub-matrix
    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep)
    {
        // Declaration of the shared memory array As
        // used to store the sub-matrix of A
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];

        // Declaration of the shared memory array Bs
        // used to store the sub-matrix of B
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        // Load the matrices from global memory to shared memory;
        // each thread loads one element of each matrix
        As[ty][tx] = A[a + wA * ty + tx];
        Bs[ty][tx] = B[b + wB * ty + tx];

        // Synchronize to make sure the matrices are loaded
        __syncthreads();

        // Multiply the two matrices together;
        // each thread computes one element of the block sub-matrix
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += As[ty][k] * Bs[k][tx];

        // Synchronize to make sure that the preceding computation is
        // done before loading two new sub-matrices of A and B in the
        // next iteration
        __syncthreads();
    }

    // Write the block sub-matrix to device memory;
    // each thread writes one element
    int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
    C[c + wB * ty + tx] = Csub;
}

#endif // #ifndef _MATRIXMUL_KERNEL_H_
Host code:
//perform the calculation
//setup execution parameters
dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
dim3 grid(c.colSize / threads.x, c.rowSize / threads.y);
// execute the kernel
matrixMul<<< grid, threads >>>(deviceMatrixC, deviceMatrixA, deviceMatrixB, a.colSize, b.colSize);
Thanks for your help,
Dan
Comments (1)
The code you are using implicitly requires that the matrix dimensions be integer multiples of the block size (16x16 in this case). The inner product calculation processes a full tile width at a time without checking for out-of-bounds memory access, so 2x2 matrices will not work. Note also that your host launch computes grid(c.colSize / threads.x, c.rowSize / threads.y) with integer division, which for a 2x2 matrix is grid(2/16, 2/16), i.e. zero blocks, so the kernel never actually runs and C keeps whatever happened to be in device memory.
If you try running the kernel with a 16x16 input (for example, by zero-padding your 2x2 case to 16x16), you should be able to confirm the result.
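For reference, below is a minimal sketch of a bounds-checked variant of the same tiled kernel. This is not the SDK code: the name matrixMulGuarded and the extra hA parameter (rows of A) are additions of mine, since the original signature carries only the two widths. Out-of-range loads are replaced with zeros and the final store is guarded, so sizes that are not multiples of BLOCK_SIZE (including 2x2) work, at the cost of a branch per load.

__global__ void matrixMulGuarded(float* C, const float* A, const float* B,
                                 int hA, int wA, int wB)
{
    // Row and column of the C element this thread is responsible for
    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;

    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    float Csub = 0.0f;

    // Walk across the shared dimension (wA) one tile at a time,
    // rounding the tile count up so partial tiles are covered
    int numTiles = (wA + BLOCK_SIZE - 1) / BLOCK_SIZE;
    for (int t = 0; t < numTiles; ++t)
    {
        int aCol = t * BLOCK_SIZE + threadIdx.x;
        int bRow = t * BLOCK_SIZE + threadIdx.y;

        // Guarded loads: out-of-range elements become 0, which
        // contributes nothing to the inner product
        As[threadIdx.y][threadIdx.x] = (row < hA && aCol < wA) ? A[row * wA + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < wA && col < wB) ? B[bRow * wB + col] : 0.0f;

        __syncthreads();

        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += As[threadIdx.y][k] * Bs[k][threadIdx.x];

        __syncthreads();
    }

    // Guarded store: only threads inside the output matrix write
    if (row < hA && col < wB)
        C[row * wB + col] = Csub;
}

The host launch then needs ceiling division as well, otherwise a 2x2 problem still produces a zero-block grid:

dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
dim3 grid((wB + BLOCK_SIZE - 1) / BLOCK_SIZE, (hA + BLOCK_SIZE - 1) / BLOCK_SIZE);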