CUDA multiplication
The serial code snippet looks like this:

int i, j;
for (j = 0; j < ny; j++)
{
    for (i = 0; i < nx; i++)
    {
        x[i + j*nx] *= y[i];
    }
}
I converted this to CUDA using this kernel:

int tid = blockIdx.x * blockDim.x + threadIdx.x;
int i, j;
for (tid = 0; tid < nx*ny; tid++)
{
    j = tid / nx;
    i = tid - j*nx;
    x[tid] *= y[i];
}
However, the GPU kernel does not give any speedup. Any suggestions for a better solution? Thanks in advance.
Comments (4)
If this is the serial code:
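int i, j;
for (j = 0; j < ny; j++)
{
    for (i = 0; i < nx; i++)
    {
        x[i + j*nx] *= y[i];
    }
}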
then you should be doing this:
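A minimal sketch along those lines, keeping the question's indexing but letting each thread handle exactly one element (the bounds check is an added precaution, not from the original):

int tid = blockIdx.x * blockDim.x + threadIdx.x;
if (tid < nx * ny)
{
    int j = tid / nx;
    int i = tid - j*nx;   // column index, equivalent to tid % nx
    x[tid] *= y[i];       // each thread performs exactly one multiply
}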
What you're doing is fairly bizarre: you're instructing each thread of the CUDA kernel to iterate over all values of tid between 0 and nx*ny, and compute the same function as your CPU version! Moreover, instead of just iterating over the indices, you're actually doing the loop less efficiently than you did for the CPU version; in other words, each thread does the same work as the single CPU thread, only less efficiently. It's no wonder that this is slower; it should be much, much slower. Your CUDA kernel is:
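int tid = blockIdx.x * blockDim.x + threadIdx.x;
int i, j;
for (tid = 0; tid < nx*ny; tid++)
{
    j = tid / nx;
    i = tid - j*nx;
    x[tid] *= y[i];
}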
This does nx*ny iterations, same as your host code, for each thread; you lose all benefit of the parallelism, since each thread is doing the same thing; you would get the same performance using one thread on the GPU, and the same result!
If this is the verbatim code from your CUDA source file, you need to change it and redo the comparison; if this is code you have written to help explain what your code is doing for a lay non-CUDA audience, then you need to present your actual CUDA code so that we can see what's going on... as it is, the performance analysis I have done - the trivial one - is all you can expect.
Given your comment to this answer, implying you are intending to launch one thread per computation, the correct CUDA implementation would just be:
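A minimal sketch of that one-thread-per-element version (the kernel name, the device pointers d_x and d_y, and the 256-thread block size are illustrative):

__global__ void multiply(float *x, const float *y, int nx, int ny)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < nx * ny)
        x[tid] *= y[tid % nx];   // one multiply per thread
}

// launched with one thread per element of x:
int n = nx * ny;
multiply<<<(n + 255) / 256, 256>>>(d_x, d_y, nx, ny);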
If you were intending for each thread to perform more than one computation per kernel launch, then you would size the grid to "fill" each of the SMs on the target GPU, rather than using the same number of threads as the input size, and then do something like:
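A sketch of that variant, using a grid-stride loop so a fixed-size grid covers the whole input (the kernel name and the grid-sizing factors are illustrative):

__global__ void multiply_strided(float *x, const float *y, int nx, int n)
{
    int stride = blockDim.x * gridDim.x;
    // each thread processes elements tid, tid + stride, tid + 2*stride, ...
    for (int tid = blockIdx.x * blockDim.x + threadIdx.x; tid < n; tid += stride)
        x[tid] *= y[tid % nx];
}

// e.g. a few blocks per SM rather than one thread per element:
int numSMs;
cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, 0);
multiply_strided<<<numSMs * 8, 256>>>(d_x, d_y, nx, nx * ny);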
That would get you at least coalesced reads and writes to x, and remove the enormous number of redundant calculations in your posted version. There are a number of further optimizations that could be made, but that would require more information about the problem than has been supplied in the question and subsequent comments. Your indexing scheme contains an integer division and then an integer multiply-add per calculation. That is a lot of overhead for a single FLOP per input value. However, having said all of that, if the problem size I quoted is the actual problem size you are interested in, the GPU will never be faster than even a modest host CPU. You would require problems many orders of magnitude larger to realize a useful speedup using the GPU for this sort of low-arithmetic-intensity operation.
How big is the block? It may be that the time needed to copy a small amount of data to the GPU and set up the environment is much longer than the calculation time.

Remember also that CUDA does a JIT compile on the first run, so to get accurate benchmarking you need to run it many times.
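A minimal timing sketch along those lines (the kernel name and launch configuration are placeholders): do one warm-up launch to absorb the one-time JIT and setup cost, then average over many timed runs.

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

multiply<<<blocks, threads>>>(d_x, d_y, nx, ny);   // warm-up: absorbs JIT/setup cost
cudaDeviceSynchronize();

cudaEventRecord(start);
for (int run = 0; run < 100; ++run)
    multiply<<<blocks, threads>>>(d_x, d_y, nx, ny);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms;
cudaEventElapsedTime(&ms, start, stop);
printf("average kernel time: %.3f ms\n", ms / 100.0f);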
Try this using shared memory. One of the best implementations around:
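A minimal shared-memory sketch for this operation, assuming each block owns a TILE-wide strip of columns and stages the matching slice of y once, reusing it for every row (the kernel name and TILE width are illustrative; for this access pattern a per-thread register would work just as well, but the staging shows the shared-memory idiom):

#define TILE 256

__global__ void multiply_shared(float *x, const float *y, int nx, int ny)
{
    __shared__ float ys[TILE];
    int col = blockIdx.x * TILE + threadIdx.x;

    // stage this block's slice of y into shared memory once
    if (col < nx)
        ys[threadIdx.x] = y[col];
    __syncthreads();

    // reuse the staged values across all ny rows; consecutive threads
    // touch consecutive addresses, so the accesses to x are coalesced
    if (col < nx)
        for (int row = 0; row < ny; ++row)
            x[col + row * nx] *= ys[threadIdx.x];
}

// one block per TILE columns:
multiply_shared<<<(nx + TILE - 1) / TILE, TILE>>>(d_x, d_y, nx, ny);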