CUDA Warp Synchronization Problem

Posted 2024-10-19 11:28:01


In generalizing a kernel that shifts the values of a 2D array one space to the right (wrapping around the row boundaries), I have come across a warp synchronization problem. The full code is included below.

The code is meant to work for an arbitrary array width, array height, number of thread blocks, and number of threads per block. When choosing a block size of 33 threads (i.e. one more thread than a full warp), the 33rd thread doesn't synchronize when __syncthreads() is called. This causes problems with the output data. The problem only appears when there is more than one warp and the width of the array is greater than the number of threads (e.g. width=35 with 34 threads).

The following is a downsized example of what happens (in reality the array would need to have more elements for the kernel to produce the error).

Initial array:

0 1 2 3 4 
5 6 7 8 9

Expected Result:

4 0 1 2 3
9 5 6 7 8

Kernel Produces:

4 0 1 2 3
8 5 6 7 8

The first row is done correctly (for each block, if there is more than one), and all subsequent rows have the second-to-last value repeated. I have tested this on two different cards (an 8600GT and a GTX280) and get the same results. I would like to know whether this is just a bug in my kernel, or a problem that can't be fixed by adjusting my code?

The full source file is included below.

Thank you.

#include <cstdio>
#include <cstdlib>

// A method to ensure all reads use the same logical layout.
inline __device__ __host__ int loc(int x, int y, int width)
{
  return y*width + x;
}

//kernel to shift all items in a 2D array one position to the right (wrapping around rows)
__global__ void shiftRight ( int* globalArray, int width, int height)
{
  int temp1=0;          //temporary swap variables
  int temp2=0;

  int blockRange=0;     //the number of rows that a single block will shift

  if (height%gridDim.x==0)  //logic to account for awkward array sizes
    blockRange = height/gridDim.x;
  else
    blockRange = (1+height/gridDim.x);

  int yStart = blockIdx.x*blockRange;
  int yEnd = yStart+blockRange; //the end condition for the y-loop
  yEnd = min(height,yEnd);              //make sure that the array doesn't go out of bounds

  for (int y = yStart; y < yEnd ; ++y)
  {
    //do the first read so the swap variables are loaded for the x-loop
    temp1 = globalArray[loc(threadIdx.x,y,width)];
    //Each block shifts an entire row by itself, even if there are more columns than threads
    for (int threadXOffset = threadIdx.x  ; threadXOffset < width ; threadXOffset+=blockDim.x)
    {
      //blockDim.x is added so that we store the next round of values
      //this has to be done now, because the next operation will
      //overwrite one of these values
      temp2 = globalArray[loc((threadXOffset + blockDim.x)%width,y,width)];
      __syncthreads();  //sync before the write to ensure all the values have been read
      globalArray[loc((threadXOffset +1)%width,y,width)] = temp1;
      __syncthreads();  //sync after the write to ensure all the values have been written
      temp1 = temp2;        //swap the storage variables.
    }
    if (threadIdx.x == 0 && y == 0)
      globalArray[loc(12,2,width)]=globalArray[67];
  }
}


int main (int argc, char* argv[])
{
  //set the parameters to be used
  int width = 34;
  int height = 3;
  int threadsPerBlock=33;
  int numBlocks = 1;

  int memSizeInBytes = width*height*sizeof(int);

  //create the host data and assign each element of the array to equal its index
  int* hostData = (int*) malloc (memSizeInBytes);
  for (int y = 0 ; y < height ; ++y)
    for (int x = 0 ; x < width ; ++x)
      hostData [loc(x,y,width)] = loc(x,y,width);

  //create and allocate the device pointers
  int* deviceData;
  cudaMalloc ( &deviceData  ,memSizeInBytes);
  cudaMemset (  deviceData,0,memSizeInBytes);
  cudaMemcpy (  deviceData, hostData, memSizeInBytes, cudaMemcpyHostToDevice);
  cudaThreadSynchronize();

  //launch the kernel
  shiftRight<<<numBlocks,threadsPerBlock>>> (deviceData, width, height);
  cudaThreadSynchronize();

  //copy the device data to a host array
  int* hostDeviceOutput = (int*) malloc (memSizeInBytes);
  cudaMemcpy (hostDeviceOutput, deviceData, memSizeInBytes, cudaMemcpyDeviceToHost); 
  cudaFree (deviceData);

  //Print out the expected/desired device output
  printf("---- Expected Device Output ----\n");
  printf("   | ");
  for (int x = 0 ; x < width ; ++x)
    printf("%4d ",x);
  printf("\n---|-");
  for (int x = 0 ; x < width ; ++x)
    printf("-----");
  for (int y = 0 ; y < height ; ++y)
  {
    printf("\n%2d | ",y);
    for (int x = 0 ; x < width ; ++x)
      printf("%4d ",hostData[loc((x-1+width)%width,y,width)]);
  }
  printf("\n\n");

  printf("---- Actual Device Output ----\n");
  printf("   | ");
  for (int x = 0 ; x < width ; ++x)
    printf("%4d ",x);
  printf("\n---|-");
  for (int x = 0 ; x < width ; ++x)
    printf("-----");
  for (int y = 0 ; y < height ; ++y)
  {
    printf("\n%2d | ",y);
    for (int x = 0 ; x < width ; ++x)
      printf("%4d ",hostDeviceOutput[loc(x,y,width)]);
  }
  printf("\n\n");

  free(hostData);
  free(hostDeviceOutput);
  return 0;
}

2 Answers

巷子口的你 2024-10-26 11:28:01


Because not all threads execute the same number of loop iterations, synchronization is a problem! All threads should always hit the same __syncthreads() calls.

I would suggest transforming your innermost for loop into something like this:

for(int blockXOffset=0; blockXOffset < width; blockXOffset+=blockDim.x) {
  int threadXOffset=blockXOffset+threadIdx.x;
  bool isActive=(threadXOffset < width);
  if (isActive) temp2 = globalArray[loc((threadXOffset + blockDim.x)%width,y,width)];
  __syncthreads();
  if (isActive) globalArray[loc((threadXOffset +1)%width,y,width)] = temp1;
  __syncthreads();
  temp1 = temp2;
}
谁许谁一生繁华 2024-10-26 11:28:01


From the Programming Guide:

__syncthreads() is allowed in conditional code but only if the conditional evaluates identically across the entire thread block, otherwise the code execution is likely to hang or produce unintended side effects.

In my example, not all threads are executing the same number of loop iterations, so synchronization doesn't happen.
