CUDA timeout? / Fermi / GTX465
I am using CUDA SDK 3.1 on MS VS2005 with a GTX465 1 GB GPU. I have the following kernel function:
__global__ void CRT_GPU_2(float *A, float *X, float *Y, float *Z, float *pIntensity, float *firstTime, float *pointsNumber)
{
    int holo_x = blockIdx.x*20 + threadIdx.x;
    int holo_y = blockIdx.y*20 + threadIdx.y;
    float k = 2.0f*3.14f/0.000000054f;
    if (firstTime[0]==1.0f)
    {
        pIntensity[holo_x+holo_y*MAX_FINAL_X] = 0.0f;
    }
    for (int i=0; i<pointsNumber[0]; i++)
    {
        pIntensity[holo_x+holo_y*MAX_FINAL_X] = pIntensity[holo_x+holo_y*MAX_FINAL_X] + A[i]*cosf(k*sqrtf(pow(holo_x-X[i],2.0f)+pow(holo_y-Y[i],2.0f)+pow(Z[i],2.0f)));
    }
    __syncthreads();
}
and this is the function that calls the kernel:
extern "C" void go2(float *pDATA, float *X, float *Y, float *Z, float *pIntensity, float *firstTime, float *pointsNumber)
{
    dim3 blockGridRows(MAX_FINAL_X/20, MAX_FINAL_Y/20);
    dim3 threadBlockRows(20, 20);
    CRT_GPU_2<<<blockGridRows, threadBlockRows>>>(pDATA, X, Y, Z, pIntensity, firstTime, pointsNumber);
    CUT_CHECK_ERROR("multiplyNumbersGPU() execution failed\n");
    CUDA_SAFE_CALL( cudaThreadSynchronize() );
}
I am loading all the parameters to this function in a loop (for example, 4096 elements per parameter in one loop iteration). In total, I want to run this kernel for 32768 elements per parameter after all loop iterations.
The MAX_FINAL_X is 1920 and MAX_FINAL_Y is 1080.
When I start the algorithm, the first iteration goes very fast, but after one or two more iterations I get a CUDA timeout error. I used this algorithm on a GTX260 GPU and, as far as I remember, it did better...
Could you help me... maybe I am making some mistake related to the new Fermi architecture in this algorithm?
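For clarity, the batching described above could look like the following host-side sketch. This is an assumption, not the asker's actual code: everything except go2 and the 4096/32768 figures (buffer names, the copy step, the firstTime handling) is invented for illustration.

```cuda
// Hypothetical host loop: 32768 points processed in batches of 4096,
// calling go2() once per batch. Device buffers are assumed allocated.
#define TOTAL_POINTS 32768
#define BATCH        4096

extern "C" void go2(float *pDATA, float *X, float *Y, float *Z,
                    float *pIntensity, float *firstTime, float *pointsNumber);

void run_all_batches(float *d_A, float *d_X, float *d_Y, float *d_Z,
                     float *d_pIntensity, float *d_firstTime, float *d_pointsNumber)
{
    for (int batch = 0; batch < TOTAL_POINTS / BATCH; ++batch)
    {
        // ... copy the next BATCH elements of A, X, Y, Z to the device here,
        // set d_pointsNumber[0] = BATCH, and set d_firstTime[0] to 1.0f
        // only on the first batch so pIntensity is cleared exactly once ...
        go2(d_A, d_X, d_Y, d_Z, d_pIntensity, d_firstTime, d_pointsNumber);
    }
}
```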
Comments (3)
Move CUT_CHECK_ERROR after cudaThreadSynchronize(). Because the kernel runs asynchronously, you must wait for the kernel to finish to know about errors... Maybe in the second iteration you receive an error from the first kernel invocation.

Check that you have a valid number in the most interesting variable, pointsNumber[0] (it might cause a long internal loop).

You could also improve the speed of your kernel function:
- Do not use the pow(..., 2.0f) function. It's faster to use an SQR macro (#define SQR(x) (x)*(x)).
- The __syncthreads() is not required.

PS: You could also pass value parameters to CUDA functions, not only pointers. The speed will be the same.

PPS: Please improve the code's readability... Right now you must edit six places to change the block configuration... Inside the kernel you could use the blockDim variable, and you could use constants in the go2 function. You could also use bool firstTime - it will be MUCH better than float.
Is your GPU connected to a display? If so, I believe the default is that kernel execution will be aborted after 5 seconds. You can check whether kernel execution will time out by using cudaGetDeviceProperties - see the reference page.
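A minimal sketch of that check, assuming device 0; kernelExecTimeoutEnabled is the relevant field in cudaDeviceProp:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0 assumed
    // Nonzero means the OS watchdog will abort long-running kernels
    // (typically after ~5 s on a GPU driving a display).
    printf("kernelExecTimeoutEnabled = %d\n", prop.kernelExecTimeoutEnabled);
    return 0;
}
```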
In the kernel's loop you write to the same array that you read from - for global memory usage this is the worst case, because warps from different blocks wait for each other.
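One way to avoid those repeated global reads and writes is to accumulate in a register and touch pIntensity once. A sketch of the kernel rewritten along those lines, under the assumption that the loop's behavior should otherwise match the question (the SQR macro and the use of blockDim are taken from the earlier comment's suggestions, not from the original code):

```cuda
#define SQR(x) ((x)*(x))

__global__ void CRT_GPU_2(float *A, float *X, float *Y, float *Z,
                          float *pIntensity, float *firstTime, float *pointsNumber)
{
    int holo_x = blockIdx.x*blockDim.x + threadIdx.x;
    int holo_y = blockIdx.y*blockDim.y + threadIdx.y;
    int idx = holo_x + holo_y*MAX_FINAL_X;
    float k = 2.0f*3.14f/0.000000054f;

    // Accumulate in a register: one global read to pick up the running
    // total (or start from zero on the first pass), one global write at the end.
    float sum = (firstTime[0] == 1.0f) ? 0.0f : pIntensity[idx];
    int n = (int)pointsNumber[0];
    for (int i = 0; i < n; i++)
    {
        sum += A[i]*cosf(k*sqrtf(SQR(holo_x - X[i]) + SQR(holo_y - Y[i]) + SQR(Z[i])));
    }
    pIntensity[idx] = sum;  // single global write; __syncthreads() not needed
}
```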