CUDA timeout? / Fermi / GTX465

Posted 2024-09-10 03:33:19


I am using CUDA SDK 3.1 on MS VS2005 with GPU GTX465 1 GB. I have such a kernel function:

__global__ void CRT_GPU_2(float *A, float *X, float *Y, float *Z, float *pIntensity, float *firstTime, float *pointsNumber)
{
  int holo_x = blockIdx.x*20 + threadIdx.x;
  int holo_y = blockIdx.y*20 + threadIdx.y;

  float k = 2.0f*3.14f/0.000000054f;

  if (firstTime[0]==1.0f)
  {
    pIntensity[holo_x+holo_y*MAX_FINAL_X] = 0.0f;
  }

  for (int i=0; i<pointsNumber[0]; i++)
  {
    pIntensity[holo_x+holo_y*MAX_FINAL_X] = pIntensity[holo_x+holo_y*MAX_FINAL_X]
      + A[i]*cosf(k*sqrtf(pow(holo_x-X[i],2.0f) + pow(holo_y-Y[i],2.0f) + pow(Z[i],2.0f)));
  }

  __syncthreads();
}

and this is the function that calls the kernel:

extern "C" void go2(float *pDATA, float *X, float *Y, float *Z, float *pIntensity, float *firstTime, float *pointsNumber)
{
 dim3 blockGridRows(MAX_FINAL_X/20,MAX_FINAL_Y/20);
 dim3 threadBlockRows(20, 20);

 CRT_GPU_2<<<blockGridRows, threadBlockRows>>>(pDATA, X, Y, Z, pIntensity,firstTime, pointsNumber); 
 CUT_CHECK_ERROR("multiplyNumbersGPU() execution failed\n");
 CUDA_SAFE_CALL( cudaThreadSynchronize() );
}

I am loading all the parameters for this function in a loop (for example, 4096 elements per parameter in one loop iteration). In total, after all loop iterations, I want the kernel to have processed 32768 elements per parameter.

The MAX_FINAL_X is 1920 and MAX_FINAL_Y is 1080.

When I start the algorithm, the first iteration goes very fast, but after one or two more iterations I get a CUDA timeout error. I used this algorithm on a GTX260 GPU and, as far as I remember, it ran better...

Could you help me? Maybe I am making some mistake related to the new Fermi architecture in this algorithm?


3 Answers

沒落の蓅哖 2024-09-17 03:33:19

  1. It would be better to call CUT_CHECK_ERROR after cudaThreadSynchronize(). Because the kernel runs asynchronously, you must wait for the kernel to finish to learn about errors... Maybe in the second iteration you are receiving an error from the first kernel invocation.
  2. Be sure that you have a valid number in the most interesting variable, pointsNumber[0] (it might cause a long internal loop).
  3. You could also improve the speed of your kernel function:

    • Use better blocks. A 20x20 thread configuration will cause very slow memory access (see the Programming Guide and Best Practices). Try 16x16 blocks.
    • Do not use the pow(..., 2.0f) function. It's faster to use an SQR macro (#define SQR(x) ((x)*(x))).
    • You don't use shared memory, so __syncthreads() is not required.

PS: You could also pass value parameters to CUDA functions, not only pointers. The speed will be the same.

PPS: Please improve the code's readability... Right now you must edit six places to change the block configuration... Inside the kernel you could use the blockDim variable, and you could use constants in the go2 function.
You could also use bool firstTime - it would be MUCH better than float.
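Putting these suggestions together, a revised kernel and launcher might look like the sketch below (untested; BLOCK_SIZE, the scalar-by-value signature, and the bounds check are my additions - since 1080 is not divisible by 16, the grid must be rounded up and out-of-range threads returned early):

```cuda
#define SQR(x) ((x)*(x))
#define BLOCK_SIZE 16   // assumption: 16x16 blocks as suggested above

__global__ void CRT_GPU_2(const float *A, const float *X, const float *Y,
                          const float *Z, float *pIntensity,
                          bool firstTime, int pointsNumber)  // scalars by value
{
    int holo_x = blockIdx.x * blockDim.x + threadIdx.x;
    int holo_y = blockIdx.y * blockDim.y + threadIdx.y;
    if (holo_x >= MAX_FINAL_X || holo_y >= MAX_FINAL_Y)
        return;                      // grid is rounded up, so guard the edges

    int idx = holo_x + holo_y * MAX_FINAL_X;
    const float k = 2.0f * 3.14f / 0.000000054f;

    float acc = firstTime ? 0.0f : pIntensity[idx];
    for (int i = 0; i < pointsNumber; i++)
        acc += A[i] * cosf(k * sqrtf(SQR(holo_x - X[i]) +
                                     SQR(holo_y - Y[i]) +
                                     SQR(Z[i])));
    pIntensity[idx] = acc;           // no __syncthreads(): no shared memory used
}

extern "C" void go2(float *pDATA, float *X, float *Y, float *Z,
                    float *pIntensity, bool firstTime, int pointsNumber)
{
    dim3 grid((MAX_FINAL_X + BLOCK_SIZE - 1) / BLOCK_SIZE,
              (MAX_FINAL_Y + BLOCK_SIZE - 1) / BLOCK_SIZE);
    dim3 block(BLOCK_SIZE, BLOCK_SIZE);

    CRT_GPU_2<<<grid, block>>>(pDATA, X, Y, Z, pIntensity, firstTime, pointsNumber);
    CUDA_SAFE_CALL( cudaThreadSynchronize() );          // wait for the kernel first...
    CUT_CHECK_ERROR("CRT_GPU_2() execution failed\n");  // ...then check for errors
}
```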

原来是傀儡 2024-09-17 03:33:19


Is your GPU connected to a display? If so, I believe the default is that kernel execution will be aborted after 5 seconds. You can check whether kernel execution will time out by using cudaGetDeviceProperties - see the reference page.
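As a sketch (assuming the CUDA runtime API available in SDK 3.1), the watchdog setting can be queried like this:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    // Nonzero when a display watchdog may abort long-running kernels
    // (typically after roughly 5 seconds on a display-attached GPU).
    printf("kernelExecTimeoutEnabled = %d\n", prop.kernelExecTimeoutEnabled);
    return 0;
}
```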

阳光下慵懒的猫 2024-09-17 03:33:19


In the kernel's loop you write to the same array that you read from - for global memory usage this is the worst case, because warps from different blocks wait for each other.
