CUDA timeout? / Fermi / GTX465
I am using CUDA SDK 3.1 on MS VS2005 with a GTX465 1 GB GPU. I have the following kernel function:
__global__ void CRT_GPU_2(float *A, float *X, float *Y, float *Z, float *pIntensity, float *firstTime, float *pointsNumber)
{
    int holo_x = blockIdx.x*20 + threadIdx.x;
    int holo_y = blockIdx.y*20 + threadIdx.y;
    float k = 2.0f*3.14f/0.000000054f;
    if (firstTime[0]==1.0f)
    {
        pIntensity[holo_x+holo_y*MAX_FINAL_X] = 0.0f;
    }
    for (int i=0; i<pointsNumber[0]; i++)
    {
        pIntensity[holo_x+holo_y*MAX_FINAL_X] = pIntensity[holo_x+holo_y*MAX_FINAL_X] + A[i]*cosf(k*sqrtf(pow(holo_x-X[i],2.0f)+pow(holo_y-Y[i],2.0f)+pow(Z[i],2.0f)));
    }
    __syncthreads();
}
and this is the function that calls the kernel:
extern "C" void go2(float *pDATA, float *X, float *Y, float *Z, float *pIntensity, float *firstTime, float *pointsNumber)
{
    dim3 blockGridRows(MAX_FINAL_X/20, MAX_FINAL_Y/20);
    dim3 threadBlockRows(20, 20);
    CRT_GPU_2<<<blockGridRows, threadBlockRows>>>(pDATA, X, Y, Z, pIntensity, firstTime, pointsNumber);
    CUT_CHECK_ERROR("multiplyNumbersGPU() execution failed\n");
    CUDA_SAFE_CALL( cudaThreadSynchronize() );
}
I am loading all the parameters to this function in a loop (for example, 4096 elements per parameter in one loop iteration). In total, I want to run this kernel for 32768 elements per parameter after all loop iterations.
The MAX_FINAL_X is 1920 and MAX_FINAL_Y is 1080.
When I start the algorithm, the first iteration goes very fast, but after one or two more iterations I get a CUDA timeout error. I used this algorithm on a GTX260 GPU and, as far as I remember, it did better...
Could you help me... maybe I am making some mistake related to the new Fermi architecture in this algorithm?
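For clarity, the batching described above could look like the following host-side sketch. This is an assumption, not the asker's actual code: everything except go2 and the 4096/32768 figures (buffer names, the copy step, the firstTime handling) is invented for illustration.

```cuda
// Hypothetical host loop: 32768 points processed in batches of 4096,
// calling go2() once per batch. Device buffers are assumed allocated.
#define TOTAL_POINTS 32768
#define BATCH        4096

extern "C" void go2(float *pDATA, float *X, float *Y, float *Z,
                    float *pIntensity, float *firstTime, float *pointsNumber);

void run_all_batches(float *d_A, float *d_X, float *d_Y, float *d_Z,
                     float *d_pIntensity, float *d_firstTime, float *d_pointsNumber)
{
    for (int batch = 0; batch < TOTAL_POINTS / BATCH; ++batch)
    {
        // ... copy the next BATCH elements of A, X, Y, Z to the device here,
        // set d_pointsNumber[0] = BATCH, and set d_firstTime[0] to 1.0f
        // only on the first batch so pIntensity is cleared exactly once ...
        go2(d_A, d_X, d_Y, d_Z, d_pIntensity, d_firstTime, d_pointsNumber);
    }
}
```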
Comments (3)
Move CUT_CHECK_ERROR after cudaThreadSynchronize(). Because the kernel runs asynchronously, you must wait for the kernel to finish to know about errors... Maybe in the second iteration you receive an error from the first kernel invocation.

Check that you have a valid number in the most interesting variable, pointsNumber[0] (it might cause a long internal loop).

You could also improve the speed of your kernel function:
- Do not use the pow(..., 2.0f) function. It's faster to use an SQR macro (#define SQR(x) (x)*(x)).
- The __syncthreads() is not required.

PS: You could also pass value parameters to CUDA functions, not only pointers. The speed will be the same.

PPS: Please improve the code's readability... Right now you must edit six places to change the block configuration... Inside the kernel you could use the blockDim variable, and you could use constants in the go2 function. You could also use bool firstTime - it will be MUCH better than float.
Is your GPU connected to a display? If so, I believe the default is that kernel execution will be aborted after 5 seconds. You can check whether kernel execution will time out by using cudaGetDeviceProperties - see the reference page.
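A minimal sketch of that check, assuming device 0; kernelExecTimeoutEnabled is the relevant field in cudaDeviceProp:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0 assumed
    // Nonzero means the OS watchdog will abort long-running kernels
    // (typically after ~5 s on a GPU driving a display).
    printf("kernelExecTimeoutEnabled = %d\n", prop.kernelExecTimeoutEnabled);
    return 0;
}
```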
In the kernel's loop you write to the same array that you read from - for global memory usage this is the worst case, because warps from different blocks wait for each other.
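One way to avoid those repeated global reads and writes is to accumulate in a register and touch pIntensity once. A sketch of the kernel rewritten along those lines, under the assumption that the loop's behavior should otherwise match the question (the SQR macro and the use of blockDim are taken from the earlier comment's suggestions, not from the original code):

```cuda
#define SQR(x) ((x)*(x))

__global__ void CRT_GPU_2(float *A, float *X, float *Y, float *Z,
                          float *pIntensity, float *firstTime, float *pointsNumber)
{
    int holo_x = blockIdx.x*blockDim.x + threadIdx.x;
    int holo_y = blockIdx.y*blockDim.y + threadIdx.y;
    int idx = holo_x + holo_y*MAX_FINAL_X;
    float k = 2.0f*3.14f/0.000000054f;

    // Accumulate in a register: one global read to pick up the running
    // total (or start from zero on the first pass), one global write at the end.
    float sum = (firstTime[0] == 1.0f) ? 0.0f : pIntensity[idx];
    int n = (int)pointsNumber[0];
    for (int i = 0; i < n; i++)
    {
        sum += A[i]*cosf(k*sqrtf(SQR(holo_x - X[i]) + SQR(holo_y - Y[i]) + SQR(Z[i])));
    }
    pIntensity[idx] = sum;  // single global write; __syncthreads() not needed
}
```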