CUDA program memory error on a Fermi GPU
I am facing the following problem on a GeForce GTX 580 (Fermi-class) GPU.
Just to give you some background, I am reading single-byte samples packed in the following manner in a file: Real(Signal 1), Imaginary(Signal 1), Real(Signal 2), Imaginary(Signal 2). (Each byte is a signed char, taking values between -128 and 127.) I read these into a char4 array, and use the kernel given below to copy them to two float2 arrays corresponding to each signal. (This is just an isolated part of a larger program.)
When I run the program using cuda-memcheck, I get either an unqualified unspecified launch failure, or the same message along with User Stack Overflow or Breakpoint Hit, or Invalid __global__ write of size 8 at random thread and block indices.
The main kernel and launch-related code is reproduced below. The strange thing is that this code works (and cuda-memcheck throws no error) on a non-Fermi-class GPU that I have access to. Another thing that I observed is that the Fermi gives no error for N less than 16384.
#define N 32768

__global__ void CopyDataForFFT(char4 *pc4Data,
                               float2 *pf2FFTInX,
                               float2 *pf2FFTInY);

int main(int argc, char *argv[])
{
    char4 *pc4Buf_h = NULL;
    char4 *pc4Buf_d = NULL;
    float2 *pf2InX_d = NULL;
    float2 *pf2InY_d = NULL;
    dim3 dimBCopy(1, 1, 1);
    dim3 dimGCopy(1, 1);
    ...
    /* I do check for errors in the actual code */
    pc4Buf_h = (char4 *) malloc(N * sizeof(char4));
    (void) cudaMalloc((void **) &pc4Buf_d, N * sizeof(char4));
    (void) cudaMalloc((void **) &pf2InX_d, N * sizeof(float2));
    (void) cudaMalloc((void **) &pf2InY_d, N * sizeof(float2));
    ...
    dimBCopy.x = 1024;      /* number of threads in a block, for my GPU */
    dimGCopy.x = N / 1024;
    CopyDataForFFT<<<dimGCopy, dimBCopy>>>(pc4Buf_d, pf2InX_d, pf2InY_d);
    ...
}

__global__ void CopyDataForFFT(char4 *pc4Data,
                               float2 *pf2FFTInX,
                               float2 *pf2FFTInY)
{
    int i = (blockIdx.x * blockDim.x) + threadIdx.x;
    pf2FFTInX[i].x = (float) pc4Data[i].x;
    pf2FFTInX[i].y = (float) pc4Data[i].y;
    pf2FFTInY[i].x = (float) pc4Data[i].z;
    pf2FFTInY[i].y = (float) pc4Data[i].w;
}
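For reference, a defensive rewrite of the copy kernel is sketched below. It adds a bounds check (a no-op here, since N is a multiple of the block size, but it rules out out-of-range writes for other N), loads each char4 once, and checks the launch result. The element-count parameter iLen is my addition, not part of the original code; this is standard CUDA hygiene, not a claim about the root cause of the errors.

__global__ void CopyDataForFFT(const char4 *pc4Data,
                               float2 *pf2FFTInX,
                               float2 *pf2FFTInY,
                               int iLen)    /* iLen: element count (added) */
{
    int i = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (i >= iLen)                  /* guard against out-of-range writes */
        return;

    char4 c4Sample = pc4Data[i];    /* one 32-bit load instead of four byte loads */
    pf2FFTInX[i] = make_float2((float) c4Sample.x, (float) c4Sample.y);
    pf2FFTInY[i] = make_float2((float) c4Sample.z, (float) c4Sample.w);
}

/* launch fragment: round the grid up to cover N, then check for errors */
dimGCopy.x = (N + dimBCopy.x - 1) / dimBCopy.x;
CopyDataForFFT<<<dimGCopy, dimBCopy>>>(pc4Buf_d, pf2InX_d, pf2InY_d, N);
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    fprintf(stderr, "Kernel launch failed: %s\n", cudaGetErrorString(err));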
One other thing I noticed in my program is that if I comment out either the first two or the last two char-to-float assignment statements in my kernel, there's no memory error. If I comment out one of the first two (pf2FFTInX) and one of the last two (pf2FFTInY), errors still crop up, but less frequently. The kernel uses 6 registers with all four assignment statements uncommented, and 4 registers with two assignment statements commented out.
I tried the 32-bit toolkit in place of the 64-bit toolkit, 32-bit compilation with the -m32 compiler option, running without X windows, etc., but the program behaviour is the same.
I use CUDA 4.0 driver and runtime (also tried CUDA 3.2) on RHEL 5.6. The GPU compute capability is 2.0.
Please help! I could post the entire code if anybody is interested in running it on their Fermi cards.
UPDATE: Just for the heck of it, I inserted a __syncthreads() between the pf2FFTInX and the pf2FFTInY assignment statements, and the memory errors disappeared for N = 32768. But at N = 65536, I still get errors. (This didn't last long; I'm still getting errors.)
UPDATE: In continuing with the weird behaviour, when I run the program using cuda-memcheck, I get these 16x16 blocks of multi-coloured pixels distributed randomly all over my screen. This does not happen if I run the program directly.
ANSWER: The problem was a bad GPU card (see the comments). [I'm adding this answer to remove the question from the unanswered list and make it more useful.]
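Since the cause turned out to be faulty hardware, one quick way to corroborate a bad card is a pattern write/read-back test over device memory: failures at varying addresses across repeated runs point to the hardware rather than the code. A minimal sketch (all names here are mine, for illustration):

/* Fill a device buffer with an address-dependent pattern. */
__global__ void FillPattern(unsigned int *puiBuf, unsigned int uiPattern, int iLen)
{
    int i = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (i < iLen)
        puiBuf[i] = uiPattern ^ (unsigned int) i;
}

/* Write the pattern on the GPU, copy it back, count mismatches.
   A non-zero count on repeated runs suggests bad device memory. */
int TestDeviceMemory(int iWords, unsigned int uiPattern)
{
    unsigned int *puiBuf_d = NULL;
    unsigned int *puiBuf_h = (unsigned int *) malloc(iWords * sizeof(unsigned int));
    int iErrors = 0;

    cudaMalloc((void **) &puiBuf_d, iWords * sizeof(unsigned int));
    FillPattern<<<(iWords + 255) / 256, 256>>>(puiBuf_d, uiPattern, iWords);
    cudaMemcpy(puiBuf_h, puiBuf_d, iWords * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);

    for (int i = 0; i < iWords; i++)
        if (puiBuf_h[i] != (uiPattern ^ (unsigned int) i))
            iErrors++;

    cudaFree(puiBuf_d);
    free(puiBuf_h);
    return iErrors;
}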