CUDA program memory error on a Fermi GPU
I am facing the following problem on a GeForce GTX 580 (Fermi-class) GPU.
Just to give you some background, I am reading single-byte samples packed in the following manner in a file: Real(Signal 1), Imaginary(Signal 1), Real(Signal 2), Imaginary(Signal 2). (Each byte is a signed char, taking values between -128 and 127.) I read these into a char4 array, and use the kernel given below to copy them to two float2 arrays corresponding to each signal. (This is just an isolated part of a larger program.)
When I run the program using cuda-memcheck, I get either an unqualified unspecified launch failure, or the same message along with User Stack Overflow or Breakpoint Hit, or Invalid __global__ write of size 8 at random thread and block indices.
The main kernel and launch-related code is reproduced below. The strange thing is that this code works (and cuda-memcheck throws no error) on a non-Fermi-class GPU that I have access to. Another thing that I observed is that the Fermi gives no error for N less than 16384.
#define N 32768

__global__ void CopyDataForFFT(char4 *pc4Data,
                               float2 *pf2FFTInX,
                               float2 *pf2FFTInY);

int main(int argc, char *argv[])
{
    char4 *pc4Buf_h = NULL;
    char4 *pc4Buf_d = NULL;
    float2 *pf2InX_d = NULL;
    float2 *pf2InY_d = NULL;
    dim3 dimBCopy(1, 1, 1);
    dim3 dimGCopy(1, 1);
    ...
    /* I do check for errors in the actual code */
    pc4Buf_h = (char4 *) malloc(N * sizeof(char4));
    (void) cudaMalloc((void **) &pc4Buf_d, N * sizeof(char4));
    (void) cudaMalloc((void **) &pf2InX_d, N * sizeof(float2));
    (void) cudaMalloc((void **) &pf2InY_d, N * sizeof(float2));
    ...
    dimBCopy.x = 1024;      /* number of threads in a block, for my GPU */
    dimGCopy.x = N / 1024;
    CopyDataForFFT<<<dimGCopy, dimBCopy>>>(pc4Buf_d, pf2InX_d, pf2InY_d);
    ...
}

__global__ void CopyDataForFFT(char4 *pc4Data,
                               float2 *pf2FFTInX,
                               float2 *pf2FFTInY)
{
    int i = (blockIdx.x * blockDim.x) + threadIdx.x;
    pf2FFTInX[i].x = (float) pc4Data[i].x;
    pf2FFTInX[i].y = (float) pc4Data[i].y;
    pf2FFTInY[i].x = (float) pc4Data[i].z;
    pf2FFTInY[i].y = (float) pc4Data[i].w;
}
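For reference, a defensive rewrite of the copy kernel is sketched below. It adds a bounds check (a no-op here, since N is a multiple of the block size, but it rules out out-of-range writes for other N), loads each char4 once, and checks the launch result. The element-count parameter iLen is my addition, not part of the original code; this is standard CUDA hygiene, not a claim about the root cause of the errors.

__global__ void CopyDataForFFT(const char4 *pc4Data,
                               float2 *pf2FFTInX,
                               float2 *pf2FFTInY,
                               int iLen)    /* iLen: element count (added) */
{
    int i = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (i >= iLen)                  /* guard against out-of-range writes */
        return;

    char4 c4Sample = pc4Data[i];    /* one 32-bit load instead of four byte loads */
    pf2FFTInX[i] = make_float2((float) c4Sample.x, (float) c4Sample.y);
    pf2FFTInY[i] = make_float2((float) c4Sample.z, (float) c4Sample.w);
}

/* launch fragment: round the grid up to cover N, then check for errors */
dimGCopy.x = (N + dimBCopy.x - 1) / dimBCopy.x;
CopyDataForFFT<<<dimGCopy, dimBCopy>>>(pc4Buf_d, pf2InX_d, pf2InY_d, N);
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    fprintf(stderr, "Kernel launch failed: %s\n", cudaGetErrorString(err));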
One other thing I noticed in my program is that if I comment out either the first two or the last two char-to-float assignment statements in my kernel, there's no memory error. If I comment out one of the first two (pf2FFTInX) and one of the last two (pf2FFTInY), errors still crop up, but less frequently. The kernel uses 6 registers with all four assignment statements uncommented, and 4 registers with two assignment statements commented out.
I tried the 32-bit toolkit in place of the 64-bit toolkit, 32-bit compilation with the -m32 compiler option, running without X windows, etc., but the program behaviour is the same.
I use CUDA 4.0 driver and runtime (also tried CUDA 3.2) on RHEL 5.6. The GPU compute capability is 2.0.
Please help! I could post the entire code if anybody is interested in running it on their Fermi cards.
UPDATE: Just for the heck of it, I inserted a __syncthreads() between the pf2FFTInX and the pf2FFTInY assignment statements, and the memory errors disappeared for N = 32768. But at N = 65536, I still get errors. (This didn't last long; I'm still getting errors.)
UPDATE: In continuing with the weird behaviour, when I run the program using cuda-memcheck, I get these 16x16 blocks of multi-coloured pixels distributed randomly all over my screen. This does not happen if I run the program directly.
ANSWER: The problem was a bad GPU card (see the comments). [I'm adding this answer to remove the question from the unanswered list and make it more useful.]
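Since the cause turned out to be faulty hardware, one quick way to corroborate a bad card is a pattern write/read-back test over device memory: failures at varying addresses across repeated runs point to the hardware rather than the code. A minimal sketch (all names here are mine, for illustration):

/* Fill a device buffer with an address-dependent pattern. */
__global__ void FillPattern(unsigned int *puiBuf, unsigned int uiPattern, int iLen)
{
    int i = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (i < iLen)
        puiBuf[i] = uiPattern ^ (unsigned int) i;
}

/* Write the pattern on the GPU, copy it back, count mismatches.
   A non-zero count on repeated runs suggests bad device memory. */
int TestDeviceMemory(int iWords, unsigned int uiPattern)
{
    unsigned int *puiBuf_d = NULL;
    unsigned int *puiBuf_h = (unsigned int *) malloc(iWords * sizeof(unsigned int));
    int iErrors = 0;

    cudaMalloc((void **) &puiBuf_d, iWords * sizeof(unsigned int));
    FillPattern<<<(iWords + 255) / 256, 256>>>(puiBuf_d, uiPattern, iWords);
    cudaMemcpy(puiBuf_h, puiBuf_d, iWords * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);

    for (int i = 0; i < iWords; i++)
        if (puiBuf_h[i] != (uiPattern ^ (unsigned int) i))
            iErrors++;

    cudaFree(puiBuf_d);
    free(puiBuf_h);
    return iErrors;
}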