CUDA program memory errors on a Fermi GPU

Posted on 2024-11-25 13:52:08

I am facing the following problem on a GeForce GTX 580 (Fermi-class) GPU.

Just to give you some background, I am reading single-byte samples packed in a file in the following order: Real(Signal 1), Imaginary(Signal 1), Real(Signal 2), Imaginary(Signal 2). (Each byte is a signed char, taking values between -128 and 127.) I read these into a char4 array, and use the kernel given below to copy them to two float2 arrays, one for each signal. (This is just an isolated part of a larger program.)
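To make the packing concrete, here is a minimal host-side sketch of the same unpacking in plain C. It can serve as a CPU reference to compare the GPU output against. The names Char4Ref, Float2Ref, CopyDataForFFTRef and CountMismatches are hypothetical and not part of my program; the two structs are stand-ins for CUDA's char4 and float2 so that the sketch compiles without nvcc.

```c
#include <stddef.h>

/* Hypothetical host-side stand-ins for CUDA's char4 and float2. */
typedef struct { signed char x, y, z, w; } Char4Ref;
typedef struct { float x, y; } Float2Ref;

/* CPU reference for the unpacking described above:
 * x, y hold Real/Imaginary of signal 1; z, w hold Real/Imaginary of signal 2. */
static void CopyDataForFFTRef(const Char4Ref *in,
                              Float2Ref *outX,
                              Float2Ref *outY,
                              size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        outX[i].x = (float) in[i].x;
        outX[i].y = (float) in[i].y;
        outY[i].x = (float) in[i].z;
        outY[i].y = (float) in[i].w;
    }
}

/* Counts elements where two float2 arrays differ. Exact comparison is
 * safe here because every value in [-128, 127] is exactly representable
 * as a float. */
static size_t CountMismatches(const Float2Ref *a,
                              const Float2Ref *b,
                              size_t n)
{
    size_t bad = 0;
    for (size_t i = 0; i < n; ++i) {
        if (a[i].x != b[i].x || a[i].y != b[i].y) {
            ++bad;
        }
    }
    return bad;
}
```

One possible use, assuming the device results have been copied back to the host with cudaMemcpy: run CopyDataForFFTRef on the same input and pass both result arrays to CountMismatches. A non-zero count that changes from run to run would point away from the kernel logic itself.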

When I run the program using cuda-memcheck, I get either a bare "unspecified launch failure", or the same message along with "User Stack Overflow", "Breakpoint Hit", or "Invalid __global__ write of size 8" at random thread and block indices.

The main kernel and launch-related code is reproduced below. The strange thing is that this code works (and cuda-memcheck reports no errors) on a non-Fermi-class GPU that I have access to. Another thing I observed is that the Fermi card gives no errors for N less than 16384.

#define N   32768

int main(int argc, char *argv[])
{
    char4 *pc4Buf_h = NULL;
    char4 *pc4Buf_d = NULL;
    float2 *pf2InX_d = NULL;
    float2 *pf2InY_d = NULL;
    dim3 dimBCopy(1, 1, 1);
    dim3 dimGCopy(1, 1);
    ...
    /* I do check for errors in the actual code */
    pc4Buf_h = (char4 *) malloc(N * sizeof(char4));
    (void) cudaMalloc((void **) &pc4Buf_d, N * sizeof(char4));
    (void) cudaMalloc((void **) &pf2InX_d, N * sizeof(float2));
    (void) cudaMalloc((void **) &pf2InY_d, N * sizeof(float2));
    ...
    dimBCopy.x = 1024;  /* number of threads in a block, for my GPU */
    dimGCopy.x = N / 1024;
    CopyDataForFFT<<<dimGCopy, dimBCopy>>>(pc4Buf_d,
                                           pf2InX_d,
                                           pf2InY_d);
    ...
}

__global__ void CopyDataForFFT(char4 *pc4Data,
                               float2 *pf2FFTInX,
                               float2 *pf2FFTInY)
{
    int i = (blockIdx.x * blockDim.x) + threadIdx.x;

    pf2FFTInX[i].x = (float) pc4Data[i].x;
    pf2FFTInX[i].y = (float) pc4Data[i].y;
    pf2FFTInY[i].x = (float) pc4Data[i].z;
    pf2FFTInY[i].y = (float) pc4Data[i].w;

    return;
}
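
As a sanity check on the launch geometry, the grid-size arithmetic can be sketched on the host in plain C. The helper names BlocksNeeded and LaunchCoversExactly are hypothetical and not part of my program. The point is that the kernel has no bounds guard, so the launch is only exact when N is a multiple of the block size, which holds for the values of N I tried.

```c
#include <stddef.h>

/* Number of blocks needed to cover n elements with threadsPerBlock
 * threads per block, rounding up. */
static size_t BlocksNeeded(size_t n, size_t threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

/* Returns 1 when a launch of n / threadsPerBlock blocks creates exactly
 * n threads, so an unguarded kernel touches indices 0..n-1 and nothing
 * else; returns 0 otherwise. */
static int LaunchCoversExactly(size_t n, size_t threadsPerBlock)
{
    return threadsPerBlock != 0 && n % threadsPerBlock == 0;
}
```

For N = 32768 and 1024 threads per block this gives 32 blocks and exact coverage, and for N = 65536 it gives 64 blocks. If N were not a multiple of 1024, the truncating division N / 1024 in main would leave the tail unconverted, while a rounded-up grid would need an `if (i < n)` guard in the kernel.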

One other thing I noticed in my program is that if I comment out either the first two or the last two char-to-float assignment statements in my kernel, there's no memory error. If I comment out one from the first two (pf2FFTInX) and another from the second two (pf2FFTInY), errors still crop up, but less frequently. The kernel uses 6 registers with all four assignment statements uncommented, and 4 registers with two assignment statements commented out.

I tried the 32-bit toolkit in place of the 64-bit toolkit, 32-bit compilation with the -m32 compiler option, running without X Windows, and so on, but the program behaviour is the same.

I use the CUDA 4.0 driver and runtime (I also tried CUDA 3.2) on RHEL 5.6. The GPU's compute capability is 2.0.

Please help! I could post the entire code if anybody is interested in running it on their Fermi cards.

UPDATE: Just for the heck of it, I inserted a __syncthreads() between the pf2FFTInX and the pf2FFTInY assignment statements. At first the memory errors seemed to disappear for N = 32768 (though not for N = 65536), but this didn't last long: I am still getting errors.

UPDATE: In continuing with the weird behaviour, when I run the program using cuda-memcheck, I get these 16x16 blocks of multi-coloured pixels distributed randomly all over my screen. This does not happen if I run the program directly.

Comments (1)

稍尽春風 2024-12-02 13:52:08

The problem was a bad GPU card (see the comments). [I'm adding this answer to remove the question from the unanswered list and make it more useful.]
