数组的顺序求和返回错误值

发布于 2024-12-04 18:20:30 字数 4738 浏览 0 评论 0原文

在我开始之前,这是正在发生的事情的总体思路:

总体思路是我有 x 个浮点数组,并且我想将每个浮点数按顺序添加到另一个数组(标量添加):

t = array;

a = 数组的数组;

t = 零

t += a[0]

t += a[1]

...

t += a[N]

其中 += 表示标量加法。

这很简单。我尝试缩小代码,使其尽可能紧凑并保留功能。这里的问题是,对于某些大小的数组 - 我在任何大于 128 x 128 x 108 的数组上都看到了问题。基本上,复制回主机的内存总和与我计算的值不同。我已经被困在这个问题上一整天了,所以我不会再浪费时间了。我真的无法解释为什么会发生这种情况。我的理由是:

  • 使用太多的常量空间(不使用任何)
  • 使用太多的寄存器(否)
  • 内核中检查 idx、idy、idz 是否在边界内的条件不正确(这仍然可能是它)
  • gpu 的一些有趣的事情(尝试过) gt280,以及特斯拉 C1060 和 C2060)
  • printf 格式不正确(我希望就是这样) *...

这个清单还可以继续列下去。如果您有时间,感谢您浏览此内容。该问题几乎似乎与内存相关(即> 128*128*108 的内存大小不起作用。因此 64*128*256工作,或其任何排列)。

这是完整的源代码(应该可以用 nvcc 编译):

#include <cuda.h>
#include <iostream>
#include <stdio.h>
#include <assert.h>

#define BSIZE 8

void cudaCheckError(cudaError_t e,const char * msg) {
    if (e != cudaSuccess){
        printf("Error number: %d\n",e);
        printf("%s\n",msg);
    }
};

__global__ void accumulate(float * in,float * out, int3 gdims, int zlevel) {

    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    int idy = blockIdx.y*blockDim.y + threadIdx.y;
    int idz = threadIdx.z;

    long int index = (zlevel*((int)BSIZE)+idz)*gdims.x*gdims.y+ \
        idy*gdims.x+ \
        idx;

    if ( idx < gdims.x && idy < gdims.y && (idz + zlevel*(int)BSIZE) < gdims.z) {

        out[index] += in[index];
    }
};

int main(int argc, char * argv[]) {

    int width, 
    height,
    depth; 

    if (argc != 4) {
        printf("Must have 3 inputs: width height depth\n");
        exit(0);
    }
    float tempsum;
    int count =0;
    width = atoi(argv[1]);
    height = atoi(argv[2]);
    depth = atoi(argv[3]);

    printf("Dimensions (%d,%d,%d)\n",width,height,depth);

    int3 dFull;

    dFull.x = width+2;
    dFull.y = height+2;
    dFull.z = depth+2;

    printf("Dimensions (%d,%d,%d)\n",dFull.x,dFull.y,dFull.z);

    int fMemSize=dFull.x*dFull.y*dFull.z;

    int nHostF=9;

    float * f_hostZero;

    float ** f_dev;

    float * f_temp_host;
    float * f_temp_dev;

    dim3 grid( dFull.x/(int)BSIZE+1, dFull.y/(int)BSIZE + 1);

    dim3 threads((int)BSIZE,(int)BSIZE,(int)BSIZE);
    printf("Threads (x,y) : (%d,%d)\nGrid (x,y) : (%d,%d)\n",threads.x,threads.y,grid.x,grid.y);

    int num_zsteps=dFull.z/(int)BSIZE + 1;
    printf("Number of z steps to take : %d\n",num_zsteps);
    // Host array allocation
    f_temp_host = new float[fMemSize];
    f_hostZero = new float[fMemSize];


    // Allocate nHostF address on host 
    f_dev = new float*[nHostF];

    // Host array assignment
    for(int i=0; i < fMemSize; i++){
        f_temp_host[i] = 1.0;
        f_hostZero[i] = 0.0;
    }

    // Device allocations - allocated for array size + 2
    for(int i=0; i<nHostF; i++){
        cudaMalloc((void**)&f_dev[i],sizeof(float)*fMemSize);
    }


    // Allocate the decive pointer
    cudaMalloc( (void**)&f_temp_dev, sizeof(float)*fMemSize);

    cudaCheckError(cudaMemcpy((void *)f_temp_dev,(const void *)f_hostZero,
        sizeof(float)*fMemSize,cudaMemcpyHostToDevice),"At first mem copy");

    printf("Memory regions allocated\n");

    // Copy memory to each array
    for(int i=0; i<nHostF; i++){
        cudaCheckError(cudaMemcpy((void *)(f_dev[i]),(const void *)f_temp_host,
            sizeof(float)*fMemSize,cudaMemcpyHostToDevice),"At first mem copy");
    }

    // Add value 1.0 (from each array n f_dev[i]) to f_temp_dev
    for (int i=0; i<nHostF; i++){
        for (int zLevel=0; zLevel<num_zsteps; zLevel++){
            accumulate<<<grid,threads>>>(f_dev[i],f_temp_dev,dFull,zLevel);
            cudaThreadSynchronize();
        }
        cudaCheckError(cudaMemcpy((void *)f_temp_host,(const void *)f_temp_dev,
            sizeof(float)*fMemSize,cudaMemcpyDeviceToHost),"At mem copy back");
        tempsum=0.f;
        count =0;
        for(int k = 0 ; k< fMemSize; k++){
            tempsum += f_temp_host[k];

            assert ( (int)f_temp_host[k] == (i+1) );
            if ( f_temp_host[k] !=(float)(i+1) ) {
                printf("Found invalid return value\n");
                exit(0);
            }
            count++;
        }
        printf("Total Count: %d\n",count);
        printf("Real Array sum: %18f\nTotal values counted : %d\n",tempsum,count*(i+1));
        printf("Calculated Array sum: %ld\n\n",(i+1)*fMemSize );
    }

    for(int i=0; i<nHostF; i++){
        cudaFree(f_dev[i]);
    }

    cudaFree(f_temp_dev);
    printf("Memory free. Program successfully complete\n");
    delete f_dev;
    delete f_temp_host;
}

Before I get into, this is the general idea of what is going on:

The general idea is I have x arrays of floats and I want to add each one sequentially to another array (scalar add):

t = array;

a = array of array;

t = zeros

t += a[0]

t += a[1]

...

t += a[N]

where += represents scalar addition.

This is straight forward. I tried to shrink down the code I have to be as compact as possible and preserve the functionality. The issue here is that for certain size arrays - I'm seeing issues at anything larger than 128 x 128 x 108. Basically the summation of the memory copied back to the host is not the same as what I have calculated it to be. I've been stuck on this all day so I'm going to stop wasting my time. I really can't explain why this is happening. Ive reason through:

  • Using too much constant space (not using any)
  • Using too many registers (no)
  • Incorrect condition in kernel to check if idx,idy,idz are within bounds ( this still could be it)
  • Something funny with gpu (tried on gt280, and the tesla C1060 and C2060)
  • Incorrect printf format ( I hope this is it )
    *...

That list could go on. Thanks for browsing through this if you have time. The issue almost seems to be memory related (i.e. memory sizes that are > 128*128*108 don't work. So 64*128*256 will work, or any permutation thereof).

Here Is the full source code (that should be compilable with nvcc):

#include <cuda.h>
#include <iostream>
#include <stdio.h>
#include <assert.h>

#define BSIZE 8

void cudaCheckError(cudaError_t e,const char * msg) {
    if (e != cudaSuccess){
        printf("Error number: %d\n",e);
        printf("%s\n",msg);
    }
};

__global__ void accumulate(float * in,float * out, int3 gdims, int zlevel) {

    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    int idy = blockIdx.y*blockDim.y + threadIdx.y;
    int idz = threadIdx.z;

    long int index = (zlevel*((int)BSIZE)+idz)*gdims.x*gdims.y+ \
        idy*gdims.x+ \
        idx;

    if ( idx < gdims.x && idy < gdims.y && (idz + zlevel*(int)BSIZE) < gdims.z) {

        out[index] += in[index];
    }
};

int main(int argc, char * argv[]) {

    int width, 
    height,
    depth; 

    if (argc != 4) {
        printf("Must have 3 inputs: width height depth\n");
        exit(0);
    }
    float tempsum;
    int count =0;
    width = atoi(argv[1]);
    height = atoi(argv[2]);
    depth = atoi(argv[3]);

    printf("Dimensions (%d,%d,%d)\n",width,height,depth);

    int3 dFull;

    dFull.x = width+2;
    dFull.y = height+2;
    dFull.z = depth+2;

    printf("Dimensions (%d,%d,%d)\n",dFull.x,dFull.y,dFull.z);

    int fMemSize=dFull.x*dFull.y*dFull.z;

    int nHostF=9;

    float * f_hostZero;

    float ** f_dev;

    float * f_temp_host;
    float * f_temp_dev;

    dim3 grid( dFull.x/(int)BSIZE+1, dFull.y/(int)BSIZE + 1);

    dim3 threads((int)BSIZE,(int)BSIZE,(int)BSIZE);
    printf("Threads (x,y) : (%d,%d)\nGrid (x,y) : (%d,%d)\n",threads.x,threads.y,grid.x,grid.y);

    int num_zsteps=dFull.z/(int)BSIZE + 1;
    printf("Number of z steps to take : %d\n",num_zsteps);
    // Host array allocation
    f_temp_host = new float[fMemSize];
    f_hostZero = new float[fMemSize];


    // Allocate nHostF address on host 
    f_dev = new float*[nHostF];

    // Host array assignment
    for(int i=0; i < fMemSize; i++){
        f_temp_host[i] = 1.0;
        f_hostZero[i] = 0.0;
    }

    // Device allocations - allocated for array size + 2
    for(int i=0; i<nHostF; i++){
        cudaMalloc((void**)&f_dev[i],sizeof(float)*fMemSize);
    }


    // Allocate the decive pointer
    cudaMalloc( (void**)&f_temp_dev, sizeof(float)*fMemSize);

    cudaCheckError(cudaMemcpy((void *)f_temp_dev,(const void *)f_hostZero,
        sizeof(float)*fMemSize,cudaMemcpyHostToDevice),"At first mem copy");

    printf("Memory regions allocated\n");

    // Copy memory to each array
    for(int i=0; i<nHostF; i++){
        cudaCheckError(cudaMemcpy((void *)(f_dev[i]),(const void *)f_temp_host,
            sizeof(float)*fMemSize,cudaMemcpyHostToDevice),"At first mem copy");
    }

    // Add value 1.0 (from each array n f_dev[i]) to f_temp_dev
    for (int i=0; i<nHostF; i++){
        for (int zLevel=0; zLevel<num_zsteps; zLevel++){
            accumulate<<<grid,threads>>>(f_dev[i],f_temp_dev,dFull,zLevel);
            cudaThreadSynchronize();
        }
        cudaCheckError(cudaMemcpy((void *)f_temp_host,(const void *)f_temp_dev,
            sizeof(float)*fMemSize,cudaMemcpyDeviceToHost),"At mem copy back");
        tempsum=0.f;
        count =0;
        for(int k = 0 ; k< fMemSize; k++){
            tempsum += f_temp_host[k];

            assert ( (int)f_temp_host[k] == (i+1) );
            if ( f_temp_host[k] !=(float)(i+1) ) {
                printf("Found invalid return value\n");
                exit(0);
            }
            count++;
        }
        printf("Total Count: %d\n",count);
        printf("Real Array sum: %18f\nTotal values counted : %d\n",tempsum,count*(i+1));
        printf("Calculated Array sum: %ld\n\n",(i+1)*fMemSize );
    }

    for(int i=0; i<nHostF; i++){
        cudaFree(f_dev[i]);
    }

    cudaFree(f_temp_dev);
    printf("Memory free. Program successfully complete\n");
    delete f_dev;
    delete f_temp_host;
}

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

云柯 2024-12-11 18:20:30

您的设备代码没有问题。所发生的情况是,在问题规模较大时,您会耗尽单精度浮点的能力来精确计算代码在运行规模较大时产生的大整数值。如果您将主机端求和代码替换为 Kahan 求和,如下所示:

    tempsum=0.f;
    count =0;
    float c=0.f;
    for(int k = 0 ; k< fMemSize; k++){
        float y = f_temp_host[k] - c;
        float t = tempsum + y;
        c = (t - tempsum) - y;
        tempsum = t;

        assert ( (int)f_temp_host[k] == (i+1) );
        if ( f_temp_host[k] !=(float)(i+1) ) {
            printf("Found invalid return value\n");
            exit(0);
        }
        count++;
    }

您应该会发现代码运行正如在较大尺寸下所预期的那样。或者,主机端求和可以用双精度算术来完成。如果您还没有阅读过,我强烈推荐 每个计算机科学家都应该了解浮点运算。它将有助于解释您在此示例中哪里出错了,并且它所传授的智慧可能有助于防止将来犯下类似的错误。

There is nothing wrong with your device code. All that is happening is that at large problems sizes, you are exhausting the capacity of single precision floating point to exactly calculate the large integer values which the code produces at large run sizes. If you replace your host side summation code with Kahan summation, like this:

    tempsum=0.f;
    count =0;
    float c=0.f;
    for(int k = 0 ; k< fMemSize; k++){
        float y = f_temp_host[k] - c;
        float t = tempsum + y;
        c = (t - tempsum) - y;
        tempsum = t;

        assert ( (int)f_temp_host[k] == (i+1) );
        if ( f_temp_host[k] !=(float)(i+1) ) {
            printf("Found invalid return value\n");
            exit(0);
        }
        count++;
    }

you should find the code runs as expected at larger sizes. Alternatively, the host side summation could be done with double precision arithmetic instead. If you have not already read it, I highly recommend What Every Computer Scientist Should Know About Floating-Point Arithmetic. It will help explain where you went wrong in this example, and the wisdom it imparts might help prevent committing similar faux pas in the future.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文