Convolution in CUDA of an array with a filter

Posted 2024-09-25 14:38:45


I'm trying to take the convolution of a 256x256 array of data with a 3x3 filter on a GPU using shared memory. I understand that I'm to break the array up into blocks, and then apply the filter within each block. This ultimately means that blocks will overlap along the edges, and some padding will need to be done around the edges where there is no data so that the filter works properly.

int grid = (256/(16+3-1))*(256/(16+3-1))
where 256 is the length or width of my array, 16 is the length or width of my block in shared memory, 3 is the length or width of my filter, and I subtract one so it comes out even.

int thread = (16+3-1)*(16+3-1)
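
With integer division, 256/(16+3-1) = 256/18 = 14, so grid works out to 14*14 = 196 blocks and thread to 18*18 = 324 threads per block.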

Now I call my kernel <<<grid, thread>>>(output, input, 256)
input and output are arrays of size 256*256.
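
For reference, a minimal host-side sketch of that launch might look like this. The device pointer names are placeholders, a dim3 grid and block are assumed because the kernel below indexes blockIdx.y and threadIdx.y, and the arguments follow the kernel's parameter order:

#include <cuda_runtime.h>

__global__ void kernel(float *input, float *output, int size);  // defined below

int main()
{
    const int size    = 256;          // array is 256x256
    const int tileDim = 16 + 3 - 1;   // 18x18 threads per block (16 plus the halo)

    float *d_input, *d_output;
    cudaMalloc(&d_input,  size * size * sizeof(float));
    cudaMalloc(&d_output, size * size * sizeof(float));

    dim3 block(tileDim, tileDim);               // 18*18 = 324 threads per block
    dim3 grid(size / tileDim, size / tileDim);  // 14*14 = 196 blocks (integer division)

    kernel<<<grid, block>>>(d_input, d_output, size);
    cudaDeviceSynchronize();

    cudaFree(d_input);
    cudaFree(d_output);
    return 0;
}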

__global__ void kernel(float *input, float *output, int size)
{
    const int N = 3;                            // filter width
    const int width  = 16 + N - 2;              // last column index of the shared tile
    const int height = 16 + N - 2;              // last row index of the shared tile
    __shared__ float tile[16+3-1][16+3-1];      // 18x18 tile, block plus halo

    int bIdx = blockIdx.x;
    int bIdy = blockIdx.y;
    int tIdx = threadIdx.x;
    int tIdy = threadIdx.y;

    // i is for input
    unsigned int iX = bIdx * 3 + tIdx;
    unsigned int iY = bIdy * 3 + tIdy;

    if (tIdx == 0 || tIdx == width || tIdy == 0 || tIdy == height)
    {
        // this will pad the outside edges
        tile[tIdy][tIdx] = 0;
    }
    else
    {
        // this will fill in the tile with real data
        unsigned int iin = iY * size + iX;
        tile[tIdy][tIdx] = input[iin];
    }

    __syncthreads();

    // I believe the above is correct; below, where I do the convolution, I feel is wrong
    float result = 0;
    for (int fX = -N/2; fX <= N/2; fX++) {
        for (int fY = -N/2; fY <= N/2; fY++) {
            if (iY+fX >= 0 && iY+fX < size && iX+fY >= 0 && iX+fY < size)
                result += tile[tIdx+fX][tIdy+fY];
        }
    }
    output[iY*size + iX] = result / (3*3);
}

When I run the code, if the convolution part is included, I get a kernel error. Any insights or suggestions?


1 comment

半步萧音过轻尘 2024-10-02 14:38:45


Check out the sobelFilter SDK sample.

It uses texture to deal with the edge cases, overfetches blocks slightly (but the texture cache makes that more efficient), and uses shared memory for the processing.

The subtle thing about the shared memory is that you get 4-way bank conflicts if you read adjacent bytes. One way to get around this, illustrated in the sobelFilter sample, is to unroll your loop 4x and access every fourth byte.
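
For illustration, here is a minimal sketch of the tiled shared-memory pattern the sample uses, written as a plain 3x3 box filter rather than the Sobel operator. The kernel name and macros are made up, manual boundary clamping stands in for the texture fetches, and the 4x unroll that avoids the bank conflicts is omitted for clarity:

#include <cuda_runtime.h>

#define TILE   16   // output pixels per block edge
#define RADIUS 1    // 3x3 filter -> 1-pixel halo

__global__ void boxFilter3x3(const float *in, float *out, int width, int height)
{
    __shared__ float tile[TILE + 2 * RADIUS][TILE + 2 * RADIUS];

    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int outX = blockIdx.x * TILE + tx;   // output pixel this thread produces
    int outY = blockIdx.y * TILE + ty;

    // Cooperatively load the tile plus its halo, clamping reads at the image border
    // (the sobelFilter sample lets the texture unit handle this part instead).
    for (int ly = ty; ly < TILE + 2 * RADIUS; ly += blockDim.y)
        for (int lx = tx; lx < TILE + 2 * RADIUS; lx += blockDim.x) {
            int srcX = min(max((int)(blockIdx.x * TILE) + lx - RADIUS, 0), width  - 1);
            int srcY = min(max((int)(blockIdx.y * TILE) + ly - RADIUS, 0), height - 1);
            tile[ly][lx] = in[srcY * width + srcX];
        }
    __syncthreads();

    // Each thread averages the 3x3 neighbourhood around its pixel from shared memory.
    if (outX < width && outY < height) {
        float sum = 0.0f;
        for (int fy = -RADIUS; fy <= RADIUS; ++fy)
            for (int fx = -RADIUS; fx <= RADIUS; ++fx)
                sum += tile[ty + RADIUS + fy][tx + RADIUS + fx];
        out[outY * width + outX] = sum / 9.0f;
    }
}

Launched with dim3 block(TILE, TILE) and a grid of (width + TILE - 1) / TILE by (height + TILE - 1) / TILE blocks, every output pixel is covered, including the partial tiles at the right and bottom edges.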
