CUDA version slower than CPU version?

Posted 2024-10-16 04:34:24


I am writing an image subsampler in CUDA and use the threads to perform the averaging operation. However, if I do this without calling the kernel, it runs much faster compared to when I actually call the CUDA kernel. The image size right now is 1280x1024. Does the kernel call usually take substantial time, or is there something wrong with my implementation?

P.S. I tried calling just the kernel (with the code removed), and it takes pretty much the same time as the kernel with code. Also, my code without the kernel call runs in approximately 350 ms, whereas with the kernel call it runs close to 1000 ms.

__global__ void subsampler(int *r_d, int *g_d, int *b_d, int height, int width,
                           int *f_r, int *f_g, int *f_b)
{
    /* Flatten the 2D grid of 2D blocks into one linear thread id */
    int id = blockIdx.x * blockDim.x * blockDim.y
           + threadIdx.y * blockDim.x + threadIdx.x
           + blockIdx.y * gridDim.x * blockDim.x * blockDim.y;
    if (id < height * width / 4) {
        /* Each output pixel is the average of four consecutive input pixels, per channel */
        f_r[id] = (r_d[4*id] + r_d[4*id+1] + r_d[4*id+2] + r_d[4*id+3]) / 4;
        f_g[id] = (g_d[4*id] + g_d[4*id+1] + g_d[4*id+2] + g_d[4*id+3]) / 4;
        f_b[id] = (b_d[4*id] + b_d[4*id+1] + b_d[4*id+2] + b_d[4*id+3]) / 4;
    }
}

I define blockSizeX and blockSizeY to be 1 and 1 (I tried making them 4 and 16), but somehow this is the fastest:

 dim3 blockSize(blocksizeX, blocksizeY);
 int new_width  = img_width / 2;
 int new_height = img_height / 2;

 /* Round up so the grid covers the whole output image */
 int n_blocks_x = new_width / blocksizeX + (new_width % blocksizeX == 0 ? 0 : 1);
 int n_blocks_y = new_height / blocksizeY + (new_height % blocksizeY == 0 ? 0 : 1);
 dim3 gridSize(n_blocks_x, n_blocks_y);

and then I call the kernel with gridSize and blockSize.
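
The launch itself looks roughly like this (a sketch; the arguments are the device pointers and dimensions from the kernel signature above, and kernel launches are asynchronous, so a synchronize is needed before any CPU-side timing):

 subsampler<<<gridSize, blockSize>>>(r_d, g_d, b_d, img_height, img_width, f_r, f_g, f_b);
 cudaDeviceSynchronize();  /* the launch returns immediately; wait before timing on the host */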


Comments (2)

Anonymous 2024-10-23 04:34:24


It may be that the kernel is not implemented very well, or it may be that the overhead of moving your data to and from the GPU card is swamping any computational benefit. Try benchmarking the kernel in isolation (without CPU <-> GPU memory transfers) to see how much of your total time is taken by the kernel and how much by memory transfers. You can then decide based on these measurements whether you need to do more work on the kernel.
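
A minimal sketch of that kind of measurement with CUDA events (variable names reuse the ones from the question; the transfers would be timed separately the same way):

 cudaEvent_t start, stop;
 float kernel_ms = 0.0f;
 cudaEventCreate(&start);
 cudaEventCreate(&stop);

 cudaEventRecord(start, 0);
 subsampler<<<gridSize, blockSize>>>(r_d, g_d, b_d, img_height, img_width, f_r, f_g, f_b);
 cudaEventRecord(stop, 0);
 cudaEventSynchronize(stop);                     /* wait for the kernel to finish */
 cudaEventElapsedTime(&kernel_ms, start, stop);  /* elapsed GPU time in milliseconds */

 /* time the cudaMemcpy calls in and out the same way, then compare the two numbers */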

素染倾城色 2024-10-23 04:34:24


While I'm not sure what hardware you're running this on, you should be able to make this kernel perform closer to 1000 fps rather than 1000 ms/frame :)

Suggestion 1: If this processing has any interaction with visualization, through OpenGL/DirectX or similar, just do this as a shader -- all the details of grid/block size, memory layout, etc., are handled for you. If you really need to implement this yourself in CUDA, then keep reading:

First, I assume you're subsampling your 1280x1024 image by a factor of 2 in each direction, yielding a 640x512 image. Each pixel in the resulting image is the average of four pixels in the source image. The images have three channels, RGB.

Question 1: Do you really want 32 bits per channel or did you want RGB888 (8 bits per channel)? RGB888 is fairly common -- I will assume this is what you meant.

Question 2: Is your data actually planar, or are you extracting it from an interleaved format? RGB888 is an interleaved format, where pixels appear in memory as RGBRGBRGB. I would write your kernel to process the image in its native format. I will assume your data is actually planar, so you have three planes, R8, G8, and B8.
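
In other words (a sketch; img, R, G, B, and i are placeholder names, and uchar stands for unsigned char):

 /* Interleaved RGB888: pixel i occupies three consecutive bytes R,G,B */
 uchar r_i = img[3 * i], g_i = img[3 * i + 1], b_i = img[3 * i + 2];

 /* Planar: three separate arrays, pixel i at index i in each plane */
 uchar r_p = R[i], g_p = G[i], b_p = B[i];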

The first thing to do is consider memory layout. You will want one thread for every pixel in the destination image. Given that the memory access pattern for subsampling is not coalesced, you will want to read the pixel data into shared memory. Consider a block size of 32x8 threads. This allows each block to read in 32*8*4 pixels, or 3072 bytes at 3bpp. You will actually read in slightly more than that to keep the loads coalesced, for a total of 4096 bytes per block. This now gives you:

dim3 block(32, 8);
dim3 grid(1280 / 2 / 32, 1024 / 2 / 8); // 20x64 blocks of 256 threads

Now comes the fun part: doing the shared memory. Your kernel could look like this:

typedef unsigned char uchar;  /* CUDA has no built-in uchar type */

__global__ void subsample(uchar* r, uchar* g, uchar* b,    // in
                          uchar* ro, uchar* go, uchar* bo) // out
{
    /* Flatten the 32x8 block into one linear thread index, 0..255 */
    int tid = threadIdx.y * blockDim.x + threadIdx.x;

    /* Global offset into output pixel arrays */
    int gid = blockIdx.y * gridDim.x * blockDim.x + blockIdx.x * blockDim.x;

    /* Global offset into input pixel arrays */
    int gidin = gid * 2;

    __shared__ uchar rc[1024];
    __shared__ uchar gc[1024];
    __shared__ uchar bc[1024];

    /* Read r, g, and b into the shmem cache with int-wide coalesced loads */
    ((int*)rc)[tid] = ((int*)r)[gidin + tid];
    ((int*)gc)[tid] = ((int*)g)[gidin + tid];
    ((int*)bc)[tid] = ((int*)b)[gidin + tid];

    __syncthreads();

    /* Shared memory for output */
    __shared__ uchar roc[256];
    __shared__ uchar goc[256];
    __shared__ uchar boc[256];

    /* Do the subsampling, one pixel per thread. Store into the output shared memory */

    ...

    __syncthreads();

    /* Finally, write the result to global memory with coalesced stores */
    if (tid < 64) {
        ((int*)ro)[gid + tid] = ((int*)roc)[tid];
    } else if (tid < 128) {
        ((int*)go)[gid + tid - 64] = ((int*)goc)[tid - 64];
    } else if (tid < 192) {
        ((int*)bo)[gid + tid - 128] = ((int*)boc)[tid - 128];
    }
}
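
The elided averaging step could look something like this (purely a hypothetical sketch, assuming rc/gc/bc hold a 64x16 input tile in row-major order and each thread produces one pixel of a 32x8 output tile):

 /* Hypothetical: average each 2x2 input block into one output pixel */
 int ox = threadIdx.x, oy = threadIdx.y;   /* output coords in the 32x8 tile */
 int ix = 2 * ox, iy = 2 * oy;             /* top-left of the 2x2 input block */
 roc[oy * 32 + ox] = (rc[iy * 64 + ix] + rc[iy * 64 + ix + 1]
                    + rc[(iy + 1) * 64 + ix] + rc[(iy + 1) * 64 + ix + 1]) / 4;
 /* ...and the same for gc -> goc and bc -> boc */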

Whew! Lot of stuff there, sorry for the code dump. Some principles to keep in mind:

1) Memory is fast when you use coalesced loads/stores. That means each thread in a warp of 32 accesses a 32-bit word, and if the word index matches the thread index within the warp, all 32 accesses get combined into one 128-byte transaction. This is how you get the ~100 GB/s bandwidth of the GPU.
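
For instance, within one warp (data and base are placeholders):

 /* Coalesced: thread k of the warp reads 32-bit word k -- one 128-byte transaction */
 int v = data[base + threadIdx.x];

 /* Not coalesced: strided reads scatter across memory segments -- several transactions */
 int w = data[base + 4 * threadIdx.x];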

2) The pattern of memory access when doing subsampling is not coalesced, since it relies on 2D spatial locality, which the raw memory does not have. (You could use texture memory for this as well...) By storing the input in shared memory and then processing it, you minimize the impact on your compute performance.

I hope this helps -- I can reply with more detail on some parts if you'd like.
