Copying a whole global memory buffer many times to a shared memory buffer



I have a buffer in global memory that I want to copy into shared memory for each block, so as to speed up my read-only access. Each thread in each block will use the whole buffer, at different positions, concurrently.

How does one do that?

I know the size of the buffer only at run time:

__global__ void foo( int *globalMemArray, int N )
{
    // Dynamically sized shared memory; the size in bytes is supplied at launch
    extern __shared__ int s_array[];

    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if( idx < N )
    {

       ...?
    }
}
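For reference, a dynamically sized `extern __shared__` array gets its byte size from the third execution-configuration argument of the kernel launch. A minimal sketch (the launch parameters and variable names are illustrative, and it assumes `N` ints actually fit within the shared memory limit):

int threads = 256;
int blocks  = (N + threads - 1) / threads;

// Third <<<...>>> argument sets the size of the extern __shared__ array
foo<<<blocks, threads, N * sizeof(int)>>>(globalMemArray, N);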


Comments (1)

何时共饮酒 2024-12-21 12:50:22


The first point to make is that shared memory is limited to a maximum of either 16KB or 48KB per streaming multiprocessor (SM), depending on which GPU you are using and how it is configured, so unless your global memory buffer is very small, you will not be able to load all of it into shared memory at the same time.
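If you want to check that budget at run time, a minimal sketch like the following works (standard CUDA runtime API calls, assuming device 0):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Shared memory available to a single block on this device
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);

    // On GPUs that split one on-chip pool between L1 cache and shared
    // memory, a kernel can request the larger shared memory split:
    // cudaFuncSetCacheConfig(foo, cudaFuncCachePreferShared);

    return 0;
}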

The second point to make is that the contents of shared memory only have the scope and lifetime of the block they are associated with. Your sample kernel has only a single global memory argument, which makes me think that you are either under the misapprehension that the contents of a shared memory allocation can be preserved beyond the life span of the block that filled it, or that you intend to write the results of the block calculations back into the same global memory array from which the input data was read. The first possibility is wrong and the second will result in memory races and inconsistent results. It is probably better to think of shared memory as a small, block-scope L1 cache that is fully programmer-managed, rather than as some sort of faster version of global memory.

With those points out of the way, a kernel which loads successive segments of a large input array, processes them, and then writes some per-thread final result back to global memory might look something like this:

template <int blocksize>
__global__ void foo( int *globalMemArray, int *globalMemOutput, int N ) 
{ 
    // Statically sized shared memory tile; blocksize is a compile-time constant
    __shared__ int s_array[blocksize]; 

    // Number of tile-sized passes needed to cover the whole input, rounded up
    int npasses = (N / blocksize) + (((N % blocksize) > 0) ? 1 : 0);

    for(int pos = threadIdx.x; pos < (blocksize*npasses); pos += blocksize) { 
        // Each thread stages one element of the current segment into shared memory
        if( pos < N ) { 
            s_array[threadIdx.x] = globalMemArray[pos];
        }
        __syncthreads();   // wait until the whole tile is loaded

        // Calculations using partial buffer contents
        .......

        __syncthreads();   // make sure the tile is finished with before it is overwritten
    }

    // write final per thread result to the separate output array
    globalMemOutput[threadIdx.x + blockIdx.x*blockDim.x] = .....;
} 

In this case I have specified the shared memory array size as a template parameter, because it isn't really necessary to allocate the shared memory array dynamically at run time, and the compiler has a better chance of performing optimizations when the shared memory array size is known at compile time (in the worst case, a selection between different kernel instances can be made at run time, as sketched below).
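That run-time selection might look like this minimal host-side sketch (the launch_foo helper, the supported block sizes, and the buffer names are illustrative, not part of the code above):

void launch_foo(int *d_in, int *d_out, int N, int threadsPerBlock)
{
    int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;

    // Each case instantiates the template for one compile-time block size
    switch (threadsPerBlock) {
        case 128: foo<128><<<blocks, 128>>>(d_in, d_out, N); break;
        case 256: foo<256><<<blocks, 256>>>(d_in, d_out, N); break;
        case 512: foo<512><<<blocks, 512>>>(d_in, d_out, N); break;
        default:  break; // unsupported block size
    }
}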

The CUDA SDK contains a number of good example codes which demonstrate different ways that shared memory can be used in kernels to improve memory read and write performance. The matrix transpose, reduction and 3D finite difference method examples are all good models of shared memory usage. Each also has a good paper which discusses the optimization strategies behind the shared memory use in the codes. You would be well served by studying them until you understand how and why they work.
