CUDA: shared memory allocation with overlapping borders

Posted 2024-10-30 18:32:31


Is there an easy way (google hasn't delivered...) to allocate per-block shared memory regions from a single input array such that there can be an overlap?

The simple example is string searching: say I want to dice up the input text and have each thread in each block search for a pattern starting from text[thread_id], but I want the data assigned to each block to overlap by the pattern length, so that matches falling across the border are still found.

I.e. the total amount of shared memory allocated on each block is

(blocksize+patternlength)*sizeof(char)

I'm probably missing something simple and am currently diving through the CUDA guide, but would appreciate some guidance.

UPDATE: I suspect some people have misunderstood my question (or I mis-explained it).

Say I have a dataset QWERTYUIOP, and I want to search for a 3-character match, and I dice up the dataset (arbitrarily) into chunks of 4 per thread block: QWER TYUI OPxx

This is simple enough to accomplish, but the algorithm fails if the 3-character pattern is actually IOP, since it straddles a block boundary.

In this case, what I want is for each block to have in shared memory:

QWERTY TYUIOP OPxxxx

i.e. each block gets assigned blocksize+patternlength-1 characters, so no memory border issues occur.

Hope that explains things better.
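
To make the arithmetic concrete, here is a minimal host-side sketch of the chunking described above (printChunks is an illustrative helper, not part of the kernels below):

#include <stdio.h>

// Print the overlapped slice of T each block would be assigned.
// Assumes T is padded with filler (the 'x's above) so every slice is full length.
void printChunks(int lenT, int blocksize, int patternlength)
{
    int chunklen = blocksize + patternlength - 1;   // 4 + 3 - 1 = 6 in the example
    for (int start = 0; start < lenT; start += blocksize)
        printf("block %d: T[%d..%d)\n", start / blocksize, start, start + chunklen);
}

For lenT = 12 ("QWERTYUIOP" plus two filler characters), blocksize = 4 and patternlength = 3, this yields T[0..6) = QWERTY, T[4..10) = TYUIOP and T[8..14) = OPxxxx, matching the example above.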

Since @jmilloy is being persistent... :P

//VERSION 1: Simple
__global__ void gpuSearchSimple(char *T, int lenT, char *P, int lenP, int *pFound)
{
    // one candidate start position per thread; assumes T is padded past lenT
    int startIndex = blockDim.x*blockIdx.x + threadIdx.x;
    int fMatch = 1;
    for (int i = 0; i < lenP; i++)
    {
        if (T[startIndex+i] != P[i]) fMatch = 0;
    }
    if (fMatch) atomicMin(pFound, startIndex);
}
//VERSION 2: Texture
__global__ void gpuSearchTexture(int lenT, int lenP, int *pFound)
{
    // same as version 1, but reads both text and pattern through 1D textures
    int startIndex = blockDim.x*blockIdx.x + threadIdx.x;
    int fMatch = 1;
    for (int i = 0; i < lenP; i++)
    {
        if (tex1Dfetch(texT,startIndex+i) != tex1Dfetch(texP,i)) fMatch = 0;
    }
    if (fMatch) atomicMin(pFound, startIndex);
}
//Version 3: Shared
__global__ void gpuSearchTexSha(int lenT, int lenP, int *pFound)
{
    // stage the pattern into dynamically sized shared memory, one stride at a time
    extern __shared__ char shaP[];
    for (int i = 0; threadIdx.x+i < lenP; i += blockDim.x){
        shaP[threadIdx.x+i] = tex1Dfetch(texP,threadIdx.x+i);
    }
    __syncthreads();

    //At this point shaP is populated with the pattern
    int startIndex = blockDim.x*blockIdx.x + threadIdx.x;
    // only continue if an earlier instance hasn't already been found
    int fMatch = 1;
    for (int i = 0; i < lenP; i++)
    {
        if (tex1Dfetch(texT,startIndex+i) != shaP[i]) fMatch = 0;
    }
    if (fMatch) atomicMin(pFound, startIndex);
}

What I would like to have done is to put the text into shared memory chunks, as described in the rest of the question, instead of keeping the text in texture memory for the later versions.


Comments (3)

梦初启 2024-11-06 18:32:31


I am not sure that question makes all that much sense. You can dynamically size a shared memory allocation at runtime like this:

__global__ void kernel()
{
    extern __shared__ int buffer[];
    ....
}

kernel<<< gridsize, blocksize, buffersize >>>(); // buffersize = dynamic shared memory size in bytes

but the contents of the buffer are undefined at the beginning of the kernel. You will have to devise a scheme in the kernel to load from global memory with the overlap you want, so that your pattern matching works as intended.
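
A minimal sketch of such a scheme, assuming the text is in global memory and padded so the last block's overlap read stays in bounds (loadOverlapped and shaT are illustrative names):

// Each block cooperatively stages blockDim.x + lenP - 1 characters of T into
// shared memory, overlapping the next block's region by lenP - 1 characters.
__global__ void loadOverlapped(const char *T, int lenT, int lenP)
{
    extern __shared__ char shaT[];
    int blockStart = blockIdx.x * blockDim.x;
    int chunkLen   = blockDim.x + lenP - 1;

    // strided copy: the chunk is longer than the block, so some threads copy twice
    for (int i = threadIdx.x; i < chunkLen; i += blockDim.x)
        shaT[i] = T[blockStart + i];
    __syncthreads();

    // ...each thread can now match against shaT[threadIdx.x .. threadIdx.x + lenP - 1]...
}

The launch would then pass the chunk size as the dynamic shared memory argument, e.g. loadOverlapped<<< gridsize, blocksize, (blocksize + lenP - 1) * sizeof(char) >>>(T, lenT, lenP);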

左耳近心 2024-11-06 18:32:31


No. Shared memory is shared between threads in a block, and is ONLY accessible to the block it is assigned to. You cannot have shared memory that is available to two different blocks.

As far as I know, shared memory actually resides on the multiprocessors, and a thread can only access the shared memory from the multiprocessor that it is running on. So this is a physical limitation. (I guess if two blocks reside on one mp, a thread from one block may be able to unpredictably access the shared memory that was allocated to the other block).

Remember that you need to explicitly copy the data from global memory to shared memory. It is a simple matter to copy overlapping regions of the string to non-overlapping shared memory.

I think getting your data where you need it is the majority of the work required in developing CUDA programs. My guidance is that you start with a version that solves the problem without using any shared memory. To make that work, you will have to solve your overlapping problem, and then the shared memory implementation will be easy!


edit 2
after answer was marked as correct

__global__ void gpuSearchTexSha(int lenT, int lenP, int *pFound)
{
    extern __shared__ char shared[];

    char* shaP = &shared[0];      // first lenP bytes hold the pattern
    char* shaT = &shared[lenP];   // remaining lenS bytes hold this block's text chunk

    //copy pattern into shaP in parallel (assumes lenP <= blockDim.x)
    if(threadIdx.x < lenP)
        shaP[threadIdx.x] = tex1Dfetch(texP,threadIdx.x);

    //determine texT start and length for this block
    //(assumes gridDim.x * blockDim.x == lenT, so each block covers blockDim.x chars)
    int blockStartIndex = blockIdx.x * blockDim.x;
    int lenS = blockDim.x + lenP - 1;

    //copy text into shaT in parallel, including the lenP-1 overlap characters
    shaT[threadIdx.x] = tex1Dfetch(texT,blockStartIndex + threadIdx.x);
    if(blockDim.x + threadIdx.x < lenS)
        shaT[blockDim.x + threadIdx.x] = tex1Dfetch(texT,blockStartIndex + blockDim.x + threadIdx.x);

    __syncthreads();

    //We have one copy of the pattern in shaP, shared by every thread in the block
    //We have the necessary portion of the text (with overlaps) in shaT

    int fMatch = 1;
    for (int i=0; i < lenP; i++)
    {
        if (shaT[threadIdx.x+i] != shaP[i]) fMatch = 0;
    }
    if (fMatch) atomicMin(pFound, blockStartIndex + threadIdx.x);
}

key notes:

  • we only need one copy of the pattern in shared memory per block - all threads in the block can use it
  • shared memory needed per block is lenP + lenS (where lenS is your blocksize + patternlength - 1)
  • the kernel assumes that gridDim.x * blockDim.x = lenT (the same as version 1)
  • we can copy into shared memory in parallel (no need for for loops if you have enough threads)
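
A hypothetical host-side launch for the kernel above, with the shared memory size taken straight from these notes (blocksize and pFound_d are illustrative names; texT and texP are assumed to be bound already):

// one thread per text position; dynamic shared memory holds
// the pattern (lenP bytes) plus the overlapped text chunk (lenS bytes)
int blocksize = 256;
int gridsize  = lenT / blocksize;          // assumes lenT is a multiple of blocksize
int lenS      = blocksize + lenP - 1;
size_t shmem  = (lenP + lenS) * sizeof(char);
gpuSearchTexSha<<< gridsize, blocksize, shmem >>>(lenT, lenP, pFound_d);
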
国际总奸 2024-11-06 18:32:31


Overlapping shared memory is not good; threads would have to synchronize each time they want to access the same address in shared memory (although on architectures >= 2.0 this has been mitigated).

The simplest idea that comes to mind is to duplicate the portion of the text that you want to be overlapped.

Instead of reading from the global memory in exact chunks:

AAAA BBBB CCCC DDDD EEEE

Read with overlapping:

AAAA BBBB CCCC CCCC DDDD EEEEE
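
A minimal host-side sketch of that duplication, assuming out is a scratch buffer of at least numChunks * (chunksize + overlap) bytes and T is padded so the final tail read is safe (all names here are illustrative):

#include <string.h>

// Build an expanded copy of the text in which each chunk carries the first
// `overlap` characters of the following chunk appended to it.
void expandWithOverlap(const char *T, int lenT, int chunksize, int overlap, char *out)
{
    int numChunks = lenT / chunksize;      // assumes lenT is a multiple of chunksize
    for (int c = 0; c < numChunks; c++)
        memcpy(out + c * (chunksize + overlap),   // destination: non-overlapping slots
               T   + c * chunksize,               // source: overlapping regions of T
               chunksize + overlap);
}

Each block then reads its own (chunksize + overlap)-character slice of out, so no read ever crosses a chunk boundary.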
