CUDA memory problem

Posted 2024-07-09 02:31:50

I have a CUDA kernel which I'm compiling to a cubin file without any special flags:

nvcc text.cu -cubin

It compiles, though with this message:

Advisory: Cannot tell what pointer points to, assuming global memory space

and a reference to a line in some temporary cpp file. I can make the message go away by commenting out some seemingly arbitrary code, which makes no sense to me.

The kernel is as follows:

__global__ void string_search(char** texts, int* lengths, char* symbol, int* matches, int symbolLength)
{
    int localMatches = 0;
    int blockId = blockIdx.x + blockIdx.y * gridDim.x;
    int threadId = threadIdx.x + threadIdx.y * blockDim.x;
    int blockThreads = blockDim.x * blockDim.y;

    __shared__ int localMatchCounts[32];

    bool breaking = false;
    for(int i = 0; i < (lengths[blockId] - (symbolLength - 1)); i += blockThreads)
    {
        if(texts[blockId][i] == symbol[0])
        {
            for(int j = 1; j < symbolLength; j++)
            {
                if(texts[blockId][i + j] != symbol[j])
                {
                    breaking = true;
                    break;
                }
            }
            if (breaking) continue;
            localMatches++;
        }
    }

    localMatchCounts[threadId] = localMatches;

    __syncthreads();

    if(threadId == 0)
    {
        int sum = 0;
        for(int i = 0; i < 32; i++)
        {
            sum += localMatchCounts[i];
        }
        matches[blockId] = sum;
    }
}

If I replace the line

localMatchCounts[threadId] = localMatches;

after the first for loop with this line

localMatchCounts[threadId] = 5;

it compiles with no notices. The same can be achieved by commenting out seemingly random parts of the loop above that line. I have also tried replacing the shared memory array with a normal array, to no effect. Can anyone tell me what the problem is?

The system is 64-bit Vista, for what it's worth.

Edit: I fixed the code so it actually works, though it still produces the compiler notice. The warning does not seem to be a problem, at least with regard to correctness (it might affect performance).

难如初 2024-07-16 02:31:50

Arrays of pointers like char** are problematic in kernels, since kernels have no access to the host's memory: every pointer stored in the array would itself have to point to device memory.

It is better to allocate a single contiguous buffer and to divide it in a manner that enables parallel access.

In this case I'd define a 1D array which contains all the strings positioned one after another, and another 1D array of size 2*numberOfStrings which contains the offset of each string within the first array and its length:

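For example - preparation for the kernel:

// st[s] is the s-th host string; length() stands for however its length is obtained (e.g. strlen).
int* metadata = new int[numberOfStrings * 2];
int lastpos = 0;
for (int s = 0; s < numberOfStrings; s++)
{
    metadata[2 * s] = lastpos;               // offset of string s within buffer
    metadata[2 * s + 1] = length(st[s]);     // length of string s
    lastpos += metadata[2 * s + 1];
}
// buffer holds st[0] + st[1] + st[2] + ... back to back, lastpos chars in total (memcpy is from <cstring>)
char* buffer = new char[lastpos];
for (int s = 0; s < numberOfStrings; s++)
    memcpy(buffer + metadata[2 * s], st[s], metadata[2 * s + 1]);
// Both buffer and metadata must then be copied to device memory before the kernel launch.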
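A self-contained version of this packing step might look like the sketch below. It is plain host-side C++ (it should also compile unchanged in a .cu file), and it assumes the strings start out as std::string objects. The names PackedStrings and packStrings are made up for illustration and are not part of the original answer.

```cpp
#include <string>
#include <vector>

// Hypothetical container for the packed layout; the names are illustrative.
struct PackedStrings {
    std::vector<char> buffer;   // all strings back to back, no terminators
    std::vector<int> metadata;  // metadata[2*s] = offset, metadata[2*s+1] = length
};

// Pack the strings into one contiguous buffer plus an offset/length table.
PackedStrings packStrings(const std::vector<std::string>& strings) {
    PackedStrings packed;
    packed.metadata.reserve(strings.size() * 2);
    for (const std::string& s : strings) {
        // The current end of the buffer is where this string will start.
        packed.metadata.push_back(static_cast<int>(packed.buffer.size()));
        packed.metadata.push_back(static_cast<int>(s.size()));
        packed.buffer.insert(packed.buffer.end(), s.begin(), s.end());
    }
    return packed;
}
```

For the strings "abc", "de" and "fghi" this gives the buffer "abcdefghi" and the metadata 0, 3, 3, 2, 5, 4. The two vectors' data() pointers are then flat arrays that can each be copied to the device with a single cudaMemcpy, with no nested pointers involved.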

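In kernel:

// One string per thread; blockThreads is the number of threads per block, as in the question.
// With one string per block, as in the question's kernel, currentIndex would simply be blockId.
int currentIndex = threadId + blockId * blockThreads;
const char* currentString = buffer + metadata[2 * currentIndex];
int currentStringLength = metadata[2 * currentIndex + 1];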
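To check a kernel's output against known-good counts, a CPU reference over the same packed layout can be handy. The following is only a sketch in plain C++, not a kernel. The function name countMatches is hypothetical, and it assumes symbolLength is at least 1. Note that it resets its match flag at every start position, which the question's kernel does not do for its breaking variable.

```cpp
// CPU reference (not a kernel): counts the start positions at which `symbol`
// occurs in string number `index` of the packed layout. Overlapping
// occurrences are counted separately. All names here are illustrative.
int countMatches(const char* buffer, const int* metadata, int index,
                 const char* symbol, int symbolLength) {
    const char* currentString = buffer + metadata[2 * index];
    int currentStringLength = metadata[2 * index + 1];
    int matches = 0;
    // The last valid start position is currentStringLength - symbolLength,
    // so the comparison never reads into the next string.
    for (int i = 0; i + symbolLength <= currentStringLength; i++) {
        bool matched = true;  // reset for every start position
        for (int j = 0; j < symbolLength; j++) {
            if (currentString[i + j] != symbol[j]) {
                matched = false;
                break;
            }
        }
        if (matched) matches++;
    }
    return matches;
}
```

With the buffer "ababxyzab" and metadata 0, 4, 4, 3, 7, 2 (the strings "abab", "xyz" and "ab"), searching for "ab" yields 2, 0 and 1 matches respectively.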
