CUDA shared array not getting values?

Posted on 2025-01-05 05:54:06

I am trying to implement a simple parallel reduction, using the code from the CUDA SDK. But somehow there is a problem in my kernel: the shared array is not getting the values of the global array, and it is all zeroes.

extern __shared__ float4 sdata[];

// each thread loads one element from global to shared mem
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

sdata[tid] = dev_src[i];

__syncthreads();

// do reduction in shared mem
for(unsigned int s = 1; s < blockDim.x; s *= 2) {
    if(tid % (2*s) == 0){
        sdata[tid] += sdata[tid + s];
    }
    __syncthreads();
}

// write result for this block to global mem
if(tid == 0)
    out[blockIdx.x] = sdata[0];

Edit:

OK, I got it working by removing the extern keyword and giving the shared array a constant size, like 512. I am in good shape now. Maybe someone can explain why it was not working with the extern keyword.

Comments (1)

妄司 2025-01-12 05:54:06

I think I know why this is happening, as I have faced this before. How are you launching the kernel?

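Remember that in the launch kernel<<<blocks,threads,sharedMemory>>>, sharedMemory should be the size of the shared memory in bytes, not the number of elements. So, if you are declaring for 512 elements, the third parameter should be 512 * sizeof(float4). Passing 512 reserves only 512 bytes, which is room for just 32 float4 elements (16 bytes each), so a 512-thread block would index far past the end of the array. I think you are just calling as below, which is wrong: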

kernel<<<blocks,threads,512>>>   // this is wrong
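
For comparison, here is a minimal sketch of what a corrected launch might look like, keeping the extern declaration from the question. The names reduceKernel, add4 and sharedBytes are made up for this illustration and do not come from the original post; the sketch also assumes a power-of-two block size and an input length that is an exact multiple of the block size. Because float4 has no built-in arithmetic operators in plain CUDA, a small helper stands in for the += in the question's loop.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: component-wise float4 addition, used instead of +=
// so the sketch does not depend on any operator-overload header.
__device__ inline float4 add4(float4 a, float4 b) {
    return make_float4(a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w);
}

// Hypothetical kernel name; the body follows the reduction in the question.
__global__ void reduceKernel(const float4 *dev_src, float4 *out) {
    // Size is NOT given here; it comes from the third launch parameter (bytes).
    extern __shared__ float4 sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = dev_src[i];
    __syncthreads();

    // Interleaved-addressing reduction; assumes blockDim.x is a power of two.
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0) {
            sdata[tid] = add4(sdata[tid], sdata[tid + s]);
        }
        __syncthreads();
    }

    if (tid == 0) out[blockIdx.x] = sdata[0];
}

// Hypothetical helper: bytes of dynamic shared memory for one float4 per thread.
size_t sharedBytes(unsigned int threads) { return threads * sizeof(float4); }

int main() {
    const unsigned int threads = 512, blocks = 4, n = threads * blocks;

    float4 *h_src = (float4 *)malloc(n * sizeof(float4));
    for (unsigned int k = 0; k < n; ++k) h_src[k] = make_float4(1.f, 2.f, 3.f, 4.f);

    float4 *dev_src = nullptr, *dev_out = nullptr;
    cudaMalloc(&dev_src, n * sizeof(float4));
    cudaMalloc(&dev_out, blocks * sizeof(float4));
    cudaMemcpy(dev_src, h_src, n * sizeof(float4), cudaMemcpyHostToDevice);

    // Third parameter is a byte count: 512 * sizeof(float4) = 8192, not 512.
    reduceKernel<<<blocks, threads, sharedBytes(threads)>>>(dev_src, dev_out);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    float4 h_out[blocks];
    cudaMemcpy(h_out, dev_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    for (unsigned int b = 0; b < blocks; ++b) {
        // With the input above, each block sums 512 copies of (1, 2, 3, 4).
        printf("block %u: %g %g %g %g\n", b, h_out[b].x, h_out[b].y, h_out[b].z, h_out[b].w);
    }

    cudaFree(dev_src);
    cudaFree(dev_out);
    free(h_src);
    return 0;
}
```

Computing the byte count from the same threads value that is used for the block size keeps the two from drifting apart, which is probably the easiest way to avoid this mistake. The fixed-size declaration from the question's edit works for the same reason: the compiler then knows the array size and the third launch parameter is no longer needed.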