CUDA shared array not getting values?
I am trying to implement a simple parallel reduction, using the code from the CUDA SDK. But somehow there is a problem in my kernel: the shared array is not getting the values of the global array, and it is all zeroes.
    extern __shared__ float4 sdata[];

    // each thread loads one element from global to shared mem
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = dev_src[i];
    __syncthreads();

    // do reduction in shared mem
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    // write result for this block to global mem
    if (tid == 0)
        out[blockIdx.x] = sdata[0];
Edit:
OK, I got it working by removing the extern keyword and making the shared array a constant size like 512. I am in good shape now. Maybe someone can explain why it was not working with the extern keyword.
Comments (1)
I think I know why this is happening, as I have run into it before. How are you launching the kernel?
Remember that in the launch kernel<<<blocks, threads, sharedMemory>>>, sharedMemory should be the size of the shared memory in bytes. So, if you are declaring 512 elements, the third parameter should be 512 * sizeof(float4). I think you are just calling it as below, which is wrong
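The call the answer points to is cut off in this copy. For contrast, here is a rough sketch of a launch that does pass the byte count for an extern __shared__ array. The names reduceKernel, add4 and reducePerBlock are made up for illustration and are not from the question or the SDK. The sketch assumes the block size is a power of two and the input holds exactly blocks * threads elements, and it omits error checking for brevity. If the third launch parameter is left out, the dynamic shared allocation is zero bytes, so every access to sdata is out of bounds, which would be consistent with the asker getting wrong results.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper: plain CUDA headers do not define operator+ for float4.
__device__ inline float4 add4(float4 a, float4 b) {
    return make_float4(a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w);
}

// Same interleaved reduction as in the question, with dynamically sized shared memory.
__global__ void reduceKernel(const float4* dev_src, float4* out) {
    // Unsized extern array: its size comes from the third launch parameter.
    extern __shared__ float4 sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = dev_src[i];
    __syncthreads();

    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0) {
            sdata[tid] = add4(sdata[tid], sdata[tid + s]);
        }
        __syncthreads();
    }

    if (tid == 0) out[blockIdx.x] = sdata[0];
}

// Hypothetical host wrapper: returns one partial sum per block.
// Assumes threads is a power of two and src.size() == blocks * threads.
std::vector<float4> reducePerBlock(const std::vector<float4>& src, int blocks, int threads) {
    float4* dev_src = nullptr;
    float4* dev_out = nullptr;
    cudaMalloc(&dev_src, src.size() * sizeof(float4));
    cudaMalloc(&dev_out, blocks * sizeof(float4));
    cudaMemcpy(dev_src, src.data(), src.size() * sizeof(float4), cudaMemcpyHostToDevice);

    // Third launch parameter: shared memory size in BYTES, not in elements.
    size_t sharedBytes = threads * sizeof(float4);
    reduceKernel<<<blocks, threads, sharedBytes>>>(dev_src, dev_out);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    std::vector<float4> result(blocks);
    cudaMemcpy(result.data(), dev_out, blocks * sizeof(float4), cudaMemcpyDeviceToHost);
    cudaFree(dev_src);
    cudaFree(dev_out);
    return result;
}
```

Calling reducePerBlock(src, 2, 512) on 1024 elements would request 512 * sizeof(float4) bytes of shared memory per block, matching the figure in the answer. The fixed-size declaration from the asker's edit works without the third parameter because the compiler then knows the array size at compile time.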