CUDA counting, reduction and thread warps
I'm trying to create a cuda program that counts the number of true values (defined by non-zero values) in a long vector through a reduction algorithm. I'm getting funny results. I get either 0 or (ceil(N/threadsPerBlock)*threadsPerBlock), neither is correct.
    __global__ void count_reduce_logical(int * l, int * cntl, int N){
        // suml is assumed to blockDim.x long and hold the partial counts
        __shared__ int cache[threadsPerBlock];
        int cidx = threadIdx.x;
        int tid = threadIdx.x + blockIdx.x*blockDim.x;
        int cnt_tmp=0;
        while(tid<N){
            if(l[tid]!=0)
                cnt_tmp++;
            tid+=blockDim.x*gridDim.x;
        }
        cache[cidx]=cnt_tmp;
        __syncthreads();
        //reduce
        int k =blockDim.x/2;
        while(k!=0){
            if(threadIdx.x<k)
                cache[cidx] += cache[cidx];
            __syncthreads();
            k/=2;
        }
        if(cidx==0)
            cntl[blockIdx.x] = cache[0];
    }
The host code then collects the cntl results and finishes summation. This is going to be part of a larger project where the data is already on the GPU, so it makes sense to do the computations there, if they work correctly.
Comments (2)
You can count the nonzero values with a single line of code using Thrust (http://code.google.com/p/thrust/). Here's a code snippet that counts the number of 1s in a device_vector. If your data does not live inside a device_vector, you can still use thrust::count by wrapping the raw pointers.
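The snippet this answer refers to is not preserved on this page. As a stand-in, here is a minimal sketch of what such a Thrust call could look like. The helper names count_ones and count_nonzero_raw and the is_nonzero functor are hypothetical, introduced only for illustration. thrust::count matches one specific value, so the sketch uses thrust::count_if with a predicate for the "any nonzero value" case.

```cuda
#include <thrust/device_vector.h>
#include <thrust/device_ptr.h>
#include <thrust/count.h>
#include <vector>

// Hypothetical predicate: true for any nonzero element.
struct is_nonzero {
    __host__ __device__ bool operator()(int x) const { return x != 0; }
};

// Count the elements equal to 1 in a device_vector.
int count_ones(const thrust::device_vector<int>& v) {
    return static_cast<int>(thrust::count(v.begin(), v.end(), 1));
}

// Count nonzero elements behind a raw device pointer by wrapping it
// in a thrust::device_ptr, so Thrust knows the data lives on the GPU.
int count_nonzero_raw(const int* d_data, int n) {
    thrust::device_ptr<const int> p = thrust::device_pointer_cast(d_data);
    return static_cast<int>(thrust::count_if(p, p + n, is_nonzero()));
}
```

If the vector only ever holds 0 and 1, the two counts agree and the plain thrust::count call is enough.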

In your reduction you're doing:

    cache[cidx] += cache[cidx];

Don't you want to be poking at the other half of the block's local values?
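To make that concrete: adding cache[cidx] to itself doubles each thread's own partial count at every step, so with a power-of-two block size cache[0] ends up as thread 0's count multiplied by blockDim.x. Below is a sketch of the kernel with the reduction reading from the other half of the block instead. It assumes threadsPerBlock is a compile-time power of two (256 is an arbitrary choice here). The host wrapper count_nonzero and its cap of 64 blocks are hypothetical, added only so the example is self-contained. Error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Assumed block size; the halving loop below needs a power of two.
constexpr int threadsPerBlock = 256;

__global__ void count_reduce_logical(const int* l, int* cntl, int N) {
    __shared__ int cache[threadsPerBlock];
    int cidx = threadIdx.x;
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    // Grid-stride loop: each thread counts its share of nonzero elements.
    int cnt_tmp = 0;
    while (tid < N) {
        if (l[tid] != 0) cnt_tmp++;
        tid += blockDim.x * gridDim.x;
    }
    cache[cidx] = cnt_tmp;
    __syncthreads();

    // Tree reduction: the lower half adds in the value held k slots above it.
    for (int k = blockDim.x / 2; k != 0; k /= 2) {
        if (cidx < k) cache[cidx] += cache[cidx + k];
        __syncthreads();  // reached by every thread, outside the if
    }
    if (cidx == 0) cntl[blockIdx.x] = cache[0];
}

// Hypothetical host wrapper: copies data in, launches the kernel, and
// finishes the summation of the per-block counts on the host.
int count_nonzero(const std::vector<int>& h) {
    int N = static_cast<int>(h.size());
    if (N == 0) return 0;
    int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;
    if (blocks > 64) blocks = 64;  // arbitrary cap; the grid-stride loop covers the rest

    int* d_l = nullptr;
    int* d_cntl = nullptr;
    cudaMalloc(&d_l, N * sizeof(int));
    cudaMalloc(&d_cntl, blocks * sizeof(int));
    cudaMemcpy(d_l, h.data(), N * sizeof(int), cudaMemcpyHostToDevice);

    count_reduce_logical<<<blocks, threadsPerBlock>>>(d_l, d_cntl, N);

    std::vector<int> partial(blocks);
    cudaMemcpy(partial.data(), d_cntl, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_l);
    cudaFree(d_cntl);

    int total = 0;
    for (int c : partial) total += c;
    return total;
}
```

The kernel must be launched with blockDim.x equal to threadsPerBlock, because the shared array is sized by that constant.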