CUDA shared memory array - odd behavior
In a CUDA kernel, I have code similar to the following. I am trying to calculate one numerator per thread, accumulate the numerators over the block to get a denominator, and then return the ratio. However, CUDA is setting denom to whatever value of numer was calculated by the thread in the block with the largest threadIdx.x, rather than to the sum of the numer values calculated across all the threads in the block. Does anyone know what is going on?
extern __shared__ float s_shared[];
float numer = //calculate numerator
s_shared[threadIdx.x] = numer;
s_shared[blockDim.x] += numer;
__syncthreads();
float denom = s_shared[blockDim.x];
float result = numer/denom;
result should always be between 0 and 1 and should sum to 1 across the block. Instead it is equal to 1.0 for every thread where threadIdx.x is the maximum, and some other value not confined to that range for the other threads in the block.
Comments (1)
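You're not synchronizing the summation into the s_shared[blockDim.x] location properly. None of the threads wait to see what the others have written before adding their own value. It goes roughly like this: every thread reads the same initial value from that location, adds its own numer, and writes the result back to the memory location, overwriting whatever the other threads wrote. The thread with the highest threadIdx.x wins because it has a high likelihood of acting last, I suppose.
What you want to do instead, in order to get a quick sum, is a binary (tree) sum over s_shared[threadIdx.x]: every thread writes its numer, then half the threads each add a pair of values, then a quarter of the threads add pairs of those partial sums, and so on until a single thread holds the single total.
This takes O(n) work and O(log n) time.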
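As a rough illustration of the binary sum described above, here is a minimal sketch rather than a tuned implementation. The kernel name normalizeByBlockSum, the in/out arrays, and the host-side test values are all hypothetical; reading in[gid] merely stands in for the question's "calculate numerator" step. The sketch assumes blockDim.x is a power of two and that the launch supplies blockDim.x * sizeof(float) bytes of dynamic shared memory.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread contributes one numerator, the block sums
// them with a tree reduction in shared memory, and each thread divides its
// numerator by the block total.
// Assumption: blockDim.x is a power of two.
__global__ void normalizeByBlockSum(const float *in, float *out)
{
    extern __shared__ float s_shared[];

    const unsigned int tid = threadIdx.x;
    const unsigned int gid = blockIdx.x * blockDim.x + tid;

    const float numer = in[gid];   // stand-in for "calculate numerator"
    s_shared[tid] = numer;
    __syncthreads();               // all numerators are written before summing

    // Each pass halves the number of active threads: pairs, pairs of pairs, ...
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            s_shared[tid] += s_shared[tid + stride];
        __syncthreads();           // kept outside the if so every thread reaches it
    }

    const float denom = s_shared[0];   // block total ends up in element 0
    out[gid] = numer / denom;
}

int main()
{
    const int threads = 256, blocks = 4, n = threads * blocks;
    std::vector<float> h_in(n), h_out(n);
    for (int i = 0; i < n; ++i)
        h_in[i] = 1.0f + (i % 7);      // arbitrary positive test values

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Third launch parameter: bytes of dynamic shared memory per block.
    normalizeByBlockSum<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_out);

    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Within each block the ratios should add up to roughly 1.
    for (int b = 0; b < blocks; ++b) {
        float sum = 0.0f;
        for (int t = 0; t < threads; ++t)
            sum += h_out[b * threads + t];
        std::printf("block %d: sum of results = %f\n", b, sum);
    }

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

The two things that differ from the code in the question are the __syncthreads() after the initial write and the one after every reduction pass. No thread ever does a read-modify-write on a location that another thread is updating in the same pass, so there is nothing left to race on.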