CUDA shared memory array - strange behavior

Posted on 2024-07-26 03:02:58

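In a CUDA kernel, I have code similar to the following. I am trying to calculate one numerator per thread, accumulate the numerators over the block to get a denominator, and then return the ratio. However, denom ends up equal to whatever value of numer was calculated by the thread with the largest threadIdx.x in the block, rather than the sum of the numer values calculated by all the threads in the block. Does anyone know what is going on?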

extern __shared__ float s_shared[];

float numer = //calculate numerator

s_shared[threadIdx.x] = numer;
s_shared[blockDim.x] += numer;
__syncthreads();

float denom = s_shared[blockDim.x];
float result = numer/denom;

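result should always be between 0 and 1 and should sum to 1 across the block. Instead, it equals 1.0 for every thread whose threadIdx.x is the maximum in its block, and for the other threads in the block it takes some other value that is not confined to that range.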
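One likely explanation for the symptom described above: `s_shared[blockDim.x] += numer` is a non-atomic read-modify-write that every thread in the block performs on the same shared-memory slot, so the updates race and typically only one thread's write survives. The snippet also never initializes that slot before the `+=`. Below is a minimal sketch of one common fix, a tree reduction in shared memory. The names `normalize_kernel`, `normalize_blocks`, `numer_in`, and `result_out` are hypothetical. Reading the numerator from an input array stands in for the elided calculation. The sketch assumes blockDim.x is a power of two and that the input length is a multiple of the block size.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: numer_in stands in for the elided
// "calculate numerator" step in the question.
__global__ void normalize_kernel(const float* numer_in, float* result_out)
{
    extern __shared__ float s_shared[];  // blockDim.x floats, sized at launch

    const unsigned int tid = threadIdx.x;
    const unsigned int gid = blockIdx.x * blockDim.x + tid;
    const float numer = numer_in[gid];

    // Each thread writes only its own slot, so there is no race here.
    s_shared[tid] = numer;
    __syncthreads();  // all numerators are visible before the reduction starts

    // Tree reduction; assumes blockDim.x is a power of two.
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            s_shared[tid] += s_shared[tid + stride];
        }
        __syncthreads();  // outside the if, so every thread reaches the barrier
    }

    const float denom = s_shared[0];  // block-wide sum of the numerators
    result_out[gid] = numer / denom;
}

// Hypothetical host helper; error checking is omitted for brevity.
// Assumes numer.size() is a multiple of block_size.
std::vector<float> normalize_blocks(const std::vector<float>& numer, int block_size)
{
    const size_t n = numer.size();
    const size_t bytes = n * sizeof(float);

    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, numer.data(), bytes, cudaMemcpyHostToDevice);

    const unsigned int blocks = static_cast<unsigned int>(n / block_size);
    const size_t shared_bytes = block_size * sizeof(float);
    normalize_kernel<<<blocks, block_size, shared_bytes>>>(d_in, d_out);

    std::vector<float> result(n);
    cudaMemcpy(result.data(), d_out, bytes, cudaMemcpyDeviceToHost);  // waits for the kernel

    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

Calling `normalize_blocks` with eight values and a block size of 4 normalizes each group of four independently, so each block's results sum to 1. An alternative that stays closer to the original code is to have one thread zero the accumulator slot, call `__syncthreads()`, and then have every thread add its numerator with `atomicAdd`. That serializes the updates but removes the race.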
