CUDA shared memory array - strange behavior

Posted on 2024-07-26 03:02:58

In a CUDA kernel, I have code similar to the following. I am trying to calculate one numerator per thread, and accumulate the numerators over the block to calculate a denominator, and then return the ratio. However, CUDA is setting the value of denom to whatever value is calculated for numer by the thread in the block with the largest threadIdx.x, rather than the sum of the numer value calculated across all the threads in the block. Does anyone know what is going on?

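extern __shared__ float s_shared[];  // indexed up to blockDim.x, so it needs at least blockDim.x + 1 floats

float numer = /* calculate numerator */;

s_shared[threadIdx.x] = numer;
s_shared[blockDim.x] += numer;  // every thread in the block updates this one slot
__syncthreads();

float denom = s_shared[blockDim.x];
float result = numer / denom;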

result should always be between 0 and 1 and should sum to 1 across the block. Instead, it equals 1.0 for every thread whose threadIdx.x is the largest in its block, and the other threads in the block get values that are not confined to that range.

Comments (1)

她说她爱他 2024-08-02 03:02:58

You're not synchronizing the summing properly to the blockDim.x location. None of the threads are waiting to see what others have written before adding their sum. Sort of like

  1. Everyone reads zero,
  2. goes home and calculates zero + numer,
  3. then everyone writes zero + numer to the same memory location.

I suppose the highest threadIdx.x wins because that thread is likely to write last.
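One way to watch the race disappear, as a minimal sketch and not the approach recommended in this answer, is to zero the accumulator slot, synchronize, and let each thread add its value with atomicAdd, which performs the read-modify-write as one indivisible step. The names normalize_atomic, run_normalize_atomic, numer_in and result are hypothetical, reading the numerator from an input array stands in for the asker's calculation, and the sketch assumes a single block of at most 1024 threads on a device that supports float atomicAdd on shared memory (compute capability 2.0 or later). Error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: divides each thread's numerator by the block-wide sum.
// Launch with (blockDim.x + 1) * sizeof(float) bytes of dynamic shared memory.
// Assumes the input length is a multiple of blockDim.x (no bounds check).
__global__ void normalize_atomic(const float *numer_in, float *result)
{
    extern __shared__ float s_shared[];

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float numer = numer_in[idx];  // stand-in for "calculate numerator"

    // Shared memory is not zeroed automatically, so one thread clears the slot.
    if (threadIdx.x == 0) {
        s_shared[blockDim.x] = 0.0f;
    }
    __syncthreads();  // every thread must see the zero before adding

    s_shared[threadIdx.x] = numer;            // kept to mirror the question's layout
    atomicAdd(&s_shared[blockDim.x], numer);  // indivisible read-modify-write
    __syncthreads();  // all additions are complete before anyone reads the sum

    float denom = s_shared[blockDim.x];
    result[idx] = numer / denom;
}

// Hypothetical host helper: runs one block with one thread per input element.
std::vector<float> run_normalize_atomic(const std::vector<float> &h_in)
{
    int n = static_cast<int>(h_in.size());
    size_t bytes = n * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    normalize_atomic<<<1, n, (n + 1) * sizeof(float)>>>(d_in, d_out);

    std::vector<float> h_out(n);
    // This copy waits for the kernel to finish before reading the results.
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

The atomic version is correct but serializes the additions to the single slot, so it is mainly useful for confirming the diagnosis on small blocks.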

What you want to do instead, in order to get a quick sum, is a binary (tree) sum over s_shared[threadIdx.x], with a __syncthreads() between steps:

  1. everyone writes their numer
  2. half the threads calculate sums of pairs and write those to a new location
  3. a quarter of the threads calculate the sums of pairs of pairs, and write those to a new location
  4. and so on, until just one thread holds one sum

This takes O(n) work and O(log n) time.
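The binary sum might be sketched as follows. This is an illustration under stated assumptions, not a tuned implementation: normalize_reduce, run_normalize_reduce, numer_in and result are hypothetical names, reading the numerator from an input array stands in for the asker's calculation, blockDim.x is assumed to be a power of two, and the input length is assumed to be a multiple of blockDim.x. Unlike the description above, the partial sums are written in place into the lower half of the array instead of a new location, which is safe here because each thread keeps its own numer in a register. Error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: tree reduction in shared memory, then a per-thread ratio.
// Launch with blockDim.x * sizeof(float) bytes of dynamic shared memory.
__global__ void normalize_reduce(const float *numer_in, float *result)
{
    extern __shared__ float s_shared[];

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float numer = numer_in[idx];  // stand-in for "calculate numerator"

    // Step 1: every thread publishes its numerator.
    s_shared[threadIdx.x] = numer;
    __syncthreads();

    // Remaining steps: halve the number of active threads each round.
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) {
            s_shared[threadIdx.x] += s_shared[threadIdx.x + stride];
        }
        __syncthreads();  // outside the if, so every thread reaches the barrier
    }

    float denom = s_shared[0];  // block-wide sum
    result[idx] = numer / denom;
}

// Hypothetical host helper: one thread per element, threads_per_block per block.
std::vector<float> run_normalize_reduce(const std::vector<float> &h_in,
                                        int threads_per_block)
{
    int n = static_cast<int>(h_in.size());
    size_t bytes = n * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    int blocks = n / threads_per_block;
    size_t shared_bytes = threads_per_block * sizeof(float);
    normalize_reduce<<<blocks, threads_per_block, shared_bytes>>>(d_in, d_out);

    std::vector<float> h_out(n);
    // This copy waits for the kernel to finish before reading the results.
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

Calling run_normalize_reduce with eight values and threads_per_block = 4 normalizes each group of four independently, so the results within each block sum to 1.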
