Writing a CUDA kernel where multiple threads write to the same index?
I'm writing some code for activating neural networks on CUDA, and I'm running into an issue. I'm not getting the correct summation of the weights going into a given neuron.
So here is the kernel code, and I'll try to explain the variables a bit more clearly below.
__global__ void kernelSumWeights(float* sumArray, float* weightArray, int2* sourceTargetArray, int cLength)
{
    // Flatten the 2D thread/block coordinates into one global connection index.
    int nx = threadIdx.x + TILE_WIDTH*threadIdx.y;
    int index_in = (blockIdx.x + gridDim.x*blockIdx.y)*TILE_WIDTH*TILE_WIDTH + nx;
    if(index_in < cLength)
    {
        // Accumulate |weight| into the connection's target neuron slot.
        sumArray[sourceTargetArray[index_in].y] += fabs(weightArray[index_in]);
        //__threadfence();
        __threadfence_block();
    }
}
First off, the number of connections in the network is cLength. For every connection there is a source neuron, a target neuron, and a weight for that connection. sourceTargetArray contains that information: element i of sourceTargetArray holds the source neuron index (.x) and the target neuron index (.y) for connection i. weightArray contains the weight information, so index i of weightArray corresponds to connection i.
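For concreteness, here is a small hypothetical example of that layout (the values are made up purely for illustration):

// Connection i runs from sourceTargetArray[i].x to sourceTargetArray[i].y
// and carries weight weightArray[i].
int2  sourceTargetArray[3] = { {0, 2}, {1, 2}, {0, 1} };
float weightArray[3]       = { 0.5f, -0.25f, 0.75f };
// Neuron 2 has two incoming connections, so sumArray[2] should end up
// as |0.5| + |-0.25| = 0.75.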
As you can see, sumArray is where I'm storing the sums. The kernel increments sumArray (at the target neuron index of connection i) by the absolute value of the weight of connection i. Intuitively: for each neuron, sum the weights over all of its incoming connections. That's really all I'm trying to do with this kernel. Eventually, I'll use this sum to normalize the weights.
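For reference, the serial equivalent of what this kernel should compute is just a loop like:

// Host-side reference: accumulate |weight| per target neuron.
for (int i = 0; i < cLength; ++i)
    sumArray[sourceTargetArray[i].y] += fabsf(weightArray[i]);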
The problem is that it's wrong. I've done this serially, and the answer is different. The answers differ, usually by a factor of about 12-15x (so where the right answer is 700.0, I'm getting something in the 50s range).
You can see that I added __threadfence() (and __threadfence_block()) in an attempt to make sure that the writes weren't being done at the same time by every thread. I'm not sure if this is the problem with my code. I've ensured that the weight array is identical to the serial version I tested, and that the source/target information is identical as well. What am I doing wrong?
EDIT: For reference, __threadfence() usage is described in the CUDA Programming Guide v3.1, Appendix B.5, Memory Fence Functions.
2 Answers
+= is not atomic => not thread-safe. Use atomicAdd(). Also, you should avoid writing to the same memory cell: the problem is that those accesses get serialized, so the threads stand in line and wait for each other. If you can't avoid the operation, try to break your algorithm into two phases: individual computation and merging. Parallel merging can be implemented very efficiently.
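A minimal sketch of the kernel with the read-modify-write made atomic (same launch configuration as in the question; note that atomicAdd on float requires compute capability 2.0 or higher):

__global__ void kernelSumWeights(float* sumArray, float* weightArray, int2* sourceTargetArray, int cLength)
{
    int nx = threadIdx.x + TILE_WIDTH*threadIdx.y;
    int index_in = (blockIdx.x + gridDim.x*blockIdx.y)*TILE_WIDTH*TILE_WIDTH + nx;
    if(index_in < cLength)
    {
        // atomicAdd serializes concurrent updates to the same address,
        // so no increments are lost; no memory fence is needed for this.
        atomicAdd(&sumArray[sourceTargetArray[index_in].y], fabsf(weightArray[index_in]));
    }
}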
You need to do a reduction.
Sum the elements assigned to each thread and place the result in a shared array, cache[threadsPerBlock], then __syncthreads().
Now reduce the resulting subtotals by adding successive neighboring subtotals:
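A minimal sketch of that pattern (the name cache follows the description above; note this produces one total per block, so solving the question's per-neuron sums this way would additionally require grouping the connections by target neuron, i.e. a segmented reduction):

__global__ void kernelBlockSum(const float* weightArray, float* partialSums, int cLength)
{
    extern __shared__ float cache[];  // one float per thread, sized at launch

    // Phase 1: each thread sums its strided share of the elements.
    float sum = 0.0f;
    for (int i = blockIdx.x*blockDim.x + threadIdx.x; i < cLength; i += blockDim.x*gridDim.x)
        sum += fabsf(weightArray[i]);
    cache[threadIdx.x] = sum;
    __syncthreads();

    // Phase 2: tree reduction over the per-thread subtotals
    // (assumes blockDim.x is a power of two).
    for (int stride = blockDim.x/2; stride > 0; stride /= 2)
    {
        if (threadIdx.x < stride)
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        __syncthreads();
    }

    // Thread 0 writes this block's subtotal; the per-block results are
    // then summed on the host or in a second kernel launch.
    if (threadIdx.x == 0)
        partialSums[blockIdx.x] = cache[0];
}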
The following deck explains this in some detail:
http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/reduction/doc/reduction.pdf
Sample code for this is here:
http://www.nvidia.com/object/cuda_sample_data-parallel.html
It's also very well explained in "CUDA by Example".
There is one big caveat with this approach: the additions will not occur in the same order they would in serial code. Floating-point addition is not associative, so rounding errors may lead to slightly different results.
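A quick illustration of the ordering effect, with values chosen to make the rounding visible:

// Single-precision addition is not associative: grouping changes rounding.
float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
float left  = (a + b) + c;  // (0) + 1 == 1.0f
float right = a + (b + c);  // b + c rounds back to -1.0e8f, so result == 0.0f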