Writing a CUDA kernel where multiple threads write to the same index?
I'm writing some code for activating neural networks on CUDA, and I'm running into an issue. I'm not getting the correct summation of the weights going into a given neuron.
So here is the kernel code, and I'll try to explain the variables a bit more clearly below.
__global__ void kernelSumWeights(float* sumArray, float* weightArray, int2* sourceTargetArray, int cLength)
{
    // Flatten the 2D thread/block coordinates into one global connection index.
    int nx = threadIdx.x + TILE_WIDTH*threadIdx.y;
    int index_in = (blockIdx.x + gridDim.x*blockIdx.y)*TILE_WIDTH*TILE_WIDTH + nx;
    if(index_in < cLength)
    {
        // Accumulate |weight| into the connection's target neuron slot.
        sumArray[sourceTargetArray[index_in].y] += fabs(weightArray[index_in]);
        //__threadfence();
        __threadfence_block();
    }
}
First off, the number of connections in the network is cLength. For every connection there is a source neuron, a target neuron, and a weight for that connection. sourceTargetArray contains that information: element i of sourceTargetArray holds the source neuron index (.x) and the target neuron index (.y) for connection i. weightArray contains the weight information, so index i of weightArray corresponds to connection i.
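For concreteness, here is a small hypothetical example of that layout (the values are made up purely for illustration):

// Connection i runs from sourceTargetArray[i].x to sourceTargetArray[i].y
// and carries weight weightArray[i].
int2  sourceTargetArray[3] = { {0, 2}, {1, 2}, {0, 1} };
float weightArray[3]       = { 0.5f, -0.25f, 0.75f };
// Neuron 2 has two incoming connections, so sumArray[2] should end up
// as |0.5| + |-0.25| = 0.75.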
As you can see, sumArray is where I'm storing the sums. The kernel increments sumArray (at the target neuron index of connection i) by the absolute value of the weight of connection i. Intuitively: for each neuron, sum the weights over all of its incoming connections. That's really all I'm trying to do with this kernel. Eventually, I'll use this sum to normalize the weights.
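For reference, the serial equivalent of what this kernel should compute is just a loop like:

// Host-side reference: accumulate |weight| per target neuron.
for (int i = 0; i < cLength; ++i)
    sumArray[sourceTargetArray[i].y] += fabsf(weightArray[i]);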
The problem is that it's wrong. I've done this serially, and the answer is different. The answers differ, usually by a factor of about 12-15x (so where the right answer is 700.0, I'm getting something in the 50s range).
You can see that I added __threadfence() (and __threadfence_block()) in an attempt to make sure that the writes weren't being done at the same time by every thread. I'm not sure if this is the problem with my code. I've ensured that the weight array is identical to the serial version I tested, and that the source/target information is identical as well. What am I doing wrong?
EDIT: For reference, __threadfence() usage is described in the CUDA Programming Guide v3.1, Appendix B.5, Memory Fence Functions.
2 Answers
+= is not atomic => not thread-safe. Use atomicAdd(). Also, you should avoid writing to the same memory cell: the problem is that those accesses get serialized, so the threads stand in line and wait for each other. If you can't avoid the operation, try to break your algorithm into two phases: individual computation and merging. Parallel merging can be implemented very efficiently.
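A minimal sketch of the kernel with the read-modify-write made atomic (same launch configuration as in the question; note that atomicAdd on float requires compute capability 2.0 or higher):

__global__ void kernelSumWeights(float* sumArray, float* weightArray, int2* sourceTargetArray, int cLength)
{
    int nx = threadIdx.x + TILE_WIDTH*threadIdx.y;
    int index_in = (blockIdx.x + gridDim.x*blockIdx.y)*TILE_WIDTH*TILE_WIDTH + nx;
    if(index_in < cLength)
    {
        // atomicAdd serializes concurrent updates to the same address,
        // so no increments are lost; no memory fence is needed for this.
        atomicAdd(&sumArray[sourceTargetArray[index_in].y], fabsf(weightArray[index_in]));
    }
}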
You need to do a reduction.
Sum the elements assigned to each thread and place the result in a shared array, cache[threadsPerBlock], then __syncthreads().
Now reduce the resulting subtotals by adding successive neighboring subtotals:
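A minimal sketch of that pattern (the name cache follows the description above; note this produces one total per block, so solving the question's per-neuron sums this way would additionally require grouping the connections by target neuron, i.e. a segmented reduction):

__global__ void kernelBlockSum(const float* weightArray, float* partialSums, int cLength)
{
    extern __shared__ float cache[];  // one float per thread, sized at launch

    // Phase 1: each thread sums its strided share of the elements.
    float sum = 0.0f;
    for (int i = blockIdx.x*blockDim.x + threadIdx.x; i < cLength; i += blockDim.x*gridDim.x)
        sum += fabsf(weightArray[i]);
    cache[threadIdx.x] = sum;
    __syncthreads();

    // Phase 2: tree reduction over the per-thread subtotals
    // (assumes blockDim.x is a power of two).
    for (int stride = blockDim.x/2; stride > 0; stride /= 2)
    {
        if (threadIdx.x < stride)
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        __syncthreads();
    }

    // Thread 0 writes this block's subtotal; the per-block results are
    // then summed on the host or in a second kernel launch.
    if (threadIdx.x == 0)
        partialSums[blockIdx.x] = cache[0];
}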
The following deck explains this in some detail:
http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/reduction/doc/reduction.pdf
Sample code for this is here:
http://www.nvidia.com/object/cuda_sample_data-parallel.html
It's also very well explained in "CUDA by Example".
There is one big caveat with this approach: the additions will not occur in the same order they would in serial code. Floating-point addition is not associative, so rounding errors may lead to slightly different results.
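A quick illustration of the ordering effect, with values chosen to make the rounding visible:

// Single-precision addition is not associative: grouping changes rounding.
float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
float left  = (a + b) + c;  // (0) + 1 == 1.0f
float right = a + (b + c);  // b + c rounds back to -1.0e8f, so result == 0.0f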