CUDA: reduction or atomic operations?
I'm writing a CUDA kernel which involves calculating the maximum value of a given matrix, and I'm evaluating possibilities. The best way I could find is:
Forcing every thread to store a value in shared memory and using a reduction algorithm after that to determine the maximum (pro: minimal divergence; con: shared memory is limited to 48 KB on 2.0 devices)
I couldn't use atomic operations, because there are both a read and a write operation involved, so the threads could not be synchronized by __syncthreads().
Any other ideas come to mind?
You may also want to use the reduction routines that come with CUDA Thrust, which is a part of CUDA 4.0 or available here.
The library was written by a pair of NVIDIA engineers and compares favorably with heavily hand-optimized code. I believe there is also some auto-tuning of grid/block size going on.
You can interface with your own kernel easily by wrapping your raw device pointers.
This is strictly from a rapid integration point of view. For the theory, see tkerwin's answer.
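To give a sense of how little code the Thrust route takes, here is a minimal sketch (the function name and the use of a `device_vector` are my own illustration, not from the original answer):

```cuda
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <limits>

// Sketch: reduce a contiguously stored matrix to its maximum with Thrust.
// Thrust handles the grid/block configuration internally.
float matrix_max(const thrust::device_vector<float>& d_matrix)
{
    return thrust::reduce(d_matrix.begin(), d_matrix.end(),
                          -std::numeric_limits<float>::max(),  // identity for max
                          thrust::maximum<float>());
}
```

Wrapping a raw device pointer with `thrust::device_ptr<float>` lets you feed the output of your own kernels into the same call.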
This is the usual way to perform reductions in CUDA
Within each block,
1) Keep a running reduced value in shared memory for each thread. Hence each thread will read n values from global memory (I personally favor between 16 and 32) and update the reduced value from these.
2) Perform the reduction algorithm within the block to get one final reduced value per block.
This way you will not need more shared memory than (number of threads) * sizeof(datatype) bytes.
Since each block produces one reduced value, you will need to perform a second reduction pass to get the final value.
For example, if you are launching 256 threads per block, and are reading 16 values per thread, you will be able to reduce (256 * 16 = 4096) elements per block.
So given 1 million elements, you will need to launch around 250 blocks in the first pass, and just one block in the second.
You will probably need a third pass for cases when the number of elements > (4096)^2 for this configuration.
You will have to take care that the global memory reads are coalesced. You cannot coalesce the global memory writes, but that is one performance hit you need to take.
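A sketch of the first pass described above, using the example figures of 256 threads and 16 reads per thread (the kernel and buffer names are placeholders; reads are strided by the block size so they stay coalesced):

```cuda
#include <float.h>

#define THREADS 256     // threads per block
#define PER_THREAD 16   // values read per thread, as suggested above

// First pass: each block reduces THREADS * PER_THREAD elements to one value.
// Launching this kernel again over 'blockMax' performs the second pass.
__global__ void blockMaxKernel(const float* in, float* blockMax, int n)
{
    __shared__ float smem[THREADS];

    // 1) Running maximum per thread; consecutive threads touch consecutive
    //    addresses on every iteration, so the reads coalesce.
    float m = -FLT_MAX;
    int base = blockIdx.x * THREADS * PER_THREAD + threadIdx.x;
    for (int i = 0; i < PER_THREAD; ++i) {
        int idx = base + i * THREADS;
        if (idx < n) m = fmaxf(m, in[idx]);
    }
    smem[threadIdx.x] = m;
    __syncthreads();

    // 2) Tree reduction in shared memory down to one value per block.
    for (int s = THREADS / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            smem[threadIdx.x] = fmaxf(smem[threadIdx.x], smem[threadIdx.x + s]);
        __syncthreads();
    }
    if (threadIdx.x == 0) blockMax[blockIdx.x] = smem[0];
}
```

Note the shared-memory footprint is exactly THREADS * sizeof(float) = 1 KB here, far below the 48 KB limit mentioned in the question.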
NVIDIA has a CUDA demo that does reduction: here. There's a whitepaper that goes along with it that explains some motivations behind the design.
I found this document very useful for learning the basics of parallel reduction with CUDA. It's kind of old, so there must be additional tricks to boost performance further.
Actually, the problem you described is not really about matrices. The two-dimensional view of the input data is not significant (assuming the matrix data is laid out contiguously in memory). It's just a reduction over a sequence of values, being all matrix elements in whatever order they appear in memory.
Assuming the matrix representation is contiguous in memory, you just want to perform a simple reduction. And the best available implementation these days - as far as I can tell - is the excellent libcub by NVIDIA's Duane Merrill. Here is the documentation on its device-wide maximum-calculating function.
Note, though, that unless the matrix is small, for most of the computation it will simply be threads reading data and updating their own thread-specific maximum. Only when a thread has finished reading through a large swath of the matrix (or rather, a large strided swath) will it write its local maximum anywhere - typically into shared memory for a block-level reduction. And as for atomics, you will probably be making an atomicMax() call once every obscenely large number of matrix element reads - tens of thousands if not more.
The atomicAdd function could also be used, but it is much less efficient than the approaches mentioned above. http://supercomputingblog.com/cuda/cuda-tutorial-4-atomic-operations/
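If atomics are used at all, the usual compromise is one atomic per block rather than one per element. A hedged sketch of that pattern (integer data, since atomicMax() has no native float overload; names are placeholders):

```cuda
#include <limits.h>

#define THREADS 256

// Each block reduces its tile in shared memory, then issues a single
// atomicMax() on the global result - one atomic per block, not per element.
__global__ void maxWithAtomics(const int* in, int* result, int n)
{
    __shared__ int smem[THREADS];
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    smem[threadIdx.x] = (idx < n) ? in[idx] : INT_MIN;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            smem[threadIdx.x] = max(smem[threadIdx.x], smem[threadIdx.x + s]);
        __syncthreads();
    }
    if (threadIdx.x == 0) atomicMax(result, smem[0]);
}
```

Here *result* must be initialized to INT_MIN before launch.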
If you have a K20 or Titan, I suggest dynamic parallelism: launch a single-thread kernel, which launches #items worker kernel threads to produce the data, then launches #items/first-round-reduction-factor threads for the first round of reduction, and keeps launching until the result comes out.
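A rough sketch of that launch chain (requires compiling with -rdc=true for sm_35 or later; the kernels, names, and reduction factor are placeholders, and note that device-side cudaDeviceSynchronize() has been deprecated in recent CUDA toolkits):

```cuda
#define FACTOR 32  // placeholder first-round reduction factor

__global__ void producer(float* data, int n);                // worker kernel producing the data (assumed)
__global__ void reduceRound(const float* in, float* out, int n);  // one reduction round (assumed)

// Single-thread parent kernel: produce the data, then keep launching
// reduction rounds from the device until one value remains.
__global__ void parent(float* data, float* scratch, int n)
{
    producer<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();
    while (n > 1) {
        int m = (n + FACTOR - 1) / FACTOR;  // outputs of this round
        reduceRound<<<(m + 255) / 256, 256>>>(data, scratch, n);
        cudaDeviceSynchronize();
        float* t = data; data = scratch; scratch = t;  // ping-pong buffers
        n = m;
    }
}
```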