Efficient reduction of 2D arrays in CUDA?
In the CUDA SDK, there is example code and presentation slides for an efficient one-dimensional reduction. I have also seen several papers on and implementations of one-dimensional reductions and prefix scans in CUDA.
Is there efficient CUDA code available for a reduction of a dense two-dimensional array? Pointers to code or pertinent papers would be appreciated.
2 Answers
I don't know exactly what problem you are trying to solve, but you could simply treat the 2D array as one long 1D array and use the SDK code for the reduction. Plain arrays in CUDA are just 1D memory blocks with special addressing rules, so why not take advantage of that?
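As a minimal sketch of this flattening idea: a row-major `rows x cols` matrix allocated contiguously (pitch equal to `cols * sizeof(float)`) can be fed straight into an SDK-style 1D tree reduction over `n = rows * cols` elements. The kernel name and launch shape here are my own; this mirrors the classic shared-memory reduction pattern, not any specific SDK file.

```cuda
// Sketch: SDK-style 1D reduction applied to a flattened 2D array.
// Launch with blockDim.x a power of two and shared memory of
// blockDim.x * sizeof(float); each block writes one partial sum to out[],
// which is then reduced again (or summed on the host).
__global__ void reduceFlat(const float *in, float *out, unsigned int n)
{
    extern __shared__ float sdata[];
    unsigned int tid    = threadIdx.x;
    unsigned int i      = blockIdx.x * blockDim.x * 2 + threadIdx.x;
    unsigned int stride = blockDim.x * 2 * gridDim.x;

    // Grid-stride load: each thread accumulates two elements per iteration.
    float sum = 0.0f;
    while (i < n) {
        sum += in[i];
        if (i + blockDim.x < n)
            sum += in[i + blockDim.x];
        i += stride;
    }
    sdata[tid] = sum;
    __syncthreads();

    // Tree reduction in shared memory.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = sdata[0];
}
```

One caveat: this only works if the rows really are contiguous. If the array was allocated with `cudaMallocPitch`, the padded bytes at the end of each row would be summed too, so a pitched matrix needs a row-aware kernel instead.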
A matrix reduction (to a vector rather than a scalar) may be somewhat simpler to implement, because each row/column can be reduced independently. You can let each thread handle one column or row (depending on the matrix's major-dimension orientation) and coalesce the memory reads. I doubt you can buy much performance beyond that without going to the texture/constant cache, where locality may become important.
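The one-thread-per-column scheme for a row-major matrix can be sketched as follows (kernel name and signature are my own). Because thread `t` reads column `t`, at every step of the row loop adjacent threads in a warp touch adjacent addresses, so the global loads coalesce:

```cuda
// Sketch: reduce each column of a row-major rows x cols matrix to one value.
// Launch with enough threads so blockDim.x * gridDim.x >= cols.
__global__ void reduceColumns(const float *in, float *out, int rows, int cols)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= cols)
        return;

    float sum = 0.0f;
    for (int r = 0; r < rows; ++r)
        sum += in[r * cols + col];  // warp reads one contiguous row segment: coalesced
    out[col] = sum;
}
```

Reducing each *row* of the same row-major matrix with one thread per row would make each thread stride through memory by `cols`, defeating coalescing; in that case it is usually better to assign a whole block (or warp) per row, or to reduce along the other axis first.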