A small question about CUDA memory access
Hey there,
Assume I have a problem where each thread calculates something (reading some parameters out of constant memory and using them for the calculation) and then stores the result to a matrix in global memory. This matrix never gets read, it is only written to... Does it make any sense to first store all the calculated values in shared memory and then write them to global memory later? I think not, because the total number of writes to global memory stays the same, so the writes to shared memory just add to the writes I already had...
Thanks!
Comments (1)
There can be, depending on the access patterns in the kernel code. Using a shared memory buffer to "stage" output can be a useful way of ensuring writes are coalesced when the naive write would not be. This was pretty crucial for performance in the first couple of generations of CUDA compatible hardware (G80/G90). In newer hardware, the case for this is a lot less strong. Fermi cards have a pretty effective L1 and L2 cache scheme which can (within reason) get close to what used to be achievable only with shared memory, without any extra code.
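To make the "staging" idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration, not taken from the question: each thread is assumed to produce `K` consecutive values of a row-major matrix, and `params`, `compute`, `K`, `BLOCK`, and the kernel names are all hypothetical. In the naive kernel, neighbouring threads write addresses `K` floats apart, which is the sort of pattern that does not coalesce. The staged kernel writes the same values to a shared memory tile first and then has the whole block copy the tile out so that neighbouring threads hit neighbouring addresses.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

constexpr int K = 4;       // values produced per thread (assumed for illustration)
constexpr int BLOCK = 128; // threads per block; kernels must be launched with this size

__constant__ float params[K]; // hypothetical parameters kept in constant memory

// Hypothetical stand-in for the real per-thread calculation.
__device__ float compute(int row, int k) { return params[k] * row + k; }

// Naive version: for a fixed k, thread t writes out[t * K + k], so
// neighbouring threads write addresses K floats apart (strided writes).
__global__ void write_naive(float* out, int rows) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;
    for (int k = 0; k < K; ++k) out[row * K + k] = compute(row, k);
}

// Staged version: same values, but written to shared memory first.
__global__ void write_staged(float* out, int rows) {
    __shared__ float tile[BLOCK * K];
    int first_row = blockIdx.x * blockDim.x;
    int row = first_row + threadIdx.x;
    if (row < rows) {
        // The stride-K shared memory writes may cause bank conflicts;
        // padding the tile is a common refinement, left out here.
        for (int k = 0; k < K; ++k) tile[threadIdx.x * K + k] = compute(row, k);
    }
    __syncthreads(); // the whole tile must be filled before it is copied out

    // Cooperative copy: consecutive threads write consecutive addresses.
    int count = min(BLOCK, rows - first_row) * K;
    for (int i = threadIdx.x; i < count; i += blockDim.x)
        out[first_row * K + i] = tile[i];
}

// Host helper: runs both kernels and checks that they produce identical output.
bool staged_matches_naive(int rows) {
    assert(rows > 0);
    const float h_params[K] = {1.0f, 2.0f, 3.0f, 4.0f};
    cudaMemcpyToSymbol(params, h_params, sizeof(h_params));

    size_t bytes = size_t(rows) * K * sizeof(float);
    float *d_a = nullptr, *d_b = nullptr;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);

    int grid = (rows + BLOCK - 1) / BLOCK;
    write_naive<<<grid, BLOCK>>>(d_a, rows);
    write_staged<<<grid, BLOCK>>>(d_b, rows);

    std::vector<float> a(size_t(rows) * K), b(size_t(rows) * K);
    cudaMemcpy(a.data(), d_a, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(b.data(), d_b, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    return a == b;
}
```

Both kernels issue the same number of global writes, which is the point made in the question. What changes is how those writes are grouped into memory transactions. Whether the staged version is actually faster on a given card is something to measure with a profiler rather than assume.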
There isn't really a general answer to this question, because it depends a lot on the specifics of what any given code does and on what target hardware it is expected to run well on.