Is local memory slower than shared memory in CUDA?
I only found a remark that local memory is slower than register memory (these being the two per-thread memory types).
Shared memory is supposed to be fast, but is it faster than local memory [of the thread]?
What I want to do is kind of a median filter, but with a given percentile instead of the median. Thus I need to take chunks of the list, sort them, and then pick a suitable one. But I can't start sorting the shared memory list or things go wrong. Will I lose a lot of performance by just copying to local memory?
Comments (1)
Local memory is just thread-local global memory. It is much, much slower (in terms of both bandwidth and latency) than either registers or shared memory. It also consumes memory controller bandwidth that would otherwise be available for global memory transactions. The performance impact of spilling to local memory, or of using it deliberately, can be anything from minor to severe, depending on the hardware you are using and how the local memory is used.
According to Vasily Volkov's research (see "Better Performance at Lower Occupancy" (PDF)), there is about a factor of 8 difference in effective bandwidth between shared memory and registers on Fermi GPUs (about 1000 GB/s for shared memory and 8000 GB/s for registers). This somewhat contradicts the CUDA documentation, which implies that shared memory is comparable in speed to registers.
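For the percentile filter in the question, one possible approach is sketched below. It is only an illustration under stated assumptions, not a tuned implementation: a 1D `float` signal, a small compile-time window, borders clamped to the edge values, and no CUDA error checking. The names `RADIUS`, `WINDOW`, `BLOCK`, `percentile_filter` and `percentile_filter_host` are made up for this example. Each block stages its tile in shared memory, and each thread copies its own window into a private array before sorting, so the shared tile is never modified. Because the array size and all loop bounds are compile-time constants, the compiler has a chance to keep the private copy in registers rather than local memory, though that is not guaranteed.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical parameters for this sketch.
constexpr int RADIUS = 3;               // window covers RADIUS samples on each side
constexpr int WINDOW = 2 * RADIUS + 1;  // samples per window
constexpr int BLOCK  = 128;             // threads per block

__global__ void percentile_filter(const float* in, float* out, int n, int rank)
{
    // Tile of input samples plus a halo of RADIUS samples on each side.
    __shared__ float tile[BLOCK + 2 * RADIUS];

    int base = blockIdx.x * BLOCK - RADIUS;      // global index of tile[0]
    for (int i = threadIdx.x; i < BLOCK + 2 * RADIUS; i += BLOCK) {
        int g = min(max(base + i, 0), n - 1);    // clamp at the borders
        tile[i] = in[g];
    }
    __syncthreads();

    int gid = blockIdx.x * BLOCK + threadIdx.x;
    if (gid >= n) return;

    // Private copy of this thread's window; the shared tile stays read-only.
    float w[WINDOW];
    #pragma unroll
    for (int k = 0; k < WINDOW; ++k) w[k] = tile[threadIdx.x + k];

    // Odd-even transposition sort: WINDOW passes of compare-exchange steps
    // with compile-time bounds, so every index is static after unrolling.
    #pragma unroll
    for (int pass = 0; pass < WINDOW; ++pass) {
        #pragma unroll
        for (int k = pass & 1; k + 1 < WINDOW; k += 2) {
            float lo = fminf(w[k], w[k + 1]);
            float hi = fmaxf(w[k], w[k + 1]);
            w[k] = lo;
            w[k + 1] = hi;
        }
    }

    // Select the requested rank without indexing w[] by a runtime value.
    float r = w[0];
    #pragma unroll
    for (int k = 1; k < WINDOW; ++k) if (k == rank) r = w[k];
    out[gid] = r;
}

// Host wrapper; pct is a fraction in [0, 1] (0.5 gives the median).
std::vector<float> percentile_filter_host(const std::vector<float>& h_in, float pct)
{
    int n = static_cast<int>(h_in.size());
    std::vector<float> h_out(n);
    if (n == 0) return h_out;

    int rank = static_cast<int>(pct * (WINDOW - 1) + 0.5f);  // nearest rank
    if (rank < 0) rank = 0;
    if (rank > WINDOW - 1) rank = WINDOW - 1;

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int blocks = (n + BLOCK - 1) / BLOCK;
    percentile_filter<<<blocks, BLOCK>>>(d_in, d_out, n, rank);

    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

Calling `percentile_filter_host(signal, 0.9f)` would return the signal filtered at the 90th percentile of each window. To see whether the private array actually ended up in registers or in local memory, compile with `nvcc -Xptxas -v` and look at the reported stack frame and spill sizes. For larger windows the array will almost certainly land in local memory, and the cost then depends on the hardware and access pattern, as described above.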