Shared memory optimization confusion
I have written an application in CUDA which uses 1 KB of shared memory in each block.
Since there is only 16 KB of shared memory per SM, only 16 blocks can be accommodated overall, right? And although only 8 can be scheduled at a time, if some block is busy doing memory operations, another block will be scheduled on the GPU, yet all the shared memory is already in use by the 16 blocks that were scheduled there.
So will CUDA not schedule more blocks on the same SM until the previously allocated blocks have completely finished?
Or will it move some block's shared memory to global memory and allocate other blocks there? In that case, should we worry about global memory access latency?
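The question doesn't include any code, but a minimal sketch of the situation described might look like the kernel below. The kernel name, the 256-thread block size, and the copy workload are all assumptions chosen so that 256 floats come out to exactly 1 KB of static shared memory per block.

    // Assumes a launch with blockDim.x == 256, so buf has one slot per thread.
    __global__ void copy_via_smem(const float *in, float *out, int n)
    {
        __shared__ float buf[256];           // 256 floats = 1 KB of shared memory per block
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n)
            buf[threadIdx.x] = in[idx];      // stage each element through shared memory
        __syncthreads();
        if (idx < n)
            out[idx] = buf[threadIdx.x];
    }

It is this per-block 1 KB allocation that the scheduler has to account for when deciding how many blocks can be resident on an SM at once.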
1 Answer
It does not work like that. The number of blocks which will be scheduled to run at any given moment on a single SM will always be the minimum of the following:

1. 8 blocks (the hardware limit on resident blocks per SM for that generation of GPU);
2. the number of blocks whose combined shared memory usage (static plus dynamically allocated) fits in the SM's shared memory;
3. the number of blocks whose combined register usage fits in the SM's register file.

For the numbers in the question, that is min(8, 16 KB / 1 KB = 16, register limit) = 8, assuming registers are not the bottleneck: with only 1 KB per block, it is the 8-block hardware limit, not shared memory, that caps residency.

That is all there is to it. There is no "paging" of shared memory to accommodate more blocks. NVIDIA produces a spreadsheet for computing occupancy which ships with the toolkit and is available as a separate download. You can see the exact rules in the formulas it contains. They are also discussed in section 4.2 of the CUDA programming guide.
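The spreadsheet was the original tool for this; later toolkits (CUDA 6.5 and newer) also expose the same minimum-of-all-limits calculation through the runtime API, so the resident block count can be queried programmatically. A sketch, reusing the hypothetical 1 KB kernel from the question above:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Same shape as the sketch above: 1 KB of static shared memory per block.
    __global__ void copy_via_smem(const float *in, float *out, int n)
    {
        __shared__ float buf[256];
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n) buf[threadIdx.x] = in[idx];
        __syncthreads();
        if (idx < n) out[idx] = buf[threadIdx.x];
    }

    int main()
    {
        // Ask the runtime how many blocks of this kernel can be resident on one SM.
        int numBlocks = 0;
        cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &numBlocks, copy_via_smem, /*blockSize=*/256, /*dynamicSMemSize=*/0);
        if (err != cudaSuccess) {
            fprintf(stderr, "occupancy query failed: %s\n", cudaGetErrorString(err));
            return 1;
        }
        printf("Resident blocks per SM: %d\n", numBlocks);
        return 0;
    }

On the 16 KB / 8-block parts the question describes, this would report 8: shared memory alone would allow 16 resident blocks, but the hardware block limit is reached first.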