Shared memory optimization confusion

Posted 2024-10-31 04:03:36

I have written an application in CUDA which uses 1 KB of shared memory in each block.
Since there is only 16 KB of shared memory per SM, only 16 blocks can be accommodated overall, right? Though only 8 can be scheduled at a time. So if some block is busy doing memory operations, will another block be scheduled on the SM, even though all the shared memory is used by the 16 blocks that have already been scheduled there?

So will CUDA not schedule more blocks on the same SM unless the previously allocated blocks have completely finished?

Or will it move some block's shared memory to global memory and allocate another block there? In that case, should we worry about global memory access latency?
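For concreteness, a kernel whose blocks each statically reserve 1 KB of shared memory, as described in the question, might look like the following sketch (the kernel and its names are illustrative, not from the original post; it assumes a block size of 256 threads):

```cuda
// Illustrative kernel: each block statically reserves 1 KB of shared memory.
__global__ void scale(const float *in, float *out, float factor)
{
    __shared__ float tile[256];   // 256 * sizeof(float) = 1 KB per block

    unsigned tid = threadIdx.x;
    unsigned gid = blockIdx.x * blockDim.x + tid;

    tile[tid] = in[gid];          // stage the element through shared memory
    __syncthreads();
    out[gid] = tile[tid] * factor;
}
```

The 1 KB here is a static, per-block allocation, so the hardware must reserve it for every block resident on the SM at once.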

Comments (1)

羁拥 2024-11-07 04:03:36

It does not work like that. The number of blocks which will be scheduled to run at any given moment on a single SM will always be the minimum of the following:

  1. 8 blocks
  2. The number of blocks whose total static and dynamically allocated shared memory is less than 16 KB or 48 KB, depending on GPU architecture and settings. There are also shared memory page size limitations, which mean that per-block allocations get rounded up to the next largest multiple of the page size.
  3. The number of blocks whose total per-block register usage is less than 8192/16384/32768, depending on architecture. There are also register file page sizes, which mean that per-block allocations get rounded up to the next largest multiple of the page size.

That is all there is to it. There is no "paging" of shared memory to accommodate more blocks. NVIDIA produces a spreadsheet for computing occupancy, which ships with the toolkit and is also available as a separate download. You can see the exact rules in the formulas it contains. They are also discussed in section 4.2 of the CUDA programming guide.
