CUDA: A question about active warps (active blocks) and how to choose the block size

Posted 2024-10-24 16:11:41

Suppose a CUDA GPU can keep 48 warps simultaneously active on one multiprocessor: that is, 48 blocks of 1 warp, or 24 blocks of 2 warps, and so on. Since all active warps from multiple blocks are scheduled for execution, it seems that the block size does not matter for GPU occupancy (provided it is a multiple of 32), so 32, 64, or 128 should make no difference, right? Is the block size then determined only by the computation task and the resource limits (shared memory or registers)?

Comments (2)

睫毛上残留的泪 2024-10-31 16:11:41

There are several factors worth considering that you omitted.

  • There is a limit on the number of active blocks per SM. At the time of writing the limit is 8 on all devices (compute capability 1.0 through 2.x), so to reach full occupancy your blocks should be no smaller than 3 warps (devices 1.0, 1.1), 4 warps (1.2, 1.3), or 6 warps (2.x).
  • Depending on the device, 8K, 16K, or 32K registers are available per multiprocessor. Registers are claimed per block, so the bigger your blocks, the coarser the granularity of register allocation. With big blocks, if one more block does not fit, you lose a lot of occupancy; with smaller blocks the loss may be smaller. That is why I personally prefer, for example, 2x256 over 1x512.
  • If you need synchronisation between warps in a block, bigger blocks let you synchronise across more warps.
  • A single block is guaranteed to be scheduled on a single multiprocessor. If all its warps share some common data (e.g. control variables), you can reduce the number of global memory fetches. With lots of small blocks, on the other hand, each block might need to load the same data separately. On Fermi, which has some caches, this matters less than on the GT200 series. Keep in mind, however, that with so many multiprocessors sharing it, an L2 cache of under 1 MB is still very, very small!
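
The first two points can be checked with a little arithmetic. The following is a minimal host-side C++ sketch, not a vendor tool: the `SmLimits` struct and both function names are hypothetical, the limits are the per-multiprocessor figures quoted above, and register use is simplified to threads times registers per thread, ignoring the allocation-unit rounding that real hardware applies.

```cpp
#include <algorithm>

// Hypothetical container for per-multiprocessor limits, filled in by hand
// from the figures quoted in this answer.
struct SmLimits {
    int maxWarps;   // resident warps per multiprocessor
    int maxBlocks;  // resident blocks per multiprocessor
    int registers;  // 32-bit registers per multiprocessor
};

constexpr int kWarpSize = 32;

// Smallest block size (in warps) that can still fill the multiprocessor
// when the block-count limit is the binding one.
int minWarpsPerBlock(const SmLimits& sm) {
    return (sm.maxWarps + sm.maxBlocks - 1) / sm.maxBlocks;  // ceiling division
}

// Resident warps for a given block size and per-thread register count.
// Assumption: a block needs threadsPerBlock * regsPerThread registers,
// with no rounding to hardware allocation units.
int residentWarps(const SmLimits& sm, int threadsPerBlock, int regsPerThread) {
    int warpsPerBlock = (threadsPerBlock + kWarpSize - 1) / kWarpSize;
    int byWarps = sm.maxWarps / warpsPerBlock;
    int byRegisters = sm.registers / (threadsPerBlock * regsPerThread);
    int blocks = std::min({sm.maxBlocks, byWarps, byRegisters});
    return blocks * warpsPerBlock;
}
```

With `SmLimits{48, 8, 32768}` for a 2.x device, `minWarpsPerBlock` gives 6, matching the first point. Assuming a kernel that uses 24 registers per thread (an illustrative number), `residentWarps` gives 32 of 48 warps for 512-thread blocks but 40 of 48 for 256-thread blocks, which is the granularity effect behind preferring 2x256 over 1x512.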
终止放荡 2024-10-31 16:11:41

No.
The block size does matter.

With a block size of 32 threads, occupancy is very low, because the limit on resident blocks per multiprocessor is reached long before the warp limit.
With a block size of 256, occupancy is high: enough blocks fit on a multiprocessor to fill its warp slots, resources permitting.
Going beyond 256 threads per block rarely makes much difference.

As the architecture involved is complex, testing with your own software is always the best approach.
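
As a rough check before measuring, the occupancy figures behind this answer can be estimated on the host. This is only a sketch under stated assumptions: the function name `occupancy` is hypothetical, the default limits of 48 warps and 8 blocks per multiprocessor are the compute capability 2.x figures mentioned elsewhere in this thread, and register and shared-memory limits are ignored entirely.

```cpp
// Fraction of a multiprocessor's warp slots that are filled when only the
// warp limit and the block-count limit apply.
// Assumption: defaults are the 2.x figures quoted in this thread;
// registers and shared memory are not modelled.
double occupancy(int threadsPerBlock, int maxWarps = 48, int maxBlocks = 8) {
    const int warpSize = 32;
    int warpsPerBlock = (threadsPerBlock + warpSize - 1) / warpSize;
    int blocks = maxWarps / warpsPerBlock;       // limited by warp slots
    if (blocks > maxBlocks) blocks = maxBlocks;  // limited by block slots
    return static_cast<double>(blocks * warpsPerBlock) / maxWarps;
}
```

Under these assumptions, blocks of 32 threads fill 8 of 48 warp slots (about 17%), blocks of 128 fill 32 of 48 (about 67%), and blocks of 192, 256, or 512 all reach 100%. A real kernel may land lower once registers and shared memory are counted, which is why measuring remains the final word.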
