CUDA: determining threads per block, blocks per grid

Published 2024-10-06 12:51:13

I'm new to the CUDA paradigm. My question is in determining the number of threads per block, and blocks per grid. Does a bit of art and trial play into this? What I've found is that many examples have seemingly arbitrary numbers chosen for these things.

I'm considering a problem where I would be able to pass matrices - of any size - to a method for multiplication, so that each element of C (as in C = A * B) would be calculated by a single thread. How would you determine the threads/block and blocks/grid in this case?
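
For concreteness, here is a minimal sketch of the setup being described, where one thread computes one element of C. The kernel body, the helper name launchMatMul, and the 16x16 block shape are illustrative assumptions, not something prescribed in the question.

// One thread per element of C (MxN) = A (MxK) * B (KxN); row-major storage assumed.
__global__ void matMulKernel(const float* A, const float* B, float* C,
                             int M, int N, int K)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // row of C this thread owns
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // column of C this thread owns
    if (row < M && col < N) {                          // the grid may overshoot the matrix edges
        float sum = 0.0f;
        for (int k = 0; k < K; ++k)
            sum += A[row * K + k] * B[k * N + col];
        C[row * N + col] = sum;
    }
}

// Host side: fix the block shape, then round the grid up so matrices of any size are covered.
void launchMatMul(const float* dA, const float* dB, float* dC, int M, int N, int K)
{
    dim3 block(16, 16);                        // 256 threads per block, a common starting point
    dim3 grid((N + block.x - 1) / block.x,     // ceil(N / 16) blocks across columns
              (M + block.y - 1) / block.y);    // ceil(M / 16) blocks across rows
    matMulKernel<<<grid, block>>>(dA, dB, dC, M, N, K);
}

The bounds check inside the kernel is what lets an arbitrary matrix size work with a fixed block shape: the last row and column of blocks simply have some idle threads.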

Comments (5)

一个人练习一个人 2024-10-13 12:51:13

In general you want to size your blocks/grid to match your data and simultaneously maximize occupancy, that is, how many threads are active at one time. The major factors influencing occupancy are shared memory usage, register usage, and thread block size.

A CUDA-enabled GPU has its processing capability split up into SMs (streaming multiprocessors), and the number of SMs depends on the actual card, but here we'll focus on a single SM for simplicity (they all behave the same). Each SM has a finite number of 32-bit registers, shared memory, a maximum number of active blocks, AND a maximum number of active threads. These numbers depend on the CC (compute capability) of your GPU and can be found in the middle of the Wikipedia article http://en.wikipedia.org/wiki/CUDA.

First of all, your thread block size should always be a multiple of 32, because kernels issue instructions in warps (32 threads). For example, if you have a block size of 50 threads, the GPU will still issue commands to 64 threads and you'd just be wasting them.
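
To spell out that rounding with the 50-thread example above (plain host-side arithmetic, nothing CUDA-specific):

#include <cstdio>

int main()
{
    // Instructions are issued per warp of 32 threads, so a 50-thread block still occupies 2 warps.
    int blockSize = 50;
    int warpsUsed = (blockSize + 31) / 32;       // ceil(50 / 32) = 2 warps
    int lanesIdle = warpsUsed * 32 - blockSize;  // 64 - 50 = 14 lanes do nothing on every instruction
    printf("warps used: %d, idle lanes: %d\n", warpsUsed, lanesIdle);
    return 0;
}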

Second, before worrying about shared memory and registers, try to size your blocks based on the maximum numbers of threads and blocks that correspond to the compute capability of your card. Sometimes there are multiple ways to do this... for example, on a CC 3.0 card each SM can have 16 active blocks and 2048 active threads. This means if you have 128 threads per block, you could fit 16 blocks in your SM before hitting the 2048-thread limit. If you use 256 threads, you can only fit 8 blocks, but you're still using all of the available threads and will still have full occupancy. However, using 64 threads per block will only use 1024 threads when the 16-block limit is hit, so only 50% occupancy. If shared memory and register usage is not a bottleneck, this should be your main concern (other than your data dimensions).
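
The arithmetic in that paragraph can be replayed directly. The sketch below hard-codes the CC 3.0 limits quoted above (16 blocks and 2048 threads per SM) and ignores shared memory and registers, which come next:

#include <cstdio>

int main()
{
    const int maxBlocksPerSM  = 16;     // CC 3.0 limit on resident blocks per SM
    const int maxThreadsPerSM = 2048;   // CC 3.0 limit on resident threads per SM
    const int blockSizes[]    = {64, 128, 256};

    for (int bs : blockSizes) {
        int blocksByThreads = maxThreadsPerSM / bs;   // how many blocks the thread cap allows
        int residentBlocks  = blocksByThreads < maxBlocksPerSM ? blocksByThreads : maxBlocksPerSM;
        float occupancy     = 100.0f * residentBlocks * bs / maxThreadsPerSM;
        printf("%3d threads/block -> %2d resident blocks, %5.1f%% occupancy\n",
               bs, residentBlocks, occupancy);
    }
    // Prints: 64 -> 16 blocks (50.0%), 128 -> 16 blocks (100.0%), 256 -> 8 blocks (100.0%).
    return 0;
}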

On the topic of your grid... the blocks in your grid are spread out over the SMs to start, and then the remaining blocks are placed into a pipeline. Blocks are moved into the SMs for processing as soon as there are enough resources in that SM to take the block. In other words, as blocks complete in an SM, new ones are moved in. You could make the argument that having smaller blocks (128 instead of 256 in the previous example) may complete faster since a particularly slow block will hog fewer resources, but this is very much dependent on the code.

Regarding registers and shared memory, look at those next, as they may be limiting your occupancy. Shared memory is finite for a whole SM, so try to use it in an amount that allows as many blocks as possible to still fit on an SM. The same goes for register use. Again, these numbers depend on compute capability and can be found tabulated on the Wikipedia page.
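
As a side note, newer CUDA toolkits can report this for a compiled kernel via cudaOccupancyMaxActiveBlocksPerMultiprocessor, which folds in the kernel's actual register and shared memory usage on the current device. A rough sketch, where the kernel is just an empty placeholder:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float* data) { }   // placeholder; a real kernel's register/shared memory usage changes the result

int main()
{
    int blockSize = 256;
    int maxActiveBlocks = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxActiveBlocks, myKernel, blockSize, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    float occupancy = 100.0f * maxActiveBlocks * blockSize / prop.maxThreadsPerMultiProcessor;
    printf("resident blocks/SM: %d, occupancy: %.0f%%\n", maxActiveBlocks, occupancy);
    return 0;
}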

神魇的王 2024-10-13 12:51:13

https://docs.nvidia.com/cuda/cuda-occupancy-calculator/index.html

The CUDA Occupancy Calculator allows you to compute the multiprocessor occupancy of a GPU by a given CUDA kernel. The multiprocessor occupancy is the ratio of active warps to the maximum number of warps supported on a multiprocessor of the GPU. Each multiprocessor on the device has a set of N registers available for use by CUDA program threads. These registers are a shared resource that are allocated among the thread blocks executing on a multiprocessor. The CUDA compiler attempts to minimize register usage to maximize the number of thread blocks that can be active in the machine simultaneously. If a program tries to launch a kernel for which the registers used per thread times the thread block size is greater than N, the launch will fail...
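
Newer CUDA toolkits also expose this calculation programmatically: cudaOccupancyMaxPotentialBlockSize suggests a block size for a given kernel based on the same occupancy model. A rough sketch with a placeholder kernel:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel(float* data, int n) { }   // placeholder; results depend on the real kernel's resource usage

int main()
{
    int minGridSize = 0;   // smallest grid size that can fully load the device
    int blockSize   = 0;   // block size the heuristic suggests for this kernel
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, dummyKernel, 0, 0);
    printf("suggested block size: %d (minimum grid size: %d)\n", blockSize, minGridSize);
    return 0;
}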

流绪微梦 2024-10-13 12:51:13

With rare exceptions, you should use a constant number of threads per block. The number of blocks per grid is then determined by the problem size, such as the matrix dimensions in the case of matrix multiplication.

Choosing the number of threads per block is very complicated. Most CUDA algorithms admit a large range of possibilities, and the choice is based on what makes the kernel run most efficiently. It is almost always a multiple of 32, and at least 64, because of how the thread scheduling hardware works. A good choice for a first attempt is 128 or 256.
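
A minimal 1D sketch of that rule of thumb, with a placeholder kernel and a constant 256-thread block; the block count comes from the problem size via a ceiling division:

__global__ void scaleKernel(float* data, int n)       // placeholder element-wise kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                                        // the last block may run past the end of the data
        data[i] *= 2.0f;
}

void launchScale(float* dData, int n)
{
    const int threadsPerBlock = 256;                                        // constant, per the advice above
    const int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;  // grows with the problem size
    scaleKernel<<<blocksPerGrid, threadsPerBlock>>>(dData, n);
}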

丑疤怪 2024-10-13 12:51:13

You also need to consider shared memory because threads in the same block can access the same shared memory. If you're designing something that requires a lot of shared memory, then more threads-per-block might be advantageous.

For example, in terms of context switching, any multiple of 32 works just the same. So for the 1D case, launching 1 block with 64 threads or 2 blocks with 32 threads each makes no difference for global memory accesses. However, if the problem at hand naturally decomposes into one length-64 vector, then the first option will be better than the second (less memory overhead, and every thread can access the same shared memory).
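
A small illustration of that point, assuming a hypothetical kernel launched with 64 threads per block: the __shared__ buffer is visible to the whole block, which is what one 64-thread block buys over two 32-thread blocks.

__global__ void reverseChunks(const float* in, float* out)   // assumes a launch like reverseChunks<<<numBlocks, 64>>>(dIn, dOut)
{
    __shared__ float buf[64];                  // shared by the 64 threads of this block only
    int i = threadIdx.x;

    buf[i] = in[blockIdx.x * blockDim.x + i];  // stage this block's 64-element slice
    __syncthreads();                           // now every thread can read every staged value

    out[blockIdx.x * blockDim.x + i] = buf[63 - i];   // e.g. reverse the slice; needs values staged by other threads
}

With two 32-thread blocks the same reversal would need a round trip through global memory, since each block could only see its own half of the vector.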

水波映月 2024-10-13 12:51:13

There is no silver bullet. The best number of threads per block depends a lot on the characteristics of the specific application being parallelized. CUDA's design guide recommends using a small number of threads per block when a function offloaded to the GPU has several barriers; however, there are experiments showing that for some applications a small number of threads per block increases synchronization overhead. In contrast, a larger number of threads per block may decrease the amount of synchronization and improve the overall performance.

For an in-depth discussion (too lengthy for StackOverflow) about the impact of the number of threads per block on CUDA kernels, check this journal article; it shows tests of different configurations of the number of threads per block in the NPB (NAS Parallel Benchmarks) suite, a set of CFD (Computational Fluid Dynamics) applications.
