CUDA: partitioning methods for *huge* problems?

Posted 2024-11-02 07:13:14


All this CUDA lark is head-melting in its power, but something I've been wondering about is the hard limits on 1D block/grid dimensions (usually 512 and 65535 respectively).

When dealing with problems that are much larger in their scope (in the order of billions), is there an automated programmatic way of effectively setting a 'queue' through a kernel, or is it a case of manual slicing and dicing?

How does everyone deal with problem-partitioning?

Comments (2)

七婞 2024-11-09 07:13:14


There are two basic ways of partitioning your data so that you can work on it using CUDA:

  1. Break the data down into contiguous chunks, such that each thread works on one chunk.
  2. Each thread nibbles at one element of the data. When all the threads are done, they shift themselves along by numberOfThreads and repeat (see the sketch after this list).

I have explained these techniques with simple examples here. Method 2 is typically easier to code and work with for most tasks.
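
A minimal sketch of method 2, assuming a hypothetical element-wise kernel; the name scale_kernel and the launch configuration are illustrative, not from the answer above:

// Method 2: each thread processes one element, then strides forward by
// the total number of threads in the grid until the array is exhausted.
__global__ void scale_kernel(float *data, float factor, size_t n)
{
    size_t numberOfThreads = (size_t)blockDim.x * gridDim.x;
    for (size_t i = threadIdx.x + (size_t)blockIdx.x * blockDim.x;
         i < n;
         i += numberOfThreads)
        data[i] *= factor;
}

// Launch with a fixed-size 1D grid; n may be in the billions, far larger
// than the number of threads actually launched, e.g.:
//     scale_kernel<<<1024, 256>>>(d_data, 2.0f, n);

This sidesteps the grid-size limit entirely: the number of loop iterations per thread grows with n while the grid stays fixed.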

十雾 2024-11-09 07:13:14


If one dimensional grids are too small, just use two dimensional (or three dimensional on Fermi with CUDA 4.0) grids instead. Dimensionality in grid and block layouts is really only for convenience - it makes the execution space look like the sort of common data-parallel input spaces programmers are used to working with (matrices, grids, voxels, etc.). But it is only a very small abstraction away from the underlying simple linear numbering scheme, which can handle over 10^12 unique thread IDs within a single kernel launch.

In grids, ordering is column major, so if you had a 1D grid problem before, the "unique, 1D thread index" was calculated as:

unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;

which has a theoretical upper limit of 512 * 65535 = 33553920 unique threads. The equivalent 2D grid problem is only a simple extension of the 1D case:

size_t tidx = threadIdx.x + blockIdx.x * blockDim.x;
size_t tid = tidx + (size_t)blockIdx.y * blockDim.x * gridDim.x;

which has a theoretical upper limit of 512 * 65535 * 65535 = 2198956147200 unique threads. Fermi will let you add a third dimension to the grid, also of 65535 maximum size, which gives up to about 10^17 threads in a single execution grid. Which is rather a lot.
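
A minimal sketch of covering a 1D problem of several billion elements with a 2D grid, using the index calculation above; the kernel name and the host-side launch arithmetic are illustrative assumptions:

#include <cuda_runtime.h>

// Flatten the 2D grid of 1D blocks into one linear index, as above,
// and guard against the tail where the grid overshoots n.
__global__ void increment_kernel(float *data, size_t n)
{
    size_t tidx = threadIdx.x + (size_t)blockIdx.x * blockDim.x;
    size_t tid  = tidx + (size_t)blockIdx.y * blockDim.x * gridDim.x;
    if (tid < n)
        data[tid] += 1.0f;
}

// Host side: fix the block size and grid x dimension, then size the
// grid y dimension so that block * gridX * gridY >= n.
void launch_increment(float *d_data, size_t n)
{
    unsigned int block = 512;
    unsigned int gridX = 65535;
    size_t perRow = (size_t)block * gridX;
    unsigned int gridY = (unsigned int)((n + perRow - 1) / perRow);
    dim3 grid(gridX, gridY);
    increment_kernel<<<grid, block>>>(d_data, n);
}

For n = 4 billion this yields gridY of about 120, comfortably inside the 65535 limit, with one thread per element and no per-thread looping.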
