CUDA: partitioning methods for *huge* problems?

Posted 2024-11-02 07:13:14


All this CUDA lark is head-melting in its power, but something I've been wondering about is the hard limits on 1D block/grid dimensions (usually 512 and 65535 respectively).

When dealing with problems that are much larger in their scope (in the order of billions), is there an automated programmatic way of effectively setting a 'queue' through a kernel, or is it a case of manual slicing and dicing?

How does everyone deal with problem-partitioning?

Comments (2)

七婞 2024-11-09 07:13:14


There are two basic ways of partitioning your data so that you can work on it using CUDA:

  1. Break the data down into contiguous chunks, such that each thread works on one chunk.
  2. Each thread nibbles at one element of the data. When all the threads are done, they shift themselves along by numberOfThreads and repeat (see the sketch after this list).

I have explained these techniques with simple examples here. Method 2 is typically easier to code and work with for most tasks.
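
A minimal sketch of method 2, assuming a hypothetical element-wise kernel; the name scale_kernel and the launch configuration are illustrative, not from the answer above:

// Method 2: each thread processes one element, then strides forward by
// the total number of threads in the grid until the array is exhausted.
__global__ void scale_kernel(float *data, float factor, size_t n)
{
    size_t numberOfThreads = (size_t)blockDim.x * gridDim.x;
    for (size_t i = threadIdx.x + (size_t)blockIdx.x * blockDim.x;
         i < n;
         i += numberOfThreads)
        data[i] *= factor;
}

// Launch with a fixed-size 1D grid; n may be in the billions, far larger
// than the number of threads actually launched, e.g.:
//     scale_kernel<<<1024, 256>>>(d_data, 2.0f, n);

This sidesteps the grid-size limit entirely: the number of loop iterations per thread grows with n while the grid stays fixed.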

十雾 2024-11-09 07:13:14


If one dimensional grids are too small, just use two dimensional (or three dimensional on Fermi with CUDA 4.0) grids instead. Dimensionality in grid and block layouts is really only for convenience - it makes the execution space look like the sort of common data-parallel input spaces programmers are used to working with (matrices, grids, voxels, etc.). But it is only a very small abstraction away from the underlying simple linear numbering scheme, which can handle over 10^12 unique thread IDs within a single kernel launch.

In grids, ordering is column major, so if you had a 1D grid problem before, the "unique, 1D thread index" was calculated as:

unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;

which has a theoretical upper limit of 512 * 65535 = 33553920 unique threads. The equivalent 2D grid problem is only a simple extension of the 1D case:

size_t tidx = threadIdx.x + blockIdx.x * blockDim.x;
size_t tid = tidx + (size_t)blockIdx.y * blockDim.x * gridDim.x;

which has a theoretical upper limit of 512 * 65535 * 65535 = 2198956147200 unique threads. Fermi will let you add a third dimension to the grid, also of 65535 maximum size, which gives up to about 10^17 threads in a single execution grid. Which is rather a lot.
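
A minimal sketch of covering a 1D problem of several billion elements with a 2D grid, using the index calculation above; the kernel name and the host-side launch arithmetic are illustrative assumptions:

#include <cuda_runtime.h>

// Flatten the 2D grid of 1D blocks into one linear index, as above,
// and guard against the tail where the grid overshoots n.
__global__ void increment_kernel(float *data, size_t n)
{
    size_t tidx = threadIdx.x + (size_t)blockIdx.x * blockDim.x;
    size_t tid  = tidx + (size_t)blockIdx.y * blockDim.x * gridDim.x;
    if (tid < n)
        data[tid] += 1.0f;
}

// Host side: fix the block size and grid x dimension, then size the
// grid y dimension so that block * gridX * gridY >= n.
void launch_increment(float *d_data, size_t n)
{
    unsigned int block = 512;
    unsigned int gridX = 65535;
    size_t perRow = (size_t)block * gridX;
    unsigned int gridY = (unsigned int)((n + perRow - 1) / perRow);
    dim3 grid(gridX, gridY);
    increment_kernel<<<grid, block>>>(d_data, n);
}

For n = 4 billion this yields gridY of about 120, comfortably inside the 65535 limit, with one thread per element and no per-thread looping.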
