How do I properly coalesce reads from global memory into shared memory with elements of type short or char (assuming one thread per element)?

Posted on 2024-12-28 12:52:58

I have a question about coalescing global memory loads in CUDA. Currently I need to be able to execute on a CUDA device with compute capability 1.1 or 1.3.

I am writing a CUDA kernel function which reads an array of type T from global memory into shared memory, does some computation, and then writes an array of type T back out to global memory. I am using shared memory because the computation for each output element actually depends not only on the corresponding input element, but also on the nearby input elements. I only want to load each input element once, hence I want to cache the input elements in shared memory.

My plan is to have each thread read one element into shared memory, then __syncthreads() before beginning the computation. In this scenario, each thread loads, computes, and stores one element (although the computation depends on elements loaded into shared memory by other threads).

For this question I want to focus on the read from global memory into shared memory.

Assuming that there are N elements in the array, I have configured CUDA to execute a total of N threads. For the case where sizeof(T) == 4, this should coalesce nicely according to my understanding of CUDA, since thread K will read word K (where K is the thread index).
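
For concreteness, here is a minimal sketch of that one-thread-per-element plan for sizeof(T) == 4, using T = float. The kernel name, BLOCK_SIZE, the assumption that the kernel is launched with BLOCK_SIZE threads per block, and the clamped three-point average standing in for the real neighbour-dependent computation are all illustrative details, not from the original question:

#define BLOCK_SIZE 256   // assumed threads per block

__global__ void processFloatKernel(const float *in, float *out, int n)
{
    __shared__ float cache[BLOCK_SIZE];

    int k = blockIdx.x * blockDim.x + threadIdx.x;   // global element index

    // Thread K reads word K: consecutive threads touch consecutive 32-bit
    // words, so this load coalesces on compute capability 1.x.
    if (k < n)
        cache[threadIdx.x] = in[k];

    __syncthreads();

    if (k < n) {
        // Stand-in computation that depends on neighbouring cached elements
        // (halo handling at block boundaries is omitted for brevity).
        int left  = max((int)threadIdx.x - 1, 0);
        int right = min((int)threadIdx.x + 1, BLOCK_SIZE - 1);
        out[k] = 0.25f * cache[left] + 0.5f * cache[threadIdx.x] + 0.25f * cache[right];
    }
}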

However, in the case where sizeof(T) < 4, for example if T=unsigned char or if T=short, then I think there may be a problem. In this case, my (naive) plan (sketched in code after the list below) is:

  • Compute numElementsPerWord = 4 / sizeof(T)
  • if(K % numElementsPerWord == 0), then have thread K read the next full 32-bit word
  • store the 32 bit word in shared memory
  • after the shared memory has been populated, (and __syncthreads() called) then each thread K can process work on computing output element K
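
A sketch of this plan for T = short (so numElementsPerWord == 2) might look like the following. The kernel name, BLOCK_SIZE, and the pass-through "computation" are illustrative, and it assumes N is even, the input pointer is 4-byte aligned (true for cudaMalloc allocations), and one thread is launched per element with BLOCK_SIZE threads per block. Note that only every other thread issues a load, which is exactly the access pattern asked about:

#define BLOCK_SIZE 256   // assumed threads per block

__global__ void naiveShortLoadKernel(const short *in, short *out, int n)
{
    __shared__ short cache[BLOCK_SIZE];

    int k = blockIdx.x * blockDim.x + threadIdx.x;      // one thread per element
    const int numElementsPerWord = 4 / sizeof(short);   // == 2

    if (k < n && k % numElementsPerWord == 0) {
        // Thread K reads the next full 32-bit word (two shorts); odd threads idle.
        short2 w = reinterpret_cast<const short2 *>(in)[k / numElementsPerWord];
        cache[threadIdx.x]     = w.x;
        cache[threadIdx.x + 1] = w.y;
    }

    __syncthreads();

    if (k < n)
        out[k] = cache[threadIdx.x];                     // placeholder computation
}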

My concern is that it will not coalesce because (for example, in the case where T=short)

  • Thread 0 reads word 0 from global memory
  • Thread 1 does not read
  • Thread 2 reads word 1 from global memory
  • Thread 3 does not read
  • etc...

In other words, thread K reads word K / numElementsPerWord (i.e. word K * sizeof(T) / 4). This would seem to not coalesce properly.

An alternative approach that I considered was:

  • Launch with number of threads = (N + 3) / 4, such that each thread will be responsible for loading and processing 4/sizeof(T) elements (each thread processes one 32-bit word - possibly 1, 2, or 4 elements depending on sizeof(T)). However I am concerned that this approach will not be as fast as possible, since each thread must then do twice (if T=short) or even four times (if T=unsigned char) the amount of processing.

Can someone please tell me if my assumption about my plan is correct, i.e. it will not coalesce properly?

Can you please comment on my alternative approach?

Can you recommend a more optimal approach that properly coalesces?

Comments (1)

厌倦 2025-01-04 12:52:58

You are correct, you have to do loads of at least 32 bits in size to get coalescing, and the scheme you describe (having every other thread do a load) will not coalesce. Just shift the offset right by 2 bits and have each thread do a contiguous 32-bit load, and use conditional code to inhibit execution for threads that would operate on out-of-range addresses.

Since you are targeting SM 1.x, note also that 1) in order for coalescing to happen, the address accessed by thread 0 of a given half-warp (a group of 16 threads) must be 64-, 128- or 256-byte aligned for 4-, 8- and 16-byte operands, respectively, and 2) once your data is in shared memory, you may want to unroll your loop by 2x (for short) or 4x (for char) so adjacent threads reference adjacent 32-bit words, to avoid shared memory bank conflicts.
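
A minimal sketch of this suggestion for T = short might look like the following. The kernel name, BLOCK_SIZE, the trivial "+ 1" placeholder computation, and the short2 store are illustrative choices, not part of the answer; it assumes an even element count, 4-byte-aligned buffers (true for cudaMalloc allocations), and one thread per 32-bit word with BLOCK_SIZE threads per block. Every active thread issues one contiguous 32-bit load, and the shared-memory indexing follows the "unroll by 2x" advice so adjacent threads touch adjacent 32-bit words:

#define BLOCK_SIZE 256   // assumed threads per block

__global__ void coalescedShortKernel(const short *in, short *out, int n)
{
    __shared__ unsigned int cacheWords[BLOCK_SIZE];

    int numWords = n / 2;                               // two shorts per 32-bit word
    int word = blockIdx.x * blockDim.x + threadIdx.x;   // one 32-bit word per thread

    // Contiguous 32-bit loads: thread K reads word K, which coalesces on SM 1.x.
    // The guard inhibits threads that would read past the end of the array.
    if (word < numWords)
        cacheWords[threadIdx.x] = reinterpret_cast<const unsigned int *>(in)[word];

    __syncthreads();

    // Reinterpret the cached words as shorts for the per-element computation.
    // Indexing with 2 * threadIdx.x + i keeps adjacent threads on adjacent
    // 32-bit words of shared memory, avoiding bank conflicts on SM 1.x.
    const short *cache = reinterpret_cast<const short *>(cacheWords);
    if (word < numWords) {
        short2 result;
        result.x = (short)(cache[2 * threadIdx.x] + 1);      // placeholder computation
        result.y = (short)(cache[2 * threadIdx.x + 1] + 1);  // placeholder computation
        reinterpret_cast<short2 *>(out)[word] = result;      // one coalesced 32-bit store
    }
}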
