In CUDA, what is memory coalescing, and how is it achieved?

Posted 2024-10-18 13:04:06

What is "coalesced" in CUDA global memory transaction? I couldn't understand even after going through my CUDA guide. How to do it? In CUDA programming guide matrix example, accessing the matrix row by row is called "coalesced" or col.. by col.. is called coalesced?
Which is correct and why?

Comments (4)

秋千易 2024-10-25 13:04:06

It's likely that this information applies only to compute capability 1.x, or CUDA 2.0. More recent architectures and CUDA 3.0 have more sophisticated global memory access, and in fact "coalesced global loads" are not even profiled for these chips.

Also, this logic can be applied to shared memory to avoid bank conflicts.


A coalesced memory transaction is one in which all of the threads in a half-warp access global memory at the same time. That is an oversimplification, but the correct way to do it is just to have consecutive threads access consecutive memory addresses.

So, if threads 0, 1, 2, and 3 read global memory 0x0, 0x4, 0x8, and 0xc, it should be a coalesced read.

In a matrix example, keep in mind that you want your matrix to reside linearly in memory. You can do this however you want, and your memory access should reflect how your matrix is laid out. So, the 3x4 matrix below

0 1 2 3
4 5 6 7
8 9 a b

could be done row after row, like this, so that (r,c) maps to memory (r*4 + c)

0 1 2 3 4 5 6 7 8 9 a b

Suppose you need to access each element once, and say you have four threads. Which threads will be used for which elements? Probably either

thread 0:  0, 1, 2
thread 1:  3, 4, 5
thread 2:  6, 7, 8
thread 3:  9, a, b

or

thread 0:  0, 4, 8
thread 1:  1, 5, 9
thread 2:  2, 6, a
thread 3:  3, 7, b

Which is better? Which will result in coalesced reads, and which will not?

Either way, each thread makes three accesses. Let's look at the first access and see if the threads access memory consecutively. In the first option, the first access is 0, 3, 6, 9. Not consecutive, not coalesced. In the second option, it's 0, 1, 2, 3. Consecutive! Coalesced! Yay!
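Here is a minimal CUDA sketch of the two options above (the kernel names, the per-thread sum, and the single-block four-thread launch are illustrative assumptions, not part of the original example):

#include <cstdio>
#include <cuda_runtime.h>

// Option 1: thread t reads elements t*3 .. t*3+2.
// On each iteration the four threads touch addresses 0, 3, 6, 9
// (then 1, 4, 7, 10, ...), which are not consecutive: not coalesced.
__global__ void strided_read(const float *in, float *out, int per_thread)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    for (int k = 0; k < per_thread; ++k)
        sum += in[t * per_thread + k];
    out[t] = sum;
}

// Option 2: thread t reads elements t, t+4, t+8.
// On each iteration the four threads touch addresses 0, 1, 2, 3
// (then 4, 5, 6, 7, ...), which are consecutive: coalesced.
__global__ void coalesced_read(const float *in, float *out, int per_thread)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    int nthreads = gridDim.x * blockDim.x;
    float sum = 0.0f;
    for (int k = 0; k < per_thread; ++k)
        sum += in[t + k * nthreads];
    out[t] = sum;
}

int main()
{
    const int nthreads = 4, per_thread = 3, n = nthreads * per_thread;
    float h_in[n], h_out[nthreads];
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, nthreads * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    coalesced_read<<<1, nthreads>>>(d_in, d_out, per_thread);
    cudaMemcpy(h_out, d_out, nthreads * sizeof(float), cudaMemcpyDeviceToHost);
    for (int t = 0; t < nthreads; ++t)
        printf("thread %d: sum = %g\n", t, h_out[t]);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}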

The best way is probably to write your kernel and then profile it to see if you have non-coalesced global loads and stores.

白色秋天 2024-10-25 13:04:06

Memory coalescing is a technique which allows optimal usage of the global memory bandwidth.
That is, when parallel threads running the same instruction access consecutive locations in global memory, the most favorable access pattern is achieved.

[Figure: (a) linear storage of n vectors of length m; (b) the coalesced storage arrangement]

The example in the figure above helps explain the coalesced arrangement:

In Fig. (a), n vectors of length m are stored in a linear fashion. Element i of vector j is denoted by v_{j,i}. Each thread in the GPU kernel is assigned to one m-length vector. Threads in CUDA are grouped in an array of blocks, and every thread in the GPU has a unique id which can be defined as indx = bd*bx + tx, where bd represents the block dimension, bx denotes the block index, and tx is the thread index within each block.

Vertical arrows demonstrate the case in which parallel threads access the first components of each vector, i.e. addresses 0, m, 2m, ... of the memory. As shown in Fig. (a), in this case the memory access is not consecutive. By closing the gap between these addresses to zero (the red arrows shown in the figure above), the memory access becomes coalesced.

However, the problem gets slightly tricky here, since the number of resident threads per GPU block is limited to bd. Therefore the coalesced data arrangement can be done by storing the first elements of the first bd vectors in consecutive order, followed by the first elements of the second bd vectors, and so on. The rest of the vector elements are stored in a similar fashion, as shown in Fig. (b). If n (the number of vectors) is not a multiple of bd, the remaining data in the last block needs to be padded with some trivial value, e.g. 0.

In the linear data storage of Fig. (a), component i (0 ≤ i < m) of vector indx
(0 ≤ indx < n) is addressed by m × indx + i; the same component in the coalesced
storage pattern of Fig. (b) is addressed as

(m × bd) × ixC + bd × ixB + ixA,

where ixC = floor[(m × indx + i) / (m × bd)] = bx, ixB = i, and ixA = mod(indx, bd) = tx.

In summary, in the example of storing a number of vectors of size m, linear indexing is mapped to coalesced indexing according to:

m × indx + i −→ m × bd × bx + i × bd + tx

This data rearrangement can lead to significantly higher memory bandwidth from GPU global memory.
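A host-side sketch of this rearrangement (the function name to_coalesced is a hypothetical helper; it assumes n has already been padded to a multiple of bd as described above):

#include <vector>

// Rearrange n vectors of length m from linear storage, where vector indx
// keeps element i at [m*indx + i], into the coalesced layout, where
// indx = bd*bx + tx.
std::vector<float> to_coalesced(const std::vector<float> &lin, int n, int m, int bd)
{
    std::vector<float> co(lin.size());
    for (int indx = 0; indx < n; ++indx) {
        int bx = indx / bd;   // block index of the thread owning this vector
        int tx = indx % bd;   // thread index within that block
        for (int i = 0; i < m; ++i)
            // m*indx + i  -->  (m*bd)*bx + bd*i + tx
            co[(m * bd) * bx + bd * i + tx] = lin[m * indx + i];
    }
    return co;
}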


source: "GPU‐based acceleration of computations in nonlinear finite element deformation analysis." International journal for numerical methods in biomedical engineering (2013).

潜移默化 2024-10-25 13:04:06

If the threads in a block are accessing consecutive global memory locations, then all the accesses are combined into a single request (or coalesced) by the hardware. In the matrix example, the elements of a row are arranged linearly in memory, followed by the next row, and so on.
For example, with a 2x2 matrix and 2 threads in a block, the memory locations are arranged as:

(0,0) (0,1) (1,0) (1,1)

In row access, where each thread walks its own row, the two threads' first accesses are (0,0) and (1,0), which are two elements apart in memory and cannot be coalesced.
In column access, where each thread walks its own column, the first accesses are (0,0) and (0,1), which can be coalesced because they are adjacent.
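As a sketch, the two indexing schemes for this 2x2 example look like the following in a kernel (the kernel and variable names are illustrative):

// Row-major 2x2 matrix stored as mat[4]: (0,0) (0,1) (1,0) (1,1).
__global__ void access_2x2(const float *mat, float *out)
{
    int t = threadIdx.x;  // t = 0 or 1

    // Row access: thread t walks row t, i.e. mat[t*2 + c].
    // First access: threads 0 and 1 hit mat[0] and mat[2] -- not adjacent.
    float row_sum = mat[t * 2 + 0] + mat[t * 2 + 1];

    // Column access: thread t walks column t, i.e. mat[r*2 + t].
    // First access: threads 0 and 1 hit mat[0] and mat[1] -- adjacent, coalesced.
    float col_sum = mat[0 * 2 + t] + mat[1 * 2 + t];

    out[t] = row_sum + col_sum;
}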

遮云壑 2024-10-25 13:04:06

The criteria for coalescing are nicely documented in the CUDA 3.2 Programming Guide, Section G.3.2. The short version is as follows: threads in the warp must access the memory in sequence, and the words being accessed should be >= 32 bits. Additionally, the base address accessed by the warp should be 64-, 128-, or 256-byte aligned for 32-, 64-, and 128-bit accesses, respectively.

Tesla2 and Fermi hardware does an okay job of coalescing 8- and 16-bit accesses, but those are best avoided if you want peak bandwidth.

Note that despite improvements in Tesla2 and Fermi hardware, coalescing is BY NO MEANS obsolete. Even on Tesla2 or Fermi class hardware, failing to coalesce global memory transactions can result in a 2x performance hit. (On Fermi class hardware, this seems to be true only when ECC is enabled. Contiguous-but-uncoalesced memory transactions take about a 20% hit on Fermi.)
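One way to see the alignment effect for yourself is to time a copy kernel whose base address is shifted by a variable offset (a common micro-benchmark pattern, sketched below; the exact penalty depends on the GPU, driver, and ECC setting):

// With offset = 0 the warp's accesses start on an aligned segment boundary;
// with offset = 1..31 the same contiguous accesses straddle segment
// boundaries and, on the hardware discussed above, no longer coalesce
// into a single transaction. Time this kernel for each offset to expose
// the penalty.
__global__ void offset_copy(const float *in, float *out, int offset, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x + offset;
    if (i < n)
        out[i] = in[i];
}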
