Why are there no bank conflicts in CUDA/OpenCL global memory?

Posted 2024-09-25 18:23:29

One thing I haven't figured out, and Google isn't helping me with, is why it's possible to have bank conflicts with shared memory, but not in global memory. Can there be bank conflicts with registers?

UPDATE
Wow, I really appreciate the two answers from Tibbit and Grizzly. It seems that I can only give the green check mark to one answer, though. I'm new-ish to Stack Overflow. I guess I have to pick one answer as the best. Can I do something to say thank you to the answer I don't give the green check to?

Comments (4)

夏至、离别 2024-10-02 18:23:29

Short Answer: There are no bank conflicts in either global memory or in registers.

Explanation:

The key to understanding why is to grasp the granularity of the operations. A single thread does not access global memory on its own: global memory accesses are "coalesced". Since global memory is so slow, accesses by the threads within a block are grouped together to make as few requests to global memory as possible.

Shared memory can be accessed by threads simultaneously. When two threads attempt to access different addresses within the same bank, this causes a bank conflict.

Registers cannot be accessed by any thread except the one to which they are allocated. Since you can't read or write to my registers, you can't block me from accessing them -- hence, there aren't any bank conflicts.
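
As a concrete sketch of these rules, the short Python model below assumes a 32-bank shared memory with 4-byte-wide banks (the common NVIDIA layout; the exact numbers vary by generation), and the helper names are made up for illustration:

```python
NUM_BANKS = 32          # banks in shared memory (assumed)
WORD_BYTES = 4          # each bank serves one 4-byte word per cycle

def bank_of(byte_address):
    """Bank index of a shared-memory byte address."""
    return (byte_address // WORD_BYTES) % NUM_BANKS

def conflict_degree(addresses):
    """Worst-case serialization factor for one warp's shared-memory access.

    Accesses to distinct words in the same bank are serialized, so the
    cost is the maximum number of *distinct* words mapped to any bank.
    (Threads reading the exact same word are served by a broadcast.)
    """
    words_per_bank = {}
    for a in addresses:
        words_per_bank.setdefault(bank_of(a), set()).add(a // WORD_BYTES)
    return max(len(words) for words in words_per_bank.values())

# 32 threads reading consecutive floats: one word per bank, no conflict.
unit_stride = [tid * 4 for tid in range(32)]
print(conflict_degree(unit_stride))   # 1  (conflict-free)

# Stride-2 float access: threads 0 and 16 both hit bank 0, and so on.
stride_two = [tid * 8 for tid in range(32)]
print(conflict_degree(stride_two))    # 2  (two-way conflict)
```

Note that a read where every thread touches the exact same word is served by a broadcast, which is why `conflict_degree([0] * 32)` is still 1 in this model.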

Who can read & write to global memory?

Only blocks. A single thread can make an access, but the transaction will be processed at the block level (actually the warp / half-warp level, but I'm trying to keep it simple). If two blocks access the same memory, I don't believe it will take longer, and it may even be accelerated by the L1 cache in the newest devices -- though this isn't transparently evident.

Who can read & write to shared memory?

Any thread within a given block. If you only have 1 thread per block you can't have a bank conflict, but you won't have reasonable performance. Bank conflicts occur because a block is allocated several threads, say 512, and they're all vying for different addresses within the same bank (not quite the same address). There are some excellent pictures of these conflicts at the end of the CUDA C Programming Guide -- Figure G-2, on page 167 (actually page 177 of the PDF). Link to version 3.2
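
The guide's figures can be reproduced with the same sort of back-of-the-envelope model: reading a column of a 32x32 float tile in shared memory puts every thread in one bank, and the well-known one-element padding trick spreads the column across all banks (32 banks of 4-byte words assumed; `column_banks` is a made-up helper):

```python
NUM_BANKS = 32  # assumed bank count

def bank(word_index):
    # Consecutive 4-byte words go to consecutive banks.
    return word_index % NUM_BANKS

def column_banks(row_width_words, column=0, rows=32):
    """Banks touched when 32 threads each read one element of a column."""
    return [bank(r * row_width_words + column) for r in range(rows)]

# tile[32][32]: every element of column 0 lands in bank 0 -> 32-way conflict.
print(len(set(column_banks(32))))   # 1

# tile[32][33] (one float of padding per row): banks 0..31 -> conflict-free.
print(len(set(column_banks(33))))   # 32
```

The padding works because a row width of 33 words shifts each successive row's column element into the next bank.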

Who can read & write to registers?

Only the specific thread to which it is allocated. Hence, only one thread accesses it at any one time.

笑脸一如从前 2024-10-02 18:23:29

Whether or not there can be bank conflicts on a given type of memory obviously depends on the structure of the memory, and therefore on its purpose.

So why is shared memory designed in a way which allows for bank conflicts?

That's relatively simple: it's not easy to design a memory controller which can handle independent accesses to the same memory simultaneously (proven by the fact that most can't). So, in order to allow each thread in a half-warp to access an individually addressed word, the memory is banked, with an independent controller for each bank (at least that's how one can think about it; I'm not sure about the actual hardware). These banks are interleaved to make sequential threads accessing sequential memory fast. Each of these banks can handle one request at a time, ideally allowing for concurrent execution of all requests in the half-warp (obviously this model can theoretically sustain higher bandwidth due to the independence of those banks, which is also a plus).

What about registers?

Registers are designed to be accessed as operands for ALU instructions, meaning they have to be accessed with very low latency. Therefore they get more transistors per bit to make that possible. I'm not sure exactly how registers are accessed in modern processors (not the kind of information you need often, and not that easy to find out). However, it would obviously be highly impractical to organize registers in banks (for simpler architectures you typically see all registers hanging off one big multiplexer). So no, there won't be bank conflicts for registers.

Global memory

First of all, global memory works at a different granularity than shared memory. Memory is accessed in 32-, 64- or 128-byte blocks (for GT200 at least; for Fermi it is always 128 B, but cached; AMD is a bit different), and every time you want something from a block, the whole block is accessed/transferred. That is why you need coalesced accesses: if every thread accesses memory in a different block, you have to transfer all of those blocks.
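
That block granularity can be modeled by counting how many segments one warp's addresses touch (128-byte segments assumed here; the real coalescing rules differ per architecture and are more forgiving on newer chips):

```python
SEGMENT_BYTES = 128  # assumed transfer granularity

def segments_touched(addresses):
    """Number of 128-byte segments a warp's byte addresses fall into.

    In this toy model, the warp pays one memory transaction per
    distinct segment, so fewer segments means better coalescing.
    """
    return len({a // SEGMENT_BYTES for a in addresses})

# Coalesced: 32 threads read consecutive 4-byte words -> one 128 B segment.
print(segments_touched([tid * 4 for tid in range(32)]))      # 1

# Strided by 128 B: every thread lands in its own segment -> 32 transfers.
print(segments_touched([tid * 128 for tid in range(32)]))    # 32
```

A misaligned but otherwise sequential access (e.g. starting at byte 4) would touch two segments in this model, costing one extra transfer.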

But who says there aren't bank conflicts? I'm not completely sure about this, because I haven't found any actual sources to support it for NVIDIA hardware, but it seems logical:

The global memory is typically distributed across several RAM chips (which can easily be verified by looking at a graphics card). It would make sense if each of these chips were like a bank of local memory, so that several simultaneous requests to the same bank would produce a bank conflict. However, the effects would be much less pronounced, for one thing, since most of the time consumed by a memory access is the latency of getting the data from A to B anyway. Also, the effect won't be noticeable "inside" one workgroup: only one half-warp executes at a time, and if that half-warp issues more than one request you have an uncoalesced memory access, so you are already taking a hit, which makes the effect of such a conflict hard to measure. You would therefore only get conflicts when several workgroups try to access the same bank.

In your typical GPGPU situation you have a large dataset lying in sequential memory, so the effects shouldn't really be noticeable, since enough other workgroups are accessing the other banks at the same time. But it should be possible to construct situations where the dataset is concentrated on just a few banks, which would mean a hit on bandwidth (the maximal bandwidth comes from distributing accesses equally over all banks, so each bank only has a fraction of that bandwidth).

Again, I haven't read anything to prove this theory for NVIDIA hardware (mostly everything focuses on coalescing, which of course is more important, since it makes this a non-problem for natural datasets). However, according to the ATI Stream computing guide, this is the situation for Radeon cards (for the 5xxx series: banks are 2 KB apart, and you want to make sure that you distribute your accesses, meaning from all workgroups simultaneously active, equally over all banks), so I would imagine that NVIDIA cards behave similarly.

Of course, for most scenarios the possibility of bank conflicts on global memory is a non-issue, so in practice you can say:

  • Watch for coalescing when accessing global memory
  • Watch for bank conflicts when accessing local memory
  • No problems with accessing registers
南街女流氓 2024-10-02 18:23:29

Multiple threads accessing the same bank does not necessarily mean there is a bank conflict. There is a conflict if threads want to read at the same time from different rows within the same bank.
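
A toy model makes the distinction concrete; the row size and bank count below are illustrative only, not any particular device's geometry:

```python
ROW_BYTES = 2048   # assumed DRAM row (page) size
NUM_BANKS = 8      # assumed bank count

def bank_row(addr):
    """Map a byte address to (bank, row-within-bank), rows interleaved."""
    row = addr // ROW_BYTES
    return (row % NUM_BANKS, row // NUM_BANKS)

def conflicts(addr_a, addr_b):
    """Two simultaneous requests conflict only if they hit the same bank
    but name different rows (a bank keeps one row open at a time)."""
    bank_a, row_a = bank_row(addr_a)
    bank_b, row_b = bank_row(addr_b)
    return bank_a == bank_b and row_a != row_b

print(conflicts(0, 100))              # False: same bank, same open row
print(conflicts(0, ROW_BYTES))        # False: different banks
print(conflicts(0, ROW_BYTES * 8))    # True: same bank, different rows
```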

携余温的黄昏 2024-10-02 18:23:29

Why is it possible to have bank conflicts with shared memory, but not in global memory?

Bank conflicts and channel conflicts do exist for global memory accesses. Maximum global memory bandwidth is only achieved when memory channels and banks are accessed evenly, in a round-robin manner. For linear memory accesses to a single 1D array, the memory controller is usually designed to automatically interleave memory requests across all banks and channels evenly. However, when multiple 1D arrays (or different rows of a multi-dimensional array) are accessed at the same time, and their base addresses are multiples of the size of a memory channel or bank, imperfect memory interleaving may occur. In this case, one channel or bank is hit harder than another, serializing memory accesses and reducing the available global memory bandwidth.
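
A toy model of this base-address effect, using an assumed power-of-two interleave (the sizes and the mapping below are illustrative assumptions, not any real GPU's layout):

```python
CHANNEL_INTERLEAVE = 2048   # bytes served by one channel before switching (assumed)
NUM_CHANNELS = 8            # assumed channel count

def channel_of(byte_address):
    return (byte_address // CHANNEL_INTERLEAVE) % NUM_CHANNELS

# Two arrays whose base addresses differ by a multiple of the full
# interleave period (NUM_CHANNELS * CHANNEL_INTERLEAVE) start on the
# *same* channel, so element-i-with-element-i accesses always collide:
base_a = 0
base_b = 16 * CHANNEL_INTERLEAVE * NUM_CHANNELS
same = sum(channel_of(base_a + i) == channel_of(base_b + i)
           for i in range(0, 64 * 1024, 4))
print(same == 64 * 1024 // 4)  # True: every paired access hits one channel

# Offsetting one array by a single interleave unit removes the collisions:
base_c = base_b + CHANNEL_INTERLEAVE
same = sum(channel_of(base_a + i) == channel_of(base_c + i)
           for i in range(0, 64 * 1024, 4))
print(same)                    # 0
```

This is one way the "manual offset" mitigation described below can be reasoned about.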

Due to the lack of documentation, I don't entirely understand how it works, but it surely exists. In my experiments, I've observed a 20% performance degradation due to unlucky memory base addresses. This problem can be rather insidious - depending on the memory allocation size, the performance degradation may occur seemingly at random. Sometimes the default alignment of the memory allocator can also be too clever for its own good: when every array's base address is aligned to a large size, it increases the chance of channel/bank conflicts, sometimes making them happen 100% of the time. I also found that allocating a large pool of memory, then adding manual offsets to "misalign" the smaller arrays away from the same channel/bank, can help mitigate the problem.

The memory interleaving pattern can sometimes be tricky. For example, AMD's manual says Radeon HD 79XX-series GPUs have 12 memory channels - this is not a power of 2, so the channel mapping is far from intuitive without documentation, since it cannot be deduced from the memory address bits alone. Unfortunately, I found this is often poorly documented by GPU vendors, so it may require some trial and error. For example, AMD's OpenCL optimization manual covers only GCN hardware, and it doesn't provide any information for hardware newer than the Radeon HD 7970 - information about the newer GCN GPUs with HBM VRAM found in Vega, or the newer RDNA/CDNA architectures, is completely absent. However, AMD provides OpenCL extensions to report the channel and bank sizes of the hardware, which may help with experiments. On my Radeon VII / Instinct MI50, they're:

Global memory channels (AMD)                    128
Global memory banks per channel (AMD)           4
Global memory bank width (AMD)                  256 bytes

The huge number of channels is likely a result of the 4096-bit HBM2 memory.

AMD's Optimization Manual

AMD's old AMD APP SDK OpenCL Optimization Guide provides the following explanation:

2.1 Global Memory Optimization

[...] If two memory access requests are directed to the same controller, the hardware
serializes the access. This is called a channel conflict. Similarly, if two memory
access requests go to the same memory bank, hardware serializes the access.
This is called a bank conflict. From a developer’s point of view, there is not much
difference between channel and bank conflicts. Often, a large power of two stride
results in a channel conflict. The size of the power of two stride that causes a specific type of conflict depends on the chip. A stride that results in a channel
conflict on a machine with eight channels might result in a bank conflict on a
machine with four.
In this document, the term bank conflict is used to refer to either kind of conflict.

2.1.1 Channel Conflicts

The important concept is memory stride: the increment in memory address,
measured in elements, between successive elements fetched or stored by
consecutive work-items in a kernel. Many important kernels do not exclusively
use simple stride one accessing patterns; instead, they feature large non-unit
strides. For instance, many codes perform similar operations on each dimension
of a two- or three-dimensional array. Performing computations on the low
dimension can often be done with unit stride, but the strides of the computations
in the other dimensions are typically large values. This can result in significantly
degraded performance when the codes are ported unchanged to GPU systems.
A CPU with caches presents the same problem, large power-of-two strides force
data into only a few cache lines.

One solution is to rewrite the code to employ array transpositions between the
kernels. This allows all computations to be done at unit stride. Ensure that the
time required for the transposition is relatively small compared to the time to
perform the kernel calculation.

For many kernels, the reduction in performance is sufficiently large that it is
worthwhile to try to understand and solve this problem.

In GPU programming, it is best to have adjacent work-items read or write
adjacent memory addresses. This is one way to avoid channel conflicts.
When the application has complete control of the access pattern and address
generation, the developer must arrange the data structures to minimize bank
conflicts. Accesses that differ in the lower bits can run in parallel; those that differ
only in the upper bits can be serialized.

In this example:

for (ptr=base; ptr<max; ptr += 16KB)
    R0 = *ptr ;

where the lower bits are all the same, the memory requests all access the same bank on the same channel and are processed serially. This is a low-performance pattern to be avoided. When the stride is a power of
2 (and larger than the channel interleave), the loop above only accesses one
channel of memory.
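
The quoted loop can be checked against a toy mapping: with an assumed 2 KB interleave over 8 channels (illustrative numbers, not AMD's actual layout), a 16 KB stride is exactly the interleave period, so every iteration lands on one channel:

```python
CHANNEL_INTERLEAVE = 2048   # assumed bytes per channel before switching
NUM_CHANNELS = 8            # assumed channel count

def channel_of(addr):
    return (addr // CHANNEL_INTERLEAVE) % NUM_CHANNELS

# The loop from the manual: ptr starts at base and strides by 16 KB.
ptrs = range(0, 1 << 20, 16 * 1024)
print(len({channel_of(p) for p in ptrs}))   # 1  -> fully serialized

# A stride equal to the interleave size cycles through all channels instead.
ptrs = range(0, 1 << 20, CHANNEL_INTERLEAVE)
print(len({channel_of(p) for p in ptrs}))   # 8
```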

It's also worth noting that distributing memory accesses across all channels does not always help performance; it can degrade it instead. AMD warns that it can be better to access the same memory channel/bank within a workgroup: because the GPU runs many workgroups simultaneously, ideal memory interleaving is still achieved. On the other hand, having a single workgroup access multiple memory channels/banks degrades performance.

If every work-item in a work-group references consecutive memory addresses
and the address of work-item 0 is aligned to 256 bytes and each work-item
fetches 32 bits, the entire wavefront accesses one channel. Although this seems
slow, it actually is a fast pattern because it is necessary to consider the memory
access over the entire device, not just a single wavefront.

[...]

At any time, each compute unit is executing an instruction from a single
wavefront. In memory intensive kernels, it is likely that the instruction is a
memory access. Since there are 12 channels on the AMD Radeon HD 7970
GPU, at most 12 of the compute units can issue a memory access operation in
one cycle. It is most efficient if the accesses from 12 wavefronts go to different
channels. One way to achieve this is for each wavefront to access consecutive
groups of 256 = 64 * 4 bytes. Note, as shown in Figure 2.1, fetching 256 * 12
bytes in a row does not always cycle through all channels.
An inefficient access pattern is if each wavefront accesses all the channels. This
is likely to happen if consecutive work-items access data that has a large power
of two strides.

Read the original manual for more hardware implementation details, which are omitted here.
