GPU 共享内存库冲突

发布于 2024-10-07 06:01:28 字数 376 浏览 5 评论 0原文

我试图了解银行冲突是如何发生的。
我在全局内存中有一个大小为 256 的数组,并且在单个块中有 256 个线程,我想将该数组复制到共享内存。因此每个线程都会复制一个元素。

shared_a[threadIdx.x]=global_a[threadIdx.x]

这个简单的行动会导致银行冲突吗?

假设现在数组的大小大于线程数,所以我现在用它来将全局内存复制到共享内存:

tid = threadIdx.x;
for(int i=0;tid+i<N;i+=blockDim.x)
     shared_a[tid+i]=global_a[tid+i];

上面的代码是否会导致银行冲突?

I am trying to understand how bank conflicts take place.
I have an array of size 256 in global memory and I have 256 threads in a single block, and I want to copy the array to shared memory. Therefore every thread copies one element.

shared_a[threadIdx.x]=global_a[threadIdx.x]

Does this simple action result in a bank conflict?

Suppose now that the size of the array is larger than the number of threads, so I am now using this to copy the global memory to the shared memory:

tid = threadIdx.x;
for(int i=0;tid+i<N;i+=blockDim.x)
     shared_a[tid+i]=global_a[tid+i];

Does the above code result in a bank conflict?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

眼藏柔 2024-10-14 06:01:28

检查这一点的最佳方法是使用“Compute Visual Profiler”来分析您的代码;这与 CUDA 工具包一起提供。 GPU Gems 3 中也有一个很棒的部分 - “39.2.3 避免银行冲突”。

当同一 warp 中的多个线程访问同一存储体时,除非该 warp 的所有线程都访问同一 32 位字中的同一地址,否则会发生存储体冲突” - 首先有 16 个存储体每个 4 字节宽。因此,本质上,如果您半扭曲中的任何线程从共享内存库中的相同4字节读取内存,那么您将会遇到库冲突和序列化等问题。

好吧,所以您的第一个示例

首先假设您的数组是 int 类型(32 位字)。您的代码将这些整数保存到共享内存中,跨任何半扭曲,第 K 个线程保存到第 K 个内存条。例如,前半个 warp 的线程 0 将保存到第一个内存块中的 shared_a[0],线程 1 将保存到 shared_a[1],每个half warp 有 16 个线程,这些线程映射到 16 个 4 字节存储体。在下半个扭曲中,第一个线程现在将其值保存到再次位于第一个内存库中的shared_a[16]中。因此,如果您使用 4 字节字(例如 int、float 等),那么您的第一个示例将不会导致存储区冲突。如果您使用 1 字节字(例如 char),则在前半部分扭曲线程 0、1、2 和 3 都会将它们的值保存到共享内存的第一个存储体中,这将导致存储体冲突。

第二个示例

同样,这完全取决于您所使用的单词的大小,但在本示例中,我将使用 4 字节的单词。所以看一下前半部分的扭曲:

线程数 = 32

N = 64

线程 0:将写入 0, 31, 63
线程 1:将写入 1, 32

跨半扭曲的所有线程同时执行,因此对共享内存的写入不会导致存储体冲突。不过,我必须仔细检查一下这一点。

希望这有帮助,抱歉回复太多!

The best way to check this would be to profile your code using the "Compute Visual Profiler"; this comes with the CUDA Toolkit. Also there's a great section in GPU Gems 3 on this - "39.2.3 Avoiding Bank Conflicts".

"When multiple threads in the same warp access the same bank, a bank conflict occurs unless all threads of the warp access the same address within the same 32-bit word" - First thing there are 16 memory banks each 4bytes wide. So essentially, if you have any thread in a half warp reading memory from the same 4bytes in a shared memory bank, you're going to have bank conflicts and serialization etc.

OK so your first example:

First lets assume your arrays are say for example of the type int (a 32-bit word). Your code saves these ints into shared memory, across any half warp the Kth thread is saving to the Kth memory bank. So for example thread 0 of the first half warp will save to shared_a[0] which is in the first memory bank, thread 1 will save to shared_a[1], each half warp has 16 threads these map to the 16 4byte banks. In the next half warp, the first thread will now save its value into shared_a[16] which is in the first memory bank again. So if you use a 4byte word such int, float etc then your first example will not result in a bank conflict. If you use a 1 byte word such as char, in the first half warp threads 0, 1, 2 and 3 will all save their values to the first bank of shared memory which will cause a bank conflict.

Second example:

Again this will all depend on the size of the word you are using, but for the example I'll use a 4byte word. So looking at the first half warp:

Number of threads = 32

N = 64

Thread 0: Will write to 0, 31, 63
Thread 1: Will write to 1, 32

All threads across the half warp execute concurrently so the writes to shared memory shouldn't cause bank conflicts. I'll have to double check this one though.

Hope this helps, sorry for the huge reply!

素衣风尘叹 2024-10-14 06:01:28

在这两种情况下,线程都使用连续地址访问共享内存。它取决于共享内存的元素大小,但线程扭曲对共享内存的连续访问不会导致“小”元素大小的存储体冲突。

使用 NVIDIA Visual Profiler 分析此代码显示:元素大小小于 32 且为 4 的倍数(4, 8, 12, ... , 28)时,连续访问共享内存不会导致存储体冲突。然而,元素大小为 32 会导致库冲突。


Ljdawson 的回答包含一些过时的信息:

...如果使用 char 等 1 字节字,则在前半部分扭曲线程 0、1、2 和 3 都会将它们的值保存到共享内存的第一个存储体,这将导致存储体冲突。< /p>

这对于旧的 GPU 可能是正确的,但对于 cc >= 2.x 的最新 GPU,它们不会导致存储体冲突,这实际上是由于广播机制(链接)。以下引用来自 CUDA C编程指南 (v8.0.61) G3.3。共享内存

对 warp 的共享内存请求不会在访问同一 32 位字内任何地址的两个线程之间产生存储体冲突(即使这两个地址属于同一存储体):在这种情况下,对于读取访问,该字被广播到请求线程(单个事务中可以广播多个字),并且对于写访问,每个地址仅由一个线程写入(哪个线程执行写入未定义)。

这尤其意味着,如果按如下方式访问 char 数组,则不会出现库冲突,例如:

 extern __shared__ char 共享[];
   char 数据 = 共享[BaseIndex + tid];

In both cases threads access shared memory with consecutive address. It depends on the element size of shared memory, but consecutive access to shared memory by a warp of threads does not result in a bank conflict for "small" element sizes.

Profiling this code with NVIDIA Visual Profiler shows that for element size smaller than 32 and a multiple of 4 (4, 8, 12, ... , 28), consecutive access to the shared memory does not result in a bank conflict. Element size of 32, however, results in bank conflict.


Answer by Ljdawson contains some outdated information:

... If you use a 1 byte word such as char, in the first half warp threads 0, 1, 2 and 3 will all save their values to the first bank of shared memory which will cause a bank conflict.

This may be true for old GPUs, but for recent GPUs with cc >= 2.x, they don't cause bank conflicts, effectively due to the broadcast mechanism(link). Following quote is from CUDA C PROGRAMMING GUIDE (v8.0.61) G3.3. Shared Memory.

A shared memory request for a warp does not generate a bank conflict between two threads that access any address within the same 32-bit word (even though the two addresses fall in the same bank): In that case, for read accesses, the word is broadcast to the requesting threads (multiple words can be broadcast in a single transaction) and for write accesses, each address is written by only one of the threads (which thread performs the write is undefined).

This means, in particular, that there are no bank conflicts if an array of char is accessed as follows, for example:

   extern __shared__ char shared[];
   char data = shared[BaseIndex + tid];
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文