GPU shared memory bank conflicts
I am trying to understand how bank conflicts take place.
I have an array of size 256 in global memory and I have 256 threads in a single block, and I want to copy the array to shared memory. Therefore every thread copies one element.
shared_a[threadIdx.x] = global_a[threadIdx.x];
Does this simple action result in a bank conflict?
Suppose now that the size of the array is larger than the number of threads, so I am now using this to copy the global memory to the shared memory:
tid = threadIdx.x;
for (int i = 0; tid + i < N; i += blockDim.x)
    shared_a[tid + i] = global_a[tid + i];
Does the above code result in a bank conflict?
The best way to check this would be to profile your code using the "Compute Visual Profiler"; this comes with the CUDA Toolkit. Also there's a great section in GPU Gems 3 on this - "39.2.3 Avoiding Bank Conflicts".
"When multiple threads in the same warp access the same bank, a bank conflict occurs unless all threads of the warp access the same address within the same 32-bit word" - First off, there are 16 memory banks, each 4 bytes wide. So essentially, if any two threads in a half warp access different 32-bit words that fall in the same bank, those accesses get serialized and you have a bank conflict.
OK so your first example:
First, let's assume your array is of type int (a 32-bit word). Your code saves these ints into shared memory, and across any half warp the Kth thread saves to the Kth memory bank. For example, thread 0 of the first half warp saves to shared_a[0], which is in the first memory bank, and thread 1 saves to shared_a[1]; each half warp has 16 threads, and these map onto the 16 4-byte banks. In the next half warp, the first thread saves its value into shared_a[16], which is in the first memory bank again. So if you use a 4-byte word such as int or float, your first example will not result in a bank conflict. If you use a 1-byte word such as char, then in the first half warp threads 0, 1, 2 and 3 all save their values into the first bank of shared memory, which causes a bank conflict.

Second example:
Again this depends on the size of the word you are using, but for this example I'll use a 4-byte word. So looking at the loop with:

Number of threads = 32
N = 64

Thread 0: will write to shared_a[0] and shared_a[32]
Thread 1: will write to shared_a[1] and shared_a[33]

In each loop iteration the threads of a half warp write to 16 consecutive elements, which map to 16 distinct banks, so the writes to shared memory shouldn't cause bank conflicts.
Hope this helps, sorry for the huge reply!
In both cases the threads access shared memory at consecutive addresses. Whether this conflicts depends on the element size of the shared memory array, but consecutive access by a warp of threads does not result in a bank conflict for "small" element sizes.

Profiling this code with the NVIDIA Visual Profiler shows that for element sizes smaller than 32 bytes that are a multiple of 4 (4, 8, 12, ..., 28), consecutive access to shared memory does not result in a bank conflict. An element size of 32 bytes, however, does result in bank conflicts.
The answer by Ljdawson contains some outdated information:

This may be true for old GPUs, but on recent GPUs (cc >= 2.x) these accesses don't cause bank conflicts, effectively due to the broadcast mechanism (link). The following quote is from the CUDA C Programming Guide (v8.0.61), G.3.3 Shared Memory.