GPU shared memory bank conflicts
I am trying to understand how bank conflicts take place.
I have an array of size 256 in global memory and I have 256 threads in a single block, and I want to copy the array to shared memory. Therefore every thread copies one element.
shared_a[threadIdx.x] = global_a[threadIdx.x];
Does this simple action result in a bank conflict?
Suppose now that the size of the array is larger than the number of threads, so I am now using this to copy the global memory to the shared memory:
tid = threadIdx.x;
for (int i = 0; tid + i < N; i += blockDim.x)
    shared_a[tid + i] = global_a[tid + i];
Does the above code result in a bank conflict?
The best way to check this would be to profile your code using the "Compute Visual Profiler"; this comes with the CUDA Toolkit. Also there's a great section in GPU Gems 3 on this - "39.2.3 Avoiding Bank Conflicts".
"When multiple threads in the same warp access the same bank, a bank conflict occurs unless all threads of the warp access the same address within the same 32-bit word" - First off, there are 16 memory banks, each 4 bytes wide. So essentially, if any two threads in a half warp access different 32-bit words that fall in the same bank, those accesses get serialized and you have a bank conflict.
OK so your first example:
First, let's assume your array is of type int (a 32-bit word). Your code saves these ints into shared memory, and across any half warp the Kth thread saves to the Kth memory bank. For example, thread 0 of the first half warp saves to shared_a[0], which is in the first memory bank, and thread 1 saves to shared_a[1]; each half warp has 16 threads, and these map onto the 16 4-byte banks. In the next half warp, the first thread saves its value into shared_a[16], which is in the first memory bank again. So if you use a 4-byte word such as int or float, your first example will not result in a bank conflict. If you use a 1-byte word such as char, then in the first half warp threads 0, 1, 2 and 3 all save their values into the first bank of shared memory, which causes a bank conflict.

Second example:
Again this depends on the size of the word you are using, but for this example I'll use a 4-byte word. So looking at the loop with:

Number of threads = 32
N = 64

Thread 0: will write to shared_a[0] and shared_a[32]
Thread 1: will write to shared_a[1] and shared_a[33]

In each loop iteration the threads of a half warp write to 16 consecutive elements, which map to 16 distinct banks, so the writes to shared memory shouldn't cause bank conflicts.
Hope this helps, sorry for the huge reply!
In both cases the threads access shared memory at consecutive addresses. Whether this conflicts depends on the element size of the shared memory array, but consecutive access by a warp of threads does not result in a bank conflict for "small" element sizes.

Profiling this code with the NVIDIA Visual Profiler shows that for element sizes smaller than 32 bytes that are a multiple of 4 (4, 8, 12, ..., 28), consecutive access to shared memory does not result in a bank conflict. An element size of 32 bytes, however, does result in bank conflicts.
The answer by Ljdawson contains some outdated information:

This may be true for old GPUs, but on recent GPUs (cc >= 2.x) these accesses don't cause bank conflicts, effectively due to the broadcast mechanism (link). The following quote is from the CUDA C Programming Guide (v8.0.61), G.3.3 Shared Memory.