How is 2D shared memory arranged in CUDA?
I've always worked with linear shared memory (load, store, access neighbors), but I made a simple test in 2D to study bank conflicts, and the results have confused me.
The following code reads data from a one-dimensional global memory array into shared memory and copies it back from shared memory to global memory.
__global__ void update(int* gIn, int* gOut, int w) {
// shared memory space
__shared__ int shData[16][16];
// map from threadIdx/blockIdx to data position
int x = threadIdx.x + blockIdx.x * blockDim.x;
int y = threadIdx.y + blockIdx.y * blockDim.y;
// compute the global index into the one-dimensional array
int gid = x + y * w;
// load shared memory
shData[threadIdx.x][threadIdx.y] = gIn[gid];
// synchronize threads (not strictly needed here, since each thread reads back only the element it wrote)
__syncthreads();
// write data back to global memory
gOut[gid] = shData[threadIdx.x][threadIdx.y];
}
The Visual Profiler reported bank conflicts in shared memory. The following code avoids those conflicts (only the differences are shown):
// load shared memory
shData[threadIdx.y][threadIdx.x] = gIn[gid];
// write data back to global memory
gOut[gid] = shData[threadIdx.y][threadIdx.x];
This behavior has confused me, because in Programming Massively Parallel Processors: A Hands-on Approach we can read:
matrix elements in C and CUDA are placed into the linearly addressed locations according to the row major convention. That is, the elements of row 0 of a matrix are first placed in order into consecutive locations.
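That row-major convention can be verified directly. Below is a minimal sketch (the checkRowMajor kernel is hypothetical, added purely for illustration) that compares the address of shData[r][c] against the expected linear offset r*16 + c:

#include <cstdio>

__global__ void checkRowMajor() {
    __shared__ int shData[16][16];
    // Row-major convention: element (r, c) sits at linear word offset
    // r*16 + c from the start of the array.
    int r = 3, c = 5;
    int offset = (int)(&shData[r][c] - &shData[0][0]);
    printf("offset of shData[%d][%d] = %d (expected %d)\n",
           r, c, offset, r * 16 + c);
}

int main() {
    checkRowMajor<<<1, 1>>>();   // device printf needs compute capability >= 2.0
    cudaDeviceSynchronize();
    return 0;
}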
Is this related to the shared memory arrangement, or to the thread indexes? Am I missing something?
The kernel configuration is as follows:
// kernel configuration
dim3 dimBlock(16, 16, 1);
dim3 dimGrid(64, 64);
// Launching a grid of 64x64 blocks with 16x16 threads -> 1048576 threads
update<<<dimGrid, dimBlock>>>(d_input, d_output, 1024);
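For context, a minimal host-side harness for this configuration might look like the following. This is only a sketch under assumptions: the buffer size (1024x1024 ints, matching w = 1024 and the 64x64 grid of 16x16 blocks) and the round-trip check are illustrative, not from the original question.

#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

int main() {
    const int w = 1024, h = 1024;   // 64 blocks * 16 threads per dimension (assumed)
    const size_t bytes = (size_t)w * h * sizeof(int);

    // host buffers: input gets known values, output receives the round trip
    int* h_in  = (int*)malloc(bytes);
    int* h_out = (int*)malloc(bytes);
    for (int i = 0; i < w * h; ++i) h_in[i] = i;

    int *d_input, *d_output;
    cudaMalloc(&d_input, bytes);
    cudaMalloc(&d_output, bytes);
    cudaMemcpy(d_input, h_in, bytes, cudaMemcpyHostToDevice);

    dim3 dimBlock(16, 16, 1);
    dim3 dimGrid(64, 64);
    update<<<dimGrid, dimBlock>>>(d_input, d_output, w);

    cudaMemcpy(h_out, d_output, bytes, cudaMemcpyDeviceToHost);
    printf("round trip %s\n", memcmp(h_in, h_out, bytes) ? "FAILED" : "ok");

    cudaFree(d_input); cudaFree(d_output);
    free(h_in); free(h_out);
    return 0;
}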
Thanks in advance.
1 Answer
Yes, shared memory is arranged in row-major order, as you expected. So your [16][16] array is stored row-wise, something like this:

        bank 0   bank 1   ...   bank 15
row 0  [    0        1    ...       15 ]
row 1  [   16       17    ...       31 ]
  ...
row 15 [  240      241    ...      255 ]
Because there are sixteen 32-bit shared memory banks on pre-Fermi hardware, every integer entry in each column maps onto one shared memory bank. So how does that interact with your choice of indexing scheme?
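Before getting to that, the column-to-bank mapping just described can be spelled out with a tiny host-side sketch (the bankOf helper is hypothetical, just for illustration): every element of a fixed column lands in the same bank.

#include <cstdio>

// Hypothetical helper: bank of the 4-byte word at row-major position (r, c)
// of a [16][16] int array, given 16 banks of 32-bit words (pre-Fermi).
int bankOf(int r, int c) {
    int wordIndex = r * 16 + c;   // row-major linearization
    return wordIndex % 16;        // bank = word index modulo 16
}

int main() {
    // Every element of column 5 maps to bank 5, regardless of the row.
    for (int r = 0; r < 16; ++r)
        printf("shData[%2d][5] -> bank %d\n", r, bankOf(r, 5));
    return 0;
}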
The thing to keep in mind is that threads within a block are numbered in the equivalent of column-major order (technically, the x dimension of the structure is the fastest varying, followed by y, then z). So when you use this indexing scheme:

shData[threadIdx.x][threadIdx.y]
threads within a half-warp will be reading from the same column, which implies reading from the same shared memory bank, and bank conflicts will occur. When you use the opposite scheme:

shData[threadIdx.y][threadIdx.x]
threads within the same half-warp will be reading from the same row, which implies reading from each of the 16 different shared memory banks, so no conflicts occur.
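To make the two access patterns concrete, here is a small sketch (an illustration, not part of the original answer) that enumerates the banks touched by the first half-warp of a 16x16 block under each scheme:

#include <cstdio>

int main() {
    // Threads are numbered x-fastest, so for the first half-warp
    // (tid 0..15) of a 16x16 block: threadIdx.x = tid, threadIdx.y = 0.
    for (int tid = 0; tid < 16; ++tid) {
        int tx = tid, ty = 0;
        // scheme A: shData[threadIdx.x][threadIdx.y] -> word index tx*16 + ty
        int bankA = (tx * 16 + ty) % 16;
        // scheme B: shData[threadIdx.y][threadIdx.x] -> word index ty*16 + tx
        int bankB = (ty * 16 + tx) % 16;
        printf("tid %2d: scheme A -> bank %2d, scheme B -> bank %2d\n",
               tid, bankA, bankB);
    }
    // Scheme A hits bank 0 sixteen times (a 16-way conflict);
    // scheme B hits banks 0..15 once each (conflict-free).
    return 0;
}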