What is a bank conflict? (Doing CUDA/OpenCL programming)

Posted on 2024-09-25 10:15:04

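I have been reading the programming guides for CUDA and OpenCL, and I cannot figure out what a bank conflict is. They just dive into how to solve the problem without elaborating on the subject itself. Can anybody help me understand it? I have no preference whether the help is in the context of CUDA/OpenCL or about bank conflicts in computer science in general.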

于我来说 2024-10-02 10:15:04

For Nvidia (and AMD, for that matter) GPUs, the local memory (OpenCL's term; CUDA calls it shared memory) is divided into memory banks. Each bank can serve only one address at a time, so if several threads of a half-warp try to load or store data in the same bank, the accesses have to be serialized (this is a bank conflict). GT200 GPUs have 16 banks (Fermi has 32), and AMD GPUs have 16 or 32 banks (57xx or higher: 32, everything below: 16). The banks are interleaved with a granularity of 32 bits (so bytes 0-3 are in bank 1, 4-7 in bank 2, ..., 64-67 in bank 1 again, and so on). For a better visualization, it basically looks like this:

Bank    |      1      |      2      |      3      |     ...     |      16     |
Address |  0  1  2  3 |  4  5  6  7 |  8  9 10 11 |     ...     | 60 61 62 63 |
Address | 64 65 66 67 | 68 69 70 71 | 72 73 74 75 |     ...     |     ...     |
...

So if each thread in a half-warp accesses successive 32-bit values, every access falls in a different bank and there are no bank conflicts.

An exception to this rule (that every thread must access its own bank) is broadcasting: if all threads access the same address, the value is read only once and broadcast to all threads. (For GT200, all threads in the half-warp have to access the same address; IIRC, Fermi and AMD GPUs can do this for any number of threads accessing the same value.)

放肆 2024-10-02 10:15:04

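Shared memory that can be accessed in parallel is divided into modules (also called banks). If two memory locations (addresses) accessed at the same time fall in the same bank, a bank conflict occurs: the accesses are performed serially, and the advantage of parallel access is lost.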

今天小雨转甜 2024-10-02 10:15:04

In simple words, a bank conflict occurs when a memory access pattern fails to distribute accesses across the banks available in the memory system. The following example elaborates on the concept:

Let us suppose we have a two-dimensional 512x512 array of integers and our DRAM or memory system has 512 banks. By default the array data is laid out so that arr[0][0] goes to bank 0, arr[0][1] to bank 1, arr[0][2] to bank 2, ..., and arr[0][511] to bank 511. In general, arr[x][y] occupies bank number y. Now suppose some code (as shown below) accesses the data in column-major fashion, i.e. changing x while keeping y constant. The result is that all consecutive memory accesses hit the same bank, hence a bank conflict.

int arr[512][512];
for (int j = 0; j < 512; j++)      // outer loop: column index
  for (int i = 0; i < 512; i++)    // inner loop: row index
    arr[i][j] = 2 * arr[i][j];     // column-major processing: every access hits bank j

Such problems are usually avoided by compilers, either by buffering the array or by using a prime number of elements in the array, so that consecutive rows no longer start in the same bank.