How to use coalesced memory access
I have N threads executing simultaneously on the device, and they need M*N floats from global memory. What is the correct way to access global memory in a coalesced manner? And in this matter, how can shared memory help?
Comments (1)
Usually, good coalesced access can be achieved when neighbouring threads access neighbouring cells in memory. So, if tid holds the index of your thread, then accessing:

    arr[tid]          --- gives perfect coalescence
    arr[tid+5]        --- is almost perfect, probably misaligned
    arr[tid*4]        --- is not that good anymore, because of the gaps
    arr[random(0..N)] --- horrible!

I am talking from the perspective of a CUDA programmer, but similar rules apply elsewhere as well, even in simple CPU programming, although the impact is not that big there.
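A minimal kernel sketch contrasting these patterns (the kernel name, in, out and N are illustrative placeholders; the comments assume a one-dimensional launch where consecutive values of tid belong to the same warp):

    __global__ void accessPatterns(const float *in, float *out, int N)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= N)
            return;

        // Coalesced: thread k reads element k, so a warp covers one
        // contiguous, well-aligned block of memory.
        float a = in[tid];

        // Almost coalesced: still contiguous, but shifted by 5 elements,
        // so the warp's block may straddle a segment boundary (misaligned).
        float b = (tid + 5 < N) ? in[tid + 5] : 0.0f;

        // Strided: thread k reads element 4*k, so the warp spreads over
        // four times as much memory and needs more memory transactions.
        float c = (4 * tid < N) ? in[4 * tid] : 0.0f;

        out[tid] = a + b + c;
    }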
"But I have so many arrays everyone has about 2 or 3 times longer than the number of my threads and using the pattern like "arr[tid*4]" is inevitable. What may be a cure for this?"
If the offset is a multiple of some higher power of two (e.g. 16*x or 32*x), it is not a problem. So, if you have to process a rather long array in a for-loop, you can do something like this:
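A minimal sketch of such a loop, assuming a one-dimensional launch; the kernel name, arr, arraySize and the doubling operation are placeholders for the real per-element work:

    __global__ void processLongArray(float *arr, size_t arraySize)
    {
        // Total number of threads in the launch and this thread's global index.
        size_t nThreads = (size_t)gridDim.x * blockDim.x;
        size_t tid      = (size_t)blockIdx.x * blockDim.x + threadIdx.x;

        // In every iteration, consecutive threads touch consecutive elements,
        // and 'base' is a multiple of the total thread count, so each access
        // keeps the same alignment and stays coalesced.
        for (size_t base = 0; base < arraySize; base += nThreads)
            arr[base + tid] *= 2.0f;   // placeholder per-element work
    }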
(The above assumes that the array size is a multiple of the number of threads.)
So, if the number of threads is a multiple of 32, the memory access will be good.
Note again: I am talking from the perspective of a CUDA programmer. For different GPUs/environments you might need fewer or more threads for perfect memory access coalescence, but similar rules should apply.
Is "32" related to the warp size which access parallel to the global memory?
Although not directly, there is some connection. Global memory is divided into segments of 32, 64 and 128 bytes, which are accessed by half-warps. The more segments you have to touch for a given memory-fetch instruction, the longer it takes. For example, if the 16 threads of a half-warp read 16 consecutive 4-byte floats, the whole request fits in a single aligned 64-byte segment, while a stride-4 access spreads the same request over 256 bytes and therefore over several segments. You can read more details in the "CUDA Programming Guide"; there is a whole chapter on this topic: "5.3. Maximize Memory Throughput".
In addition, I have heard a little about using shared memory to localize memory accesses. Is this preferred for memory coalescing, or does it have its own difficulties?
Shared memory is much faster as it lies on-chip, but its size is limited. The memory is not segmented like global memory; you can access it almost randomly at no penalty. However, it is organized into banks of width 4 bytes (the size of a 32-bit int). The memory address that each thread accesses should be different modulo 16 (or 32, depending on the GPU). So, address [tid*4] will be much slower than [tid*5], because the former accesses only banks 0, 4, 8 and 12, while the latter accesses banks 0, 5, 10, 15, 4, 9, 14, ... (bank id = address modulo 16). Again, you can read more in the CUDA Programming Guide.
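A minimal sketch of this bank effect, following the 16-bank, half-warp model used in this answer (the kernel name and the 256-thread block size are illustrative assumptions):

    #define BLOCK 256

    __global__ void bankConflictDemo(float *out)
    {
        // Enough shared memory for the widest stride used below (BLOCK * 5 floats).
        __shared__ float buf[BLOCK * 5];

        unsigned tid = threadIdx.x;   // assumes blockDim.x == BLOCK

        // 4-way bank conflict: within a half-warp the word indices tid*4 map to
        // banks 0, 4, 8, 12, 0, 4, ... (index modulo 16), so the store is serialized.
        buf[tid * 4] = (float)tid;

        // Conflict-free: 5 is coprime with 16, so tid*5 hits a different bank for
        // every thread of the half-warp (banks 0, 5, 10, 15, 4, 9, 14, 3, ...).
        buf[tid * 5] = (float)tid;

        __syncthreads();

        // Read something back so the compiler cannot drop the shared-memory stores.
        out[tid] = buf[tid * 4] + buf[tid * 5];
    }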