GPU shared memory size is very small - what can I do about it?
The size of the shared memory ("local memory" in OpenCL terms) is only 16 KiB on most of today's nVIDIA GPUs.
I have an application in which I need to create an array of 10,000 integers, so the amount of memory needed to hold them is 10,000 * 4 bytes = 40 KB.
- How can I work around this?
- Is there any GPU that has more than 16 KiB of shared memory?
3 Answers
Think of shared memory as explicitly managed cache. You will need to store your array in global memory and cache parts of it in shared memory as needed, either by making multiple passes or some other scheme which minimises the number of loads and stores to/from global memory.
How you implement this will depend on your algorithm - if you can give some details of what it is exactly that you are trying to implement you may get some more concrete suggestions.
One last point - be aware that shared memory is shared between all threads in a block, so each thread gets far less than 16 KiB unless you have a single data structure that is common to all threads in the block.
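For a concrete picture of the multi-pass idea, here is a minimal CUDA sketch that sums the 10,000-element array by staging one tile of it at a time in shared memory (the kernel name, tile size, and host code are illustrative, not from the original answer):

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define TILE_SIZE 256   // elements staged in shared memory per pass (1 KiB)

// Illustrative kernel: sums a large array by caching one tile of it
// at a time in shared memory. Launch with one block of exactly
// TILE_SIZE threads.
__global__ void sumWithTiles(const int *in, int n, long long *out)
{
    __shared__ int tile[TILE_SIZE];
    long long acc = 0;

    for (int base = 0; base < n; base += TILE_SIZE) {
        int idx = base + threadIdx.x;
        tile[threadIdx.x] = (idx < n) ? in[idx] : 0;  // stage tile in shared memory
        __syncthreads();                              // wait until the tile is loaded

        if (threadIdx.x == 0)                         // serial sum, kept simple on purpose
            for (int i = 0; i < TILE_SIZE; ++i)
                acc += tile[i];
        __syncthreads();                              // done reading before the next load
    }
    if (threadIdx.x == 0)
        *out = acc;
}

int main()
{
    const int n = 10000;
    std::vector<int> h(n, 1);                         // 10,000 integers, ~40 KB
    int *d_in; long long *d_out;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, sizeof(long long));
    cudaMemcpy(d_in, h.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    sumWithTiles<<<1, TILE_SIZE>>>(d_in, n, d_out);

    long long result = 0;
    cudaMemcpy(&result, d_out, sizeof(long long), cudaMemcpyDeviceToHost);
    printf("sum = %lld (expected %d)\n", result, n);
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```

Only 1 KiB of shared memory is live at any moment, so the full 40 KB array never has to fit on-chip at once.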
All compute capability 2.0 and greater devices (most released in the last year or two) have 48KB of available shared memory per multiprocessor. That being said, Paul's answer is correct in that you likely will not want to load all 10K integers into a single multiprocessor.
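If you want to check what your own device offers, the standard cudaGetDeviceProperties call reports the compute capability and per-block shared memory size; a minimal sketch:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0
    printf("Compute capability:      %d.%d\n", prop.major, prop.minor);
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}
```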
You can try the
cudaFuncSetCacheConfig(nameOfKernel, cudaFuncCachePrefer{Shared, L1})
function. If you prefer L1 over Shared, then 48KB will go to L1 and 16KB will go to Shared.
If you prefer Shared over L1, then 48KB will go to Shared and 16KB will go to L1.
Usage:
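A minimal sketch, with a placeholder kernel named myKernel (any kernel of your own would go in its place):

```cuda
#include <cuda_runtime.h>

__global__ void myKernel(int *data)   // placeholder kernel for illustration
{
    data[threadIdx.x] += 1;
}

int main()
{
    int *d;
    cudaMalloc(&d, 32 * sizeof(int));
    cudaMemset(d, 0, 32 * sizeof(int));

    // Favor shared memory for this kernel: on compute capability 2.x
    // devices this requests the 48KB shared / 16KB L1 split.
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);

    myKernel<<<1, 32>>>(d);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```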