CUDA: different memory allocations
I am developing a small application using CUDA.
I have a huge 2D array (too large for shared memory) from which threads in all blocks will constantly read at random locations.
This 2D array is read-only.
Where should I allocate it: global memory, constant memory, or texture memory?
3 Answers
Depending on the size of your device's texture memory, you should place the array there. Texture memory is backed by a cache that exploits spatial locality: memory accesses are optimized when threads with consecutive identifiers read data elements stored relatively close together.
Moreover, this locality is implemented for 2D access patterns. So when each thread fetches an element of an array stored in texture memory, you are in exactly the 2D-access case the cache is designed for, and you take full advantage of the memory architecture.
Unfortunately, this memory is not that large, and with a huge array you might not be able to make your data fit in it. In that case you can't avoid using global memory.
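For illustration, here is a minimal sketch of that approach using the texture object API (CUDA 5.0+). The array dimensions, the `gatherKernel` name, and the `xs`/`ys` index arrays are made up for the example; this is just one way to set it up, not the only one.

```cuda
#include <cuda_runtime.h>
#include <cstdlib>

// Reads the 2D table through the texture cache. Coordinates are
// unnormalized element indices; +0.5f centers the fetch on a texel.
__global__ void gatherKernel(cudaTextureObject_t tex, float *out, int n,
                             const int *xs, const int *ys)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex2D<float>(tex, xs[i] + 0.5f, ys[i] + 0.5f);
}

int main()
{
    const int width = 4096, height = 4096;            // hypothetical size
    float *h_data = (float *)malloc(width * height * sizeof(float));
    // ... fill h_data with the read-only table ...

    // 1. Copy the data into a cudaArray, the storage texture fetches read from.
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray_t cuArray;
    cudaMallocArray(&cuArray, &desc, width, height);
    cudaMemcpy2DToArray(cuArray, 0, 0, h_data, width * sizeof(float),
                        width * sizeof(float), height, cudaMemcpyHostToDevice);

    // 2. Describe the resource and the sampling mode (no interpolation).
    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeArray;
    resDesc.res.array.array = cuArray;

    cudaTextureDesc texDesc = {};
    texDesc.addressMode[0] = cudaAddressModeClamp;
    texDesc.addressMode[1] = cudaAddressModeClamp;
    texDesc.filterMode = cudaFilterModePoint;   // return exact element values
    texDesc.readMode = cudaReadModeElementType;

    // 3. Create the texture object and pass it to the kernel.
    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, nullptr);

    // ... launch gatherKernel(tex, ...) here ...

    cudaDestroyTextureObject(tex);
    cudaFreeArray(cuArray);
    free(h_data);
    return 0;
}
```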
I agree with jHackTheRipper: a simple solution would be to use texture memory and then profile with the Compute Visual Profiler. There's a good set of slides from NVIDIA about the different memory types for image convolution; it shows that good shared memory usage and global reads were not much faster than using texture memory. In your case you should get some coalesced reads from texture memory that you wouldn't usually get when accessing random values in global memory.
If it's small enough to fit in constant or texture memory, I would just try all three.
One interesting option you haven't listed is mapped memory on the host. You can allocate memory on the host that is accessible from the device without explicitly transferring it to device memory. Depending on how much of the array you need to access, this could be faster than copying it to global memory and reading it from there.
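A minimal sketch of that zero-copy setup is below; the `lookupKernel` name and the flat `idx` indexing are just for illustration, and on devices with unified addressing the host pointer can often be passed to the kernel directly.

```cuda
#include <cuda_runtime.h>

// Reads the table straight from host memory over the PCIe bus; every
// access crosses the bus, so this pays off mainly when only a small,
// unpredictable subset of the array is actually touched.
__global__ void lookupKernel(const float *table, float *out, int n,
                             const int *idx)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = table[idx[i]];
}

int main()
{
    const size_t width = 4096, height = 4096;   // hypothetical size
    const size_t bytes = width * height * sizeof(float);

    // Mapping host allocations into the device address space must be
    // enabled before the CUDA context is created.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Page-locked host allocation that the GPU can address directly.
    float *h_table = nullptr;
    cudaHostAlloc(&h_table, bytes, cudaHostAllocMapped);
    // ... fill h_table with the read-only data ...

    // Device-side pointer aliasing the same memory; no cudaMemcpy needed.
    float *d_table = nullptr;
    cudaHostGetDevicePointer(&d_table, h_table, 0);

    // ... launch lookupKernel(d_table, ...) here ...

    cudaFreeHost(h_table);
    return 0;
}
```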