How to "stream" data in global memory?

Posted 2024-11-17 13:02:19

The codeproject.com showcase Part 2: OpenCL™ – Memory Spaces states that "Global memory should be considered as streaming memory [...]" and that "the best performance will be achieved when streaming contiguous memory addresses or memory access patterns that can exploit the full bandwidth of the memory subsystem."

My understanding of this sentence is that, for optimal performance, one should constantly fill and read global memory while the GPU is working on the kernels. But I have no idea how I would implement such a concept, and I am not able to recognize it in the (rather simple) examples and tutorials I've read.

Do you know a good example, or can you link to one?

Bonus question: Is there an analog of this in the CUDA framework?

2 Answers

国际总奸 2024-11-24 13:02:19

I agree with talonmies about his interpretation of that guideline: sequential memory accesses are fastest. It's pretty obvious (to any OpenCL-capable developer) that sequential memory accesses are the fastest, though, so it's funny that NVidia explicitly spells it out like that.

Your interpretation, although not what that document is saying, is also correct. If your algorithm allows it, it is best to upload asynchronously in reasonably sized chunks so the device can get started on the compute sooner, overlapping compute with DMA transfers to/from system RAM.
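Here is a minimal host-side sketch of that overlap, assuming an existing context, two in-order command queues (one for transfers, one for compute), and a hypothetical single-argument kernel `process` that consumes one chunk at a time. All names are illustrative; only standard OpenCL API calls are used, and error checking and event releases are omitted to keep it short:

```c
/* Double buffering: while the kernel processes chunk i in buf[i & 1], the
 * host enqueues a non-blocking write of chunk i+1 into the other buffer on a
 * separate transfer queue, so the DMA copy can overlap kernel execution. */
#include <CL/cl.h>
#include <stddef.h>

void process_in_chunks(cl_context ctx, cl_command_queue xfer_q,
                       cl_command_queue compute_q, cl_kernel process,
                       const float *src, size_t n_chunks, size_t chunk_elems)
{
    size_t bytes = chunk_elems * sizeof(float);
    cl_mem buf[2];
    cl_event write_done[2] = {0}, kernel_done[2] = {0};

    for (int b = 0; b < 2; ++b)
        buf[b] = clCreateBuffer(ctx, CL_MEM_READ_ONLY, bytes, NULL, NULL);

    /* Prime the pipeline with the first chunk. */
    clEnqueueWriteBuffer(xfer_q, buf[0], CL_FALSE, 0, bytes, src,
                         0, NULL, &write_done[0]);

    for (size_t i = 0; i < n_chunks; ++i) {
        int cur = (int)(i & 1), nxt = 1 - cur;

        /* The kernel for chunk i starts as soon as its own upload is done. */
        clSetKernelArg(process, 0, sizeof(cl_mem), &buf[cur]);
        clEnqueueNDRangeKernel(compute_q, process, 1, NULL,
                               &chunk_elems, NULL,
                               1, &write_done[cur], &kernel_done[cur]);

        /* Meanwhile, upload chunk i+1 into the other buffer; it only has to
         * wait for the kernel that last read that buffer (chunk i-1). */
        if (i + 1 < n_chunks) {
            cl_uint nwait = kernel_done[nxt] ? 1 : 0;
            clEnqueueWriteBuffer(xfer_q, buf[nxt], CL_FALSE, 0, bytes,
                                 src + (i + 1) * chunk_elems,
                                 nwait, nwait ? &kernel_done[nxt] : NULL,
                                 &write_done[nxt]);
        }
    }
    clFinish(compute_q);
    clFinish(xfer_q);
    for (int b = 0; b < 2; ++b) clReleaseMemObject(buf[b]);
}
```

Whether the copy and the kernel genuinely run concurrently depends on the device having an independent DMA engine; transferring from pinned host memory (for example, a staging buffer created with CL_MEM_ALLOC_HOST_PTR) usually helps the runtime perform a truly asynchronous copy.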

It is also helpful to have more than one wavefront/warp, so the device can interleave them to hide memory latency. Good GPUs are heavily optimized to be able to do this switching extremely fast to stay busy while blocked on memory.

无所谓啦 2024-11-24 13:02:19

"My understanding of this sentence is that, for optimal performance, one should constantly fill and read global memory while the GPU is working on the kernels."

That isn't really a correct interpretation.

Typical OpenCL devices (i.e. GPUs) have extremely high-bandwidth, high-latency global memory systems. This sort of memory system is highly optimized for contiguous or linear access. What the piece you quote is really saying is that OpenCL kernels should be designed to access global memory in the sort of contiguous fashion that is optimal for GPU memory. NVIDIA calls this sort of optimal, contiguous memory access "coalesced", and discusses memory access pattern optimization for its hardware in some detail in both its CUDA and OpenCL guides.
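To make the coalescing idea concrete, here is a small illustrative pair of OpenCL kernels; the kernel names and the stride parameter are invented for this sketch and are not from either guide:

```c
/* Coalesced: work-item gid touches element gid, so adjacent work-items in a
 * warp/wavefront read adjacent addresses and the hardware can merge them
 * into a few wide memory transactions. */
__kernel void copy_coalesced(__global const float *in, __global float *out)
{
    size_t gid = get_global_id(0);
    out[gid] = in[gid];
}

/* Strided: adjacent work-items read addresses `stride` elements apart, so
 * each read lands in a different memory segment and the hardware must issue
 * many narrow transactions, wasting most of the bus width.
 * (The caller must allocate `in` with global_size * stride elements.) */
__kernel void copy_strided(__global const float *in, __global float *out,
                           uint stride)
{
    size_t gid = get_global_id(0);
    out[gid] = in[gid * stride];
}
```

Roughly speaking, neighboring work-items of copy_coalesced hit the same wide memory transaction, while copy_strided with a large stride forces a separate transaction per work-item and achieves a small fraction of peak bandwidth.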
