What bandwidth means in CUDA and why it matters
The CUDA programming guide states that
"Bandwidth is one of the most important gating factors for performance. Almost all changes to code should be made in the context of how they affect bandwidth."
It goes on to calculate theoretical bandwidth, which is on the order of hundreds of gigabytes per second. I am at a loss as to why the number of bytes one can read from/write to global memory is a reflection of how well optimised a kernel is.
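(The calculation it walks through is essentially: theoretical bandwidth = memory clock × bus width in bytes × data-rate multiplier. If I remember the guide's example figures right, a GTX 280-class card with a 1107 MHz memory clock and a 512-bit DDR bus gives 1107 × 10^6 × (512 / 8) × 2 ≈ 141.7 GB/s.)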
If I have a kernel which does intensive computation on data stored in shared memory and/or registers, with only a single read from global memory at the start and a single write back at the end, surely the effective bandwidth will be small, while the kernel itself may be very efficient.
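Concretely, I mean something like this (a toy sketch; the kernel name and the arithmetic are made up):

__global__ void computeBound(float *out, const float *in, int iters, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = in[i];                 // single read from global memory
    for (int k = 0; k < iters; ++k)  // heavy arithmetic, all held in a register
        x = x * 1.0001f + 0.0001f;
    out[i] = x;                      // single write back to global memory
}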
Could anyone further explain bandwidth in this context?
Thanks
3 Answers
Almost all nontrivial computational kernels, in CPU and GPU land alike, are memory bound.
GPUs have very high computational intensity and throughput, but access to main memory is very slow and has high latency: a few hundred cycles per load/store versus four cycles for many arithmetic operations.
It sounds like your kernel is computation bound, so you're in luck. However, you still have to watch out for shared memory bank conflicts, which can serialize portions of the code unexpectedly.
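The classic place this bites is a matrix transpose staged through shared memory. A minimal sketch of the standard padding fix, assuming a 32×32 thread block and n a multiple of 32:

__global__ void transpose(float *out, const float *in, int n)
{
    // The +1 column of padding puts each row in a different bank, so the
    // column-wise read below no longer has every thread of a warp hitting
    // the same bank (which would serialize the access).
    __shared__ float tile[32][32 + 1];

    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * n + x];    // coalesced read
    __syncthreads();

    x = blockIdx.y * 32 + threadIdx.x;                 // swap block indices
    y = blockIdx.x * 32 + threadIdx.y;
    out[y * n + x] = tile[threadIdx.x][threadIdx.y];   // coalesced write
}

With a plain tile[32][32], the read tile[threadIdx.x][threadIdx.y] strides by 32 floats, landing all threads of a warp in the same bank.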
Most kernels are memory bound, so maximising memory throughput is critical. If you're lucky enough to have a compute-bound kernel then optimizing for computation is generally easier. You do need to look out for divergence, and you should still ensure you have enough threads to hide memory latency.
Check out the Advanced CUDA C presentation for more information, including some tips on how to compare your realised performance with theoretical performance. The CUDA Best Practices Guide also has some good information; it's available as part of the CUDA toolkit (download from the NVIDIA site).
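The comparison itself boils down to timing the kernel with CUDA events and dividing total bytes moved by elapsed time. A rough sketch (error checking omitted; the copy kernel is just a stand-in for your own):

#include <cstdio>
#include <cuda_runtime.h>

// Trivial copy kernel used as the measured workload.
__global__ void copyKernel(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

int main()
{
    const int n = 1 << 24;                 // 16M floats
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    copyKernel<<<(n + 255) / 256, 256>>>(d_out, d_in, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // Effective bandwidth = (bytes read + bytes written) / elapsed time.
    double gb = 2.0 * n * sizeof(float) / 1e9;
    printf("effective bandwidth: %.1f GB/s\n", gb / (ms / 1e3));

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}

Compare the printed number against the theoretical figure for your card to see how much headroom is left.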
Typically kernels are fairly small and simple and perform the same operation on a lot of data. You might have a bunch of kernels that you invoke in sequence to perform some more complex operation (think of it as a processing pipeline). Obviously the throughput of your pipeline will depend both on how efficient your kernels are and on whether you are limited by memory bandwidth in any way.
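A hypothetical sketch of such a pipeline (the two stages are made-up kernels; launches on the default stream execute in issue order, so the second stage sees the first one's output without explicit synchronization):

// Two made-up pipeline stages operating in place on the same buffer.
__global__ void scaleStage(float *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;
}

__global__ void offsetStage(float *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1.0f;
}

void runPipeline(float *d_buf, int n)
{
    int block = 256, grid = (n + block - 1) / block;
    scaleStage<<<grid, block>>>(d_buf, n);   // stage 1
    offsetStage<<<grid, block>>>(d_buf, n);  // stage 2 runs after stage 1
    cudaDeviceSynchronize();                 // wait before reading results
}

Note that between stages the data round-trips through global memory, which is exactly where memory bandwidth starts to limit the pipeline's overall throughput.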