How many cycles of memory latency does each type of memory access take in OpenCL/CUDA?
I looked through the programming guide and the best practices guide, and they mention that global memory access takes 400-600 cycles. They do not say much about the other memory types, such as texture cache, constant cache, and shared memory. Registers have zero memory latency.

I think the constant cache is the same as registers if all threads use the same address in the constant cache. About the worst case I am not so sure.

Is shared memory the same as registers as long as there are no bank conflicts? If there are, how does the latency unfold?

What about texture cache?
2 Answers
For the (Kepler) Tesla K20 the latencies are as follows:

How do I know? I ran the microbenchmarks described by the authors of Demystifying GPU Microarchitecture through Microbenchmarking. They provide similar results for the older GTX 280.

This was measured on a Linux cluster; the computing node where I ran the benchmarks was not being used by any other users and was not running any other processes. It is BULLX Linux with a pair of 8-core Xeons and 64 GB RAM, using nvcc 6.5.12. I changed sm_20 to sm_35 for compiling. There is also an operand cost chapter in the PTX ISA, but it is not very helpful; it just reiterates what you already expect, without giving precise figures.
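The dependent-load (pointer-chasing) technique those microbenchmarks rely on can be sketched roughly like this. This is a minimal illustration of the idea, not the authors' actual harness; the array size, stride, and iteration count here are arbitrary choices, and a real benchmark would warm caches and repeat the measurement:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Walk a dependent chain of global-memory loads and time it with clock64().
// Because each load depends on the result of the previous one, the loads
// cannot overlap, so cycles/iteration approximates the per-load latency.
__global__ void latency_kernel(const unsigned int *chain, int iters,
                               long long *cycles, unsigned int *sink)
{
    unsigned int j = 0;
    long long start = clock64();
    for (int i = 0; i < iters; ++i)
        j = chain[j];              // dependent load: serializes the loop
    long long stop = clock64();
    *cycles = stop - start;
    *sink = j;                     // keep the loads from being optimized away
}

int main()
{
    const int n = 1 << 20, iters = 1024, stride = 64;  // 256 B stride
    unsigned int *h = (unsigned int *)malloc(n * sizeof(unsigned int));
    for (int i = 0; i < n; ++i) h[i] = (i + stride) % n;  // fixed-stride chain

    unsigned int *d_chain, *d_sink; long long *d_cycles;
    cudaMalloc(&d_chain, n * sizeof(unsigned int));
    cudaMalloc(&d_sink, sizeof(unsigned int));
    cudaMalloc(&d_cycles, sizeof(long long));
    cudaMemcpy(d_chain, h, n * sizeof(unsigned int), cudaMemcpyHostToDevice);

    // A single thread is enough: we want raw latency, not throughput.
    latency_kernel<<<1, 1>>>(d_chain, iters, d_cycles, d_sink);
    long long cycles;
    cudaMemcpy(&cycles, d_cycles, sizeof(long long), cudaMemcpyDeviceToHost);
    printf("~%lld cycles per dependent global load\n", cycles / iters);

    cudaFree(d_chain); cudaFree(d_sink); cudaFree(d_cycles); free(h);
    return 0;
}
```

The same chain-walking kernel, pointed at shared or constant memory instead, is how the per-memory-type numbers are obtained.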
The latency to the shared/constant/texture memories is small and depends on which device you have. In general, though, GPUs are designed as a throughput architecture, which means that by creating enough threads the latency to the memories, including global memory, is hidden.
The reason the guides talk about the latency to global memory is that the latency is orders of magnitude higher than that of other memories, meaning that it is the dominant latency to be considered for optimization.
You mentioned constant cache in particular. You are quite correct that if all threads within a warp (i.e. group of 32 threads) access the same address then there is no penalty, i.e. the value is read from the cache and broadcast to all threads simultaneously. However, if threads access different addresses then the accesses must serialize since the cache can only provide one value at a time. If you're using the CUDA Profiler, then this will show up under the serialization counter.
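The broadcast-versus-serialization behavior can be shown with a small sketch. These are hypothetical kernels for illustration (the array name and sizes are made up), not code from the guides:

```cuda
__constant__ float coeffs[32];  // resides in constant memory

// Uniform access: every thread in the warp reads the same address, so the
// constant cache broadcasts one value to all 32 threads in a single step.
__global__ void uniform_access(float *out, int k)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = coeffs[k];                 // same address across the warp
}

// Divergent access: each thread in the warp reads a different address, so
// the cache serves one address at a time -- up to 32 serialized reads,
// which the profiler reports under the serialization counter.
__global__ void divergent_access(float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = coeffs[threadIdx.x % 32];  // 32 distinct addresses per warp
}
```

If an index varies per thread, plain global memory (or the texture path) is usually the better home for that data; constant memory pays off only for warp-uniform reads.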
Shared memory, unlike constant cache, can provide much higher bandwidth. Check out the CUDA Optimization talk for more details and an explanation of bank conflicts and their impact.
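As a quick sketch of bank conflicts, assuming the common layout where consecutive 32-bit words map to 32 consecutive banks (the kernel and padding trick below are illustrative, not from the talk):

```cuda
__global__ void bank_demo(float *out)
{
    __shared__ float tile[32 * 32];
    __shared__ float padded[32 * 33];     // one word of padding per row
    int tid = threadIdx.x;                // lane 0..31 within a warp

    // Conflict-free: consecutive threads hit consecutive banks,
    // so all 32 accesses complete in one transaction.
    tile[tid] = (float)tid;

    // 32-way conflict: a stride of 32 words lands every thread in the
    // warp on bank 0, so the 32 accesses are fully serialized.
    tile[tid * 32] = (float)tid;

    // Classic fix for strided/column access: pad the row length to 33
    // words so a stride of 33 cycles through all 32 banks.
    padded[tid * 33] = (float)tid;

    __syncthreads();
    out[tid] = tile[tid] + padded[tid * 33];
}
```

An n-way conflict roughly multiplies the access time by n, which is why column-wise walks over a 32-wide shared tile are the textbook case for the padding fix.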