CUDA profiling (interpreting gst/gld requests)
There used to be profiling counters in cudaprof for global memory (gst_coherent, gst_incoherent, gld_coherent, gld_incoherent) that were useful and clear to me because they told me how many uncoalesced global reads and writes I had.
Now there seem to be only "gst requests" and "gld requests". These are the total loads/stores per warp on mp 0. How do I determine whether I have uncoalesced reads/writes? I'm guessing that there would be fewer requests if the accesses were coalesced. Am I supposed to figure out how many I expect per thread and compare? Unfortunately, my kernel is too dynamic for that.
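As a rough illustration of the expected-request arithmetic (the kernels below are hypothetical, not from the question): assuming 4-byte loads and 128-byte memory segments, a warp of 32 threads reading consecutive floats touches a single segment, while a widely strided pattern can touch up to 32.

```
// Sketch of two access patterns and the per-warp traffic they imply.
__global__ void coalescedCopy(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];          // 32 threads x 4 bytes fit in one 128-byte
                                 // segment: roughly one transaction per warp load
}

__global__ void stridedCopy(const float* in, float* out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i * stride]; // with stride >= 32, each thread falls in a
                                 // different segment: up to 32 transactions
}
```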
The coherent/incoherent counters are relevant on sm_10/sm_11 devices, where accesses had to be aligned and coalesced to avoid pathological performance. On sm_12 and sm_13 the hardware attempts to coalesce accesses into segment transactions wherever possible, and on sm_2x the L1 cache provides a similar function, with the additional benefit of caching for the cases where coalescing is not possible.
Ideally you would have a feel for how much data you are reading and writing and compare that with the achieved performance; this will give you an idea of the efficiency. However, given that your kernel is very data-dependent, you should take a look at a couple of the presentations from GTC 2010 to understand the other information that is available in the profiler. I'd recommend the Fundamental Performance Optimizations for GPUs talk and, more importantly but following on from the first one, the Analysis-Driven Performance Optimization talk.
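As a minimal sketch of that comparison (the copy kernel and sizes here are hypothetical), effective bandwidth is just bytes moved divided by elapsed time, which you can then set against the device's theoretical peak:

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void copyKernel(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

int main()
{
    const int n = 1 << 24;                  // 16M floats
    const size_t bytes = n * sizeof(float);
    float *in, *out;
    cudaMalloc(&in, bytes);
    cudaMalloc(&out, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    copyKernel<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // One read plus one write per element; compare against the device's peak.
    double gbps = (2.0 * bytes / 1e9) / (ms / 1e3);
    printf("Effective bandwidth: %.1f GB/s\n", gbps);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

A large gap between this figure and the device's peak bandwidth is a strong hint that accesses are not being coalesced.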
You could also consider instrumenting your code manually with a few extra counters.
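One possible shape for that instrumentation (the counter name and the branch condition are hypothetical, and atomicAdd on global memory requires sm_11 or later):

```
__device__ unsigned int irregularCount;   // hypothetical counter; zero it before launch

__global__ void gatherKernel(const float* in, float* out, const int* idx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    int j = idx[i];               // data-dependent gather
    if (j != i)                   // tally accesses that break contiguity
        atomicAdd(&irregularCount, 1u);

    out[i] = in[j];
}
```

Zero the counter with cudaMemcpyToSymbol before the launch and read it back with cudaMemcpyFromSymbol afterwards; the ratio of irregular to total accesses gives a crude coalescing metric without relying on the profiler.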