NVidia CUDA: L2 cache and multiple kernel invocations
I'm wondering whether the L2 cache is freed between multiple kernel invocations. For example, I have a kernel that does some preprocessing on the data and a second kernel that uses it. Is it possible to achieve greater performance if the data size is less than 768 KB? I see no reason for the NVidia guys to implement it otherwise, but maybe I'm wrong. Does anybody have experience with that?
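To make the scenario concrete, here is a minimal sketch of the pattern I mean (the kernel names and the toy preprocessing step are just illustrative):

```
// Two back-to-back launches; the question is whether the second kernel can
// still find the first kernel's output in L2 when the working set is under
// the 768 KB L2 size.
__global__ void preprocess(float *data, int n)          // kernel 1: preprocessing
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f;                 // toy preprocessing step
}

__global__ void consume(const float *data, float *out, int n)  // kernel 2: uses the result
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = data[i] + 1.0f;
}

// Host side:
//   preprocess<<<blocks, threads>>>(d_data, n);
//   consume<<<blocks, threads>>>(d_data, d_out, n);    // same data, second launch
```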
1 Answer
Assuming you are talking about the L2 data cache in Fermi.
I think the caches are flushed after each kernel invocation. In my experience, running two consecutive launches of the same kernel with a lot of memory accesses (and L2 cache misses) doesn't make any substantial change to the L1/L2 cache statistics.
For your problem, I think, depending on the data dependencies, it may be possible to put the two stages into one kernel (with some synchronization) so that the second part of the kernel can reuse the data processed by the first part.
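A rough sketch of that fusion, assuming each block's slice of the data fits in shared memory (the kernel name and the stage operations are just placeholders):

```
// Fuse both stages into one kernel so stage 2 reuses stage-1 results from
// on-chip memory instead of going back to DRAM between two launches.
__global__ void preprocessAndUse(const float *in, float *out, int n)
{
    extern __shared__ float tile[];            // one tile of stage-1 results per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Stage 1: preprocess into shared memory (toy operation).
    tile[threadIdx.x] = (i < n) ? in[i] * 2.0f : 0.0f;

    __syncthreads();                           // block-wide sync between the two stages

    // Stage 2: reuse the stage-1 results without a round trip to global memory.
    if (i < n) out[i] = tile[threadIdx.x] + tile[blockDim.x - 1 - threadIdx.x];
}

// Launch with the tile as dynamic shared memory:
//   preprocessAndUse<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_out, n);
```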
Here is another trick: you know the GPU has, say, N SMs, so you can perform the first part using the first N * M1 blocks and the next N * M2 blocks for the second part. Use synchronization to make sure all the blocks in the first part finish at the same time (or almost). In my experience, the block scheduling order really is deterministic.
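A sketch of that block split; note that instead of relying on the scheduling order I use a grid-wide barrier from cooperative groups (CUDA 9+, so not available on Fermi, and it needs a cooperative launch), which is my own substitution for the synchronization described above. The kernel name and operations are illustrative:

```
// Sketch only: one launch where blocks 0..stage1Blocks-1 run stage 1 and the
// remaining blocks run stage 2. Requires cudaLaunchCooperativeKernel and the
// whole grid being resident on the device at once.
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void twoStage(float *data, int n, int stage1Blocks)
{
    cg::grid_group grid = cg::this_grid();

    if (blockIdx.x < stage1Blocks) {
        // Stage 1: preprocess (toy operation).
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    grid.sync();   // grid-wide barrier: all stage-1 writes are visible past this point

    if (blockIdx.x >= stage1Blocks) {
        // Stage 2: consume the data while it may still be resident in L2.
        int i = (blockIdx.x - stage1Blocks) * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1.0f;
    }
}
```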
Hope it helps.