Number of active warps in a GPU (Fermi)

Posted on 2024-11-24 04:18:58

I have a quick question about active warps on a GPU (I am mainly interested in Fermi).
For a given kernel, is the number of active warps in an SM at any cycle the same throughout the kernel's execution?
In my experiments, there is some correlation between the total number of active warps (over the whole execution) and the number of synchronizations in the kernel. Can anyone clarify this relationship?
Thanks

Comments (2)

牵你手 2024-12-01 04:18:58

The number of active warps can vary over time since:

  • Other thread blocks can start or finish on the same SM. With four warps per thread block, an SM holding one resident thread block has up to four active warps; with two or three resident thread blocks it has up to eight or twelve, respectively.
  • Once a warp reaches the end of its code, it no longer executes (naturally).
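The arithmetic in the first point can be sketched as follows. This is only an illustrative model, not profiler code: the function names are hypothetical, and the number of thread blocks actually resident on an SM depends on the kernel's register and shared memory usage, which the sketch takes as a given input.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def warps_per_block(threads_per_block):
    """Warps needed for one thread block (a partial warp still counts as one)."""
    return (threads_per_block + WARP_SIZE - 1) // WARP_SIZE


def max_active_warps(threads_per_block, resident_blocks):
    """Upper bound on active warps in an SM for a given number of resident blocks."""
    return warps_per_block(threads_per_block) * resident_blocks


# 128 threads per block means four warps per block, as in the example above.
for blocks in (1, 2, 3):
    print(blocks, "resident block(s):", max_active_warps(128, blocks), "warps at most")
```

As blocks finish and new ones are scheduled, `resident_blocks` changes, so the bound changes over the lifetime of the kernel.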

The active warps count for a whole program execution depends on a number of factors, but remember that it is incremented by the number of active warps on each cycle. So if you increase the number of syncs, which also increases the number of cycles each warp needs to execute the kernel, you would expect a higher active warps count.
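A toy model of such a counter may make this concrete. It assumes the counter simply adds the number of still-running warps on every cycle, as described above. The warp lifetimes and the 30 extra cycles attributed to synchronization are made-up numbers for illustration, not measurements.

```python
def active_warps_counter(warp_lifetimes):
    """Add up, cycle by cycle, how many warps are still active.

    warp_lifetimes: cycles each warp stays active, all starting at cycle 0
    (a simplifying assumption of this sketch).
    """
    total = 0
    for cycle in range(max(warp_lifetimes, default=0)):
        total += sum(1 for life in warp_lifetimes if life > cycle)
    return total


baseline = [100, 100, 120, 120]              # hypothetical lifetimes in cycles
with_syncs = [life + 30 for life in baseline]  # hypothetical stall added by extra syncs

print(active_warps_counter(baseline))    # 440
print(active_warps_counter(with_syncs))  # 560
```

In this model the counter equals the sum of the warp lifetimes, so anything that keeps warps alive for more cycles, such as waiting at a barrier, raises the total even though the number of warps is unchanged.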

Also note that some derived statistics in the profiler are approximate, since they often combine values from more than one run, so some variability is to be expected.

朮生 2024-12-01 04:18:58

The relationship between barrier synchronization and warps is explained in the paper
Demystifying GPU Microarchitecture through Microbenchmarking.
