CUDA warp divergence and clock cycles
I read that on an SM with 8 SPs, the 32 threads of a warp are mapped onto the 8 SPs, 8 threads at a time, during the execution of one instruction. Hence, one instruction for a warp is executed in 32/8 = 4 clock cycles.
If so, then suppose I have an if-else statement in the code. Suppose the "then" and "else" branch instructions each take one clock cycle to complete.
How many clock cycles would be needed to complete the if-else statement if divergence occurs? Is it 2, or should it be 2 x 4 = 8 (i.e. in the latter case, 2 cycles for each quarter-warp)?
Appreciate any clarifications!
Comments (1)
The granularity is 4 clock cycles. There are 4 instruction phases, and each of the 4 phases processes the same instruction for 8 threads (optionally masked if you have conditionals/branching), which is how you get 32 threads executing one instruction every 4 clock cycles. So for a divergent branch as in your example, you have a minimum of 4 clocks for one branch and a minimum of 4 clocks for the other branch, i.e. at least 8 clocks in total.