How does context switching work on a GPU?

Posted 2024-11-19 08:07:17 · 3 views · 0 comments

As far as I know, GPUs switch between warps to hide memory latency. But I wonder under what conditions a warp gets switched out. For example, if a warp performs a load and the data is already in the cache, is the warp switched out, or does it continue with the next computation? What happens if there are two consecutive adds?
Thanks


Comments (1)

墨小沫ゞ 2024-11-26 08:07:17


First of all, once a thread block is launched on a multiprocessor (SM), all of its warps are resident until they all exit the kernel. Thus a block is not launched until there are sufficient registers for all warps of the block, and until there is enough free shared memory for the block.
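To make the residency point concrete, here is a minimal sketch (not part of the original answer) that asks the CUDA runtime how many blocks of a given kernel can be resident on one SM at once, based on the kernel's register and shared-memory footprint. The `saxpy` kernel and the block size of 256 are placeholder assumptions chosen only for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel used only to query occupancy; its register and
// shared-memory usage (decided by the compiler) limits how many blocks
// can be resident on one SM at the same time.
__global__ void saxpy(float a, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    int blockSize = 256;   // threads per block (8 warps of 32)
    int blocksPerSM = 0;

    // Ask the runtime how many blocks of this kernel fit on one SM,
    // given its register / shared-memory requirements.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, saxpy,
                                                  blockSize, /*dynamicSMem=*/0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int warpsPerBlock = blockSize / prop.warpSize;
    printf("Resident blocks per SM: %d (%d resident warps per SM)\n",
           blocksPerSM, blocksPerSM * warpsPerBlock);
    return 0;
}
```

All of those resident warps stay on the SM until the block finishes; nothing is ever swapped out to make room for another block.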

So warps are never "switched out" -- there is no inter-warp context switching in the traditional sense of the word, where a context switch requires saving registers to memory and restoring them.

The SM does, however, choose instructions to issue from among all resident warps. In fact, the SM is more likely to issue two instructions in a row from different warps than from the same warp, no matter what type of instruction they are, regardless of how much ILP (instruction-level parallelism) there is. Not doing so would expose the SM to dependency stalls. Even "fast" instructions like adds have a non-zero latency, because the arithmetic pipeline is multiple cycles long. On Fermi, for example, the hardware can issue 2 or more warp-instructions per cycle (peak), and the arithmetic pipeline latency is ~12 cycles. Therefore you need multiple warps in flight just to hide arithmetic latency, not just memory latency.
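As a rough illustration of the "two consecutive adds" case from the question, here is a small sketch of my own (not from the answer) contrasting two dependent adds with two independent ones. With the dependent pair, a single warp would have to wait out the arithmetic pipeline latency between the two instructions, so the SM fills those cycles by issuing from other resident warps; with independent adds there is also ILP within the warp that the scheduler can exploit.

```cuda
// Dependent adds: the second add must wait for the first, so within one
// warp there is a pipeline-latency gap the scheduler covers by issuing
// instructions from other resident warps.
__global__ void compute_dependent(float* out, const float* in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float t = in[i] + 1.0f;   // add #1
        out[i]  = t + 2.0f;       // add #2 depends on add #1
    }
}

// Independent adds: neither add depends on the other, so they can be
// issued back to back from the same warp (ILP) in addition to being
// interleaved with instructions from other warps.
__global__ void compute_independent(float* a, float* b, const float* in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        a[i] = in[i] + 1.0f;
        b[i] = in[i] + 2.0f;
    }
}
```

Either way, no register state is saved or restored; the scheduler simply picks a ready instruction from whichever resident warp has one.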

In general, the details of warp scheduling are architecture dependent, not publicly documented, and pretty much guaranteed to change over time. The CUDA programming model is independent of the scheduling algorithm, and you should not rely on it in your software.
