CUDA warps and occupancy
I have always thought that the warp scheduler will execute one warp at a time, depending on which warp is ready, and this warp can be from any one of the thread blocks in the multiprocessor. However, in one of the Nvidia webinar slides, it is stated that "Occupancy = Number of warps running concurrently on a multiprocessor divided by maximum number of warps that can run concurrently". So more than one warp can run at one time? How does this work?
Thank you.
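
For reference, on current CUDA toolkits the ratio quoted from the slide can be computed with the runtime occupancy API. This is a minimal host-side sketch, assuming a placeholder kernel `myKernel` and a block size of 256 (both made up for illustration):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel; any __global__ function would do here.
__global__ void myKernel(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int blockSize = 256;          // assumed launch configuration
    int maxActiveBlocks = 0;
    // How many blocks of myKernel can be resident on one SM at once,
    // given its register and shared memory usage.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxActiveBlocks,
                                                  myKernel, blockSize, 0);

    int warpsPerBlock = blockSize / prop.warpSize;
    int residentWarps = maxActiveBlocks * warpsPerBlock;
    int maxWarpsPerSM = prop.maxThreadsPerMultiProcessor / prop.warpSize;

    // Occupancy = resident warps / maximum warps per SM.
    printf("Occupancy: %d / %d = %.2f\n", residentWarps, maxWarpsPerSM,
           (float)residentWarps / maxWarpsPerSM);
    return 0;
}
```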
"Running" might be better interpreted as "having state on the SM and/or instructions in the pipeline". The GPU hardware schedules up as many blocks as are available or will fit into the resources of the SM (whichever is smaller), allocates state for every warp they contain (ie. register file and local memory), then starts scheduling the warps for execution. The instruction pipeline seems to be about 21-24 cycles long, and so there are a lot of threads in various stages of "running" at any given time.
The first two generations of CUDA-capable GPUs (G80/90 and G200) retire instructions from only a single warp every four clock cycles. Compute 2.0 devices dual-issue instructions from two warps every two clock cycles, so there are two warps retiring instructions per clock. Compute 2.1 extends this by allowing what is effectively out-of-order execution: still only two warps per clock, but potentially two instructions from the same warp at a time. So the extra 16 cores per SM get used for instruction-level parallelism, still issued from the same shared scheduler.
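
As a toy illustration of the kind of code that benefits from that extra issue slot, the kernel below (names and constants are made up) keeps two accumulator chains with no data dependence on each other, so consecutive instructions are independent and a dual-issue scheduler has something to pair up; whether it actually does so depends on the hardware and the compiled code:

```cuda
// Two independent FMA chains per thread: instruction-level parallelism
// that a compute 2.1 (or later) scheduler can exploit.
__global__ void ilp_demo(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (2 * i + 1 < n) {
        float a = in[2 * i];
        float b = in[2 * i + 1];
        float acc0 = 0.0f, acc1 = 0.0f;
        for (int k = 0; k < 64; ++k) {
            acc0 = acc0 * 1.0001f + a;   // chain 0
            acc1 = acc1 * 1.0001f + b;   // chain 1, independent of chain 0
        }
        out[i] = acc0 + acc1;
    }
}
// Launched as usual, e.g. ilp_demo<<<grid, block>>>(d_in, d_out, n);
```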