Synchronization in GPUs
I have some questions about how GPUs perform synchronization.
As I understand it, when a warp encounters a barrier (say, in OpenCL) and knows that the other warps of the same work-group haven't reached it yet, it has to wait. But what exactly does that warp do during the waiting time?
Is it still an active warp? Or does it execute some kind of null operation?
I have also noticed that when there is a synchronization in the kernel, the instruction count increases. I wonder where this increment comes from. Is the synchronization broken down into that many smaller GPU instructions? Or do the idle warps perform some extra instructions?
Finally, I really wonder: is the cost added by a synchronization (say, barrier(CLK_LOCAL_MEM_FENCE)), compared to a kernel without one, affected by the number of warps in a work-group (or thread block)?
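For concreteness, here is a minimal sketch of the kind of kernel I mean (a simple local-memory reduction; the kernel and argument names are just placeholders, and it assumes a power-of-two work-group size):

```c
// Minimal OpenCL kernel sketch: each barrier(CLK_LOCAL_MEM_FENCE)
// forces all warps in the work-group to wait for one another.
// 'scratch' is a __local buffer sized by the host via clSetKernelArg.
__kernel void local_sum(__global const float *in,
                        __global float *out,
                        __local float *scratch)
{
    size_t lid = get_local_id(0);

    scratch[lid] = in[get_global_id(0)];

    // No warp may pass this point until every warp in the
    // work-group has written its value to local memory.
    barrier(CLK_LOCAL_MEM_FENCE);

    // Tree reduction in local memory; one barrier per step.
    // Assumes get_local_size(0) is a power of two.
    for (size_t s = get_local_size(0) / 2; s > 0; s >>= 1) {
        if (lid < s)
            scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (lid == 0)
        out[get_group_id(0)] = scratch[0];
}
```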
Thanks
1 Answer
An active warp is one that is resident on the SM, i.e. all the resources (registers etc.) have been allocated and the warp is available for execution provided it is schedulable. If a warp reaches a barrier before other warps in the same threadblock/work-group, it will still be active (it is still resident on the SM and all its registers are still valid), but it won't execute any instructions since it is not ready to be scheduled.
Inserting a barrier not only stalls execution but also acts as a barrier for the compiler: the compiler is not allowed to perform most optimisations across the barrier since this may invalidate the purpose of the barrier. This is the most likely reason you are seeing more instructions - without the barrier the compiler is able to perform more optimisations.
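For example (a minimal sketch of my own, not something the compiler is guaranteed to do in every case): in the kernel below, each work-item reads a value written by a different work-item, so the compiler may not cache local-memory values in registers or reorder accesses across the barrier, and the extra loads and fence instructions show up in the instruction count:

```c
// Sketch: every work-item reads the value written by its neighbour.
// The barrier makes the writes visible to the whole work-group and
// stops the compiler from moving or caching loads/stores across it.
__kernel void rotate_values(__global const float *in,
                            __global float *out,
                            __local float *scratch)
{
    size_t lid = get_local_id(0);
    size_t n   = get_local_size(0);

    scratch[lid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);   // compiler barrier as well as a
                                    // hardware synchronization point

    // Without the barrier, the compiler could legally assume that
    // scratch[lid] is the only slot this work-item depends on and
    // optimise the local-memory traffic away.
    out[get_global_id(0)] = scratch[(lid + 1) % n];
}
```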
The cost of a barrier is very dependent on what your code is doing, but each barrier introduces a bubble where all warps have to (effectively) become idle before they all start work again, so if you have a very large threadblock/work-group then there is potentially a bigger bubble than with a small block. The impact of the bubble depends on your code: if your code is very memory bound then the barrier will expose memory latencies that may previously have been hidden, but if it is more balanced then the effect may be less noticeable.
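One way to observe the bubble is to time the same kernel with and without the barrier. A sketch using OpenCL event profiling (it assumes a command queue created with CL_QUEUE_PROFILING_ENABLE; run it once per kernel variant and compare the results):

```c
#include <CL/cl.h>

/* Times one enqueue of 'kernel' in milliseconds. The queue must have
 * been created with CL_QUEUE_PROFILING_ENABLE. Call it for the build
 * with the barrier and for a variant without it, and compare. */
static double run_ms(cl_command_queue q, cl_kernel kernel,
                     size_t global, size_t local)
{
    cl_event ev;
    cl_ulong t0, t1;

    clEnqueueNDRangeKernel(q, kernel, 1, NULL, &global, &local,
                           0, NULL, &ev);
    clWaitForEvents(1, &ev);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                            sizeof t0, &t0, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                            sizeof t1, &t1, NULL);
    clReleaseEvent(ev);

    return (t1 - t0) * 1e-6;   /* nanoseconds -> milliseconds */
}
```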
This means that in a very memory-bound kernel you may be better off launching a larger number of smaller blocks, so that other blocks can be executing while one block is bubbling on a barrier. You would need to check that your occupancy actually increases as a result, and if you are sharing data between threads using the block shared memory then there is a trade-off to be had.
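In OpenCL terms this just means shrinking local_work_size while keeping the global size fixed, so more, smaller work-groups are in flight. A hypothetical sketch (queue and kernel are assumed to be set up already):

```c
/* Same total work, different work-group sizes: 1024 work-items as
 * 4 groups of 256 versus 16 groups of 64. Smaller groups give the
 * scheduler more independent groups to run while one of them is
 * sitting in a barrier bubble. */
size_t global      = 1024;
size_t local_big   = 256;   /* few large work-groups  */
size_t local_small = 64;    /* many small work-groups */

clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local_big,
                       0, NULL, NULL);
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local_small,
                       0, NULL, NULL);
```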