Can I use __syncthreads() after dropping threads?
Is it safe to use __syncthreads() in a block where I have intentionally dropped threads with return?
The documentation states that __syncthreads() must be called by every thread in the block or it will lead to a deadlock, but in practice I have never experienced that behavior.
Sample code:
__global__ void kernel(float* data, size_t size) {
// Drop excess threads if user put too many in kernel call.
// After the return, there are `size` active threads.
if (threadIdx.x >= size) {
return;
}
// ... do some work ...
__syncthreads(); // Is this safe?
// For the rest of the kernel, we need to drop one excess thread
// After the return, there are `size - 1` active threads
if (threadIdx.x + 1 == size) {
return;
}
// ... do more work ...
__syncthreads(); // Is this safe?
}
The answer to the short question is "No". Warp-level branch divergence around a __syncthreads() instruction will cause a deadlock and result in a kernel hang. Your code example is not guaranteed to be safe or correct. The correct way to implement the code is to restructure it so that the __syncthreads() instructions are executed unconditionally (see the sketch at the end of this answer).
EDIT: Just to add a bit of additional information which confirms this assertion,
__syncthreads() calls get compiled into the PTX bar.sync instruction on all architectures. The PTX 2.0 guide (p. 133) documents bar.sync and includes the following warning: in conditionally executed code, a bar instruction should only be used if it is known that all threads evaluate the condition identically (the warp does not diverge). So despite any assertions to the contrary, it is not safe to have conditional branching around a __syncthreads() call unless you can be 100% certain that every thread in any given warp follows the same code path and no warp divergence can occur.
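One way to restructure the question's kernel along these lines is sketched below (an illustrative sketch, not the answer's original code): the work is guarded with if blocks instead of early returns, so every thread in the block reaches both barriers.
__global__ void kernel(float* data, size_t size) {
    // Guard the work instead of returning early, so that all threads
    // (including the excess ones) reach the barriers below.
    if (threadIdx.x < size) {
        // ... do some work ...
    }
    __syncthreads(); // Executed unconditionally by every thread in the block.

    // Equivalent to the original "drop one more thread" condition.
    if (threadIdx.x + 1 < size) {
        // ... do more work ...
    }
    __syncthreads(); // Also executed unconditionally.
}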
Compute Capability 7.x (Volta) update:
With the introduction of Independent Thread Scheduling among threads in a warp, CUDA is finally more strict in practice, now matching the documented behavior: per the Programming Guide, __syncthreads() must be reached by all non-exited threads in the block.
Below is the previous answer, which rambled about pre-Volta behavior.
Update: This answer may not add anything on top of talonmies' (depending on your understanding of the subject, I suppose), but at the risk of being too verbose I'm presenting the information that helped me understand this better. Also, if you are not interested in how things might work "under the hood" or what might be possible beyond the official documentation, there's nothing to see here. That all said, I still don't recommend making assumptions beyond what is officially documented, especially in an environment that hopes to support multiple or future architectures. I primarily wanted to point out that while this is explicitly called out as bad practice by the CUDA Programming Guide, the actual behavior of __syncthreads() may be somewhat different from how it is described, and to me that is interesting. The last thing I want is to spread misinformation, so I'm open to discussion and revising my answer!
A few important facts
There is no TL;DR for this answer as there is too much potential for misinterpretation, but here are some relevant facts to start:
__syncthreads() behaves like a barrier for warps in a block rather than all of the threads in a block, although when used as advised it amounts to the same thing.
If any thread in a warp executes a bar instruction (e.g. from __syncthreads()), it is as if all the threads in the warp have.
When bar.sync is called (as generated by the intrinsic __syncthreads()), the arrival count for that block and barrier are incremented by the warp size. This is how the previous points are achieved.
When threads in a warp diverge around a __syncthreads(), the instruction will not cause the warp to stall and wait for the threads on the divergent paths. Branch execution is serialized, so only when the branches rejoin or the code terminates do the threads in the warp resynchronize. Until then, the branches run in sequence and independently. Again, only one thread in each warp of the block needs to hit __syncthreads() for execution to continue.
These statements are supported by official documentation and other sources.
Interpretation and documentation
Since __syncthreads() acts as a barrier for warps in a block rather than all of the threads in a block, as it is described in the Programming Guide, it seems that a simple early exit would be fine if at least one thread in each warp hits the barrier. (But that is not to say you can't cause deadlocks with the intrinsic!) This also supposes that __syncthreads() will always generate a simple bar.sync a; PTX instruction and that the semantics of that will not change either, so don't do this in production.
One interesting study that I came across actually investigates what happens when you go against the recommendations of the CUDA Programming Guide, and they found that while it is indeed possible to cause a deadlock by abusing __syncthreads() in conditional blocks, not all use of the intrinsic in conditional code will do so (Section D.1 in the paper). That finding is concordant with the bit of the PTX documentation quoted by talonmies.
It is clear from this why the optional thread count b in the bar.sync a{, b}; instruction must be a multiple of warp size -- whenever a single thread in a warp executes a bar instruction, the arrival count is incremented by the warp size, not by the number of threads in the warp that actually hit the barrier. Threads that terminate early (having followed a different path) are effectively counted as arrived anyway. Now, the next sentence in the quoted passage does say not to use __syncthreads() in conditional code unless "it is known that all threads evaluate the condition identically (the warp does not diverge)." This seems to be an overly strict recommendation (for the current architecture), meant to ensure that the arrival count actually reflects the real number of threads that hit the barrier. If at least one thread hitting the barrier increments the arrival count for the entire warp, you might really have a little more flexibility.
There is no ambiguity in the PTX documentation that the bar.sync a; instruction generated by __syncthreads() waits for all threads in the current cooperative thread array (block) to reach barrier a. However, the point is how "all threads" is presently determined: the arrival count is incremented in multiples of warp size whenever the barrier is hit (by default, when b is not specified). This part is not undefined behavior, at least not with Parallel Thread Execution ISA Version 4.2.
Keep in mind that there may be inactive threads in a warp even without a conditional -- "the last threads of a block whose number of threads is not a multiple of the warp size" (SIMT architecture notes). Yet __syncthreads() is not forbidden in such blocks.
Examples
Early exit version 1:
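A minimal sketch of what such an early-exit kernel might look like (illustrative only; the kernel name and the work done are placeholders, not the original code from this answer):
__global__ void early_exit_v1(float* data, size_t size) {
    if (threadIdx.x >= size) {
        return;                     // excess threads leave the kernel here
    }
    data[threadIdx.x] *= 2.0f;      // remaining threads do some work
    __syncthreads();                // only the remaining threads reach the barrier
    // ... more work by the remaining threads ...
}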
This will not deadlock if at least one thread per warp hits the sync, but a possible issue is order of serialization of the execution of divergent code paths. You can change around the above kernel to effectively swap the branches.
Early exit version 2:
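Again an illustrative sketch (not the original code), with the branches effectively swapped by guarding the work instead of returning:
__global__ void early_exit_v2(float* data, size_t size) {
    if (threadIdx.x < size) {
        data[threadIdx.x] *= 2.0f;  // in-range threads do some work
        __syncthreads();            // the barrier sits inside the conditional block
        // ... more work ...
    }
    // threads with threadIdx.x >= size skip the block (and the barrier) entirely
}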
Still no deadlock if you have at least one thread in the warp hit the barrier, but is the order of branch execution important in this case? I don't think so, but it's probably a bad idea to require a particular execution order.
The paper demonstrates this with a more involved example, compared to a trivial early exit, that also reminds us to be cautious about warp divergence. Here the first half of the warp (thread id tid on [0,15]) writes to some shared memory and executes __syncthreads(), while the other half (thread id tid on [16,31]) also executes __syncthreads() but then reads from the shared memory locations written by the first half of the warp; a sketch follows below. Ignoring the shared memory test at first, you might expect a deadlock at either barrier.
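A sketch reconstructing the kind of kernel described (the names and exact accesses are illustrative, not the paper's code):
__global__ void intra_warp_divergence(int* out) {
    __shared__ int smem[16];
    int tid = threadIdx.x;           // assume a single warp of 32 threads

    if (tid < 16) {
        smem[tid] = tid;             // first half writes shared memory
        __syncthreads();
    } else {
        __syncthreads();
        out[tid] = smem[tid - 16];   // second half reads what the first half wrote;
                                     // it may see stale values if the else branch
                                     // happens to be serialized first (pre-Volta)
    }
}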
There is no deadlock, indicating that __syncthreads() does not synchronize diverged threads within a warp. Divergent code paths are serialized in a warp, and it only takes one thread in a code path to make the call to __syncthreads() work at the per-warp level.
However, the shared memory bit shows where some unpredictable behavior can enter into this. The second half of the warp does not get the updated values from the first half, because branch divergence serialized execution of the warp and the else block was executed first. So the function doesn't work right, but it also shows that __syncthreads() does not synchronize divergent threads in a warp.
Summary
does not wait for all threads in a warp, and the arrival of a single thread in a warp effectively counts the entire warp as having reached the barrier. (Present architecture).It can be dangerous to use
__syncthreads()
in conditional code because of how divergent thread execution is serialized.Use the intrinsic in conditional code only if you understand how it works and how branch divergence (which occurs within a warp) is handled.
Note that I didn't say to go ahead and use
__syncthreads()
in a way inconsistent with how it is documented.简而言之:是的,它很安全。
In short: yes it's safe.
The accepted answer may well have been correct when written, but at least since Volta, it is wrong. The CUDA docs make clear that the __syncthreads call must be reached by all non-exited threads, which means one can exit early and not cause deadlock.
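A minimal sketch of the distinction on Volta and newer (illustrative; the kernel and names are placeholders, not taken from the CUDA docs):
__global__ void volta_sync_rules(float* data, size_t size) {
    // Fine on Volta+: exited threads are not waited for, so the remaining
    // (non-exited) threads can still synchronize after an early return.
    if (threadIdx.x >= size) {
        return;
    }
    data[threadIdx.x] += 1.0f;
    __syncthreads();

    // NOT fine: a non-exited thread that skips the barrier would hang the block, e.g.
    // if (threadIdx.x != 0) { __syncthreads(); }   // thread 0 never arrives -> deadlock
}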