Can __syncthreads() be used after threads have been dropped with return?

Posted 2024-11-19 10:17:21

Is it safe to use __syncthreads() in a block where I have purposefully dropped threads using return?

The documentation states that __syncthreads() must be called by every thread in the block or else it will lead to a deadlock, but in practice I have never experienced such behavior.

Sample code:

__global__ void kernel(float* data, size_t size) {
    // Drop excess threads if user put too many in kernel call.
    // After the return, there are `size` active threads.
    if (threadIdx.x >= size) {
        return;
    }

    // ... do some work ...

    __syncthreads(); // Is this safe?

    // For the rest of the kernel, we need to drop one excess thread
    // After the return, there are `size - 1` active threads
    if (threadIdx.x + 1 == size) {
        return;
    }

    // ... do more work ...

    __syncthreads(); // Is this safe?
}
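
For context, a hypothetical host-side launch in which the block is larger than size, so the first return actually drops threads (the allocation size and thread count below are illustrative, not from the original post):

// Hypothetical launch: 256 threads per block, but only 200 elements, so
// threads 200..255 take the first early return inside the kernel above.
float* d_data = nullptr;
size_t size = 200;
cudaMalloc((void**)&d_data, 256 * sizeof(float));
kernel<<<1, 256>>>(d_data, size);
cudaDeviceSynchronize();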

Comments (3)

格子衫的從容 2024-11-26 10:17:21

The answer to the short question is "No". Warp level branch divergence around a __syncthreads() instruction will cause a deadlock and result in a kernel hang. Your code example is not guaranteed to be safe or correct. The correct way to implement the code would be like this:

__global__ void kernel(...)
{
    if (tidx < N) {
        // Code stanza #1
    }

    __syncthreads();


    if (tidx < N) {
        // Code stanza #2
    }

    // etc
}

so that the __syncthreads() instructions are executed unconditionally.


EDIT: Just to add a bit of additional information which confirms this assertion, __syncthreads() calls get compiled into the PTX bar.sync instruction on all architectures. The PTX 2.0 guide (p. 133) documents bar.sync and includes the following warning:

Barriers are executed on a per-warp basis as if all the threads in a
warp are active. Thus, if any thread in a warp executes a bar
instruction, it is as if all the threads in the warp have executed the
bar instruction. All threads in the warp are stalled until the barrier
completes, and the arrival count for the barrier is incremented by the
warp size (not the number of active threads in the warp). In
conditionally executed code, a bar instruction should only be used if
it is known that all threads evaluate the condition identically (the
warp does not diverge). Since barriers are executed on a per-warp
basis, the optional thread count must be a multiple of the warp size.

So despite any assertions to the contrary, it is not safe to have conditional branching around a __syncthreads() call unless you can be 100% certain that every thread in any given warp follows the same code path and no warp divergence can occur.
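
As a concrete illustration of that pattern, here is a minimal, self-contained sketch (the kernel name, the bound N, and the per-stage work are invented for the example; the point is only that __syncthreads() sits outside the conditionals, so every thread in the block reaches it):

// Minimal sketch of the unconditional-barrier pattern. Only the work is
// guarded by the bounds check; the barrier itself is reached by all threads.
__global__ void two_stage_kernel(float* data, int N) {
    int tidx = blockIdx.x * blockDim.x + threadIdx.x;
    __shared__ float buf[256];                 // assumes blockDim.x == 256

    // Stanza #1: out-of-range threads just write a neutral value.
    buf[threadIdx.x] = (tidx < N) ? data[tidx] : 0.0f;

    __syncthreads();                           // executed unconditionally

    // Stanza #2: reads what neighbouring threads wrote in stanza #1.
    if (tidx < N) {
        float right = (threadIdx.x + 1 < blockDim.x) ? buf[threadIdx.x + 1] : 0.0f;
        data[tidx] = buf[threadIdx.x] + right;
    }
}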

梦幻之岛 2024-11-26 10:17:21

Compute Capability 7.x (Volta) update:

With the introduction of Independent Thread Scheduling among threads in a warp, CUDA is finally more strict in practice, now matching documented behavior. From the Programming Guide:

Although __syncthreads() has been consistently documented as synchronizing all threads in the thread block, Pascal and prior architectures could only enforce synchronization at the warp level. In certain cases, this allowed a barrier to succeed without being executed by every thread as long as at least some thread in every warp reached the barrier. Starting with Volta, the CUDA built-in __syncthreads() and PTX instruction bar.sync (and their derivatives) are enforced per thread and thus will not succeed until reached by all non-exited threads in the block. Code exploiting the previous behavior will likely deadlock and must be modified to ensure that all non-exited threads reach the barrier.

Below is the previous answer, which rambled about pre-Volta behavior.


Update: This answer may not add anything on top of talonmies' (depending on your understanding of the subject, I suppose), but at the risk of being too verbose I'm presenting the information that helped me understand this better. Also, if you are not interested in how things might work "under the hood" or what might be possible beyond the official documentation, there's nothing to see here. That all said, I still don't recommend making assumptions beyond what is officially documented, especially in an environment that hopes to support multiple or future architectures. I primarily wanted to point out that while this is explicitly called out as bad practice by the CUDA Programming Guide, the actual behavior of __syncthreads() may be somewhat different from how it is described and to me that is interesting. The last thing I want is to spread misinformation, so I'm open to discussion and revising my answer!


A few important facts

There is no TL;DR for this answer as there is too much potential for misinterpretation, but here are some relevant facts to start:

  • __syncthreads() behaves like a barrier for warps in a block rather than all of the threads in a block, although when used as advised it amounts to the same thing.
  • If any thread in a warp executes a PTX bar instruction (e.g. from __syncthreads()), it is as if all the threads in the warp have executed it.
  • When a bar.sync is called (as generated by the intrinsic __syncthreads()), the arrival count for that block and barrier is incremented by the warp size. This is how the previous points are achieved.
  • Thread divergence (multiple paths) is handled by serializing the execution of the branches. The order of serialization is a factor that can cause trouble.
  • The threads within a warp are not synchronized by __syncthreads(). The instruction will not cause the warp to stall and wait for the threads on divergent paths. Branch execution is serialized, so only when the branches rejoin or the code terminates do the threads in the warp then resynchronize. Until that, the branches run in sequence and independently. Again, only one thread in each warp of the block needs to hit __syncthreads() for execution to continue.

These statements are supported by official documentation and other sources.

Interpretation and documentation

Since __syncthreads() acts as a barrier for warps in a block rather than for all of the threads in a block (which is how the Programming Guide describes it), it seems that a simple early exit would be fine if at least one thread in each warp hits the barrier. (But that is not to say you can't cause deadlocks with the intrinsic!) This also supposes that __syncthreads() will always generate a simple bar.sync a; PTX instruction and that the semantics of that will not change either, so don't do this in production.

One interesting study that I came across actually investigates what happens when you go against the recommendations of the CUDA Programming Guide, and they found that while it is indeed possible to cause a deadlock by abusing __syncthreads() in conditional blocks, not all use of the intrinsic in conditional code will do so. From Section D.1 in the paper:

The Programming Guide recommends that syncthreads() be used in conditional code only if the condition evaluates identically across the entire thread block. The rest of this section investigates the behavior of syncthreads() when this recommendation is violated. We demonstrate that syncthreads() operates as a barrier for warps, not threads. We show that when threads of a warp are serialized due to branch divergence, any syncthreads() on one path does not wait for threads from the other path, but only waits for other warps running within the same thread block.

This statement is concordant with the bit of the PTX documentation quoted by talonmies. Specifically:

Barriers are executed on a per-warp basis as if all the threads in a warp are active. Thus, if any thread in a warp executes a bar instruction, it is as if all the threads in the warp have executed the bar instruction. All threads in the warp are stalled until the barrier completes, and the arrival count for the barrier is incremented by the warp size (not the number of active threads in the warp).

It is clear from this why the optional thread count b in the bar.sync a{, b}; instruction must be a multiple of warp size -- whenever a single thread in a warp executes a bar instruction the arrival count is incremented by the warp size, not the number of threads in the warp that actually hit the barrier. Threads that terminate early (followed a different path) were effectively counted as arrived anyway. Now, the next sentence in the quoted passage does then say not to use __syncthreads() in conditional code unless "it is known that all threads evaluate the condition identically (the warp does not diverge)." This seems to be an overly strict recommendation (for current architecture), meant to ensure that the arrival count actually reflects the real number of threads that hit the barrier. If at least one thread hitting the barrier increments the arrival count for the entire warp, you might really have a little more flexibility.

There is no ambiguity in the PTX documentation that the bar.sync a; instruction generated by __syncthreads() waits for all threads in the current cooperative thread array (block) to reach barrier a. However, the point is how "all threads" is presently determined: by incrementing the arrival count in multiples of warp size whenever the barrier is hit (by default, when b is not specified). This part is not undefined behavior, at least not with the Parallel Thread Execution ISA Version 4.2.

Keep in mind that there may be inactive threads in a warp even without a conditional -- "the last threads of a block whose number of threads is not a multiple of the warp size." (SIMT architecture notes). Yet __syncthreads() is not forbidden in such blocks.
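
To make that last point concrete, here is a hypothetical sketch in which the block size is deliberately not a multiple of the warp size, so the second warp has inactive lanes, yet the barrier remains legal:

// Hypothetical example: 48 threads per block, so the second warp has only
// 16 active lanes, but __syncthreads() is still perfectly valid here.
__global__ void odd_sized_block(int* out) {
    __shared__ int scratch[48];
    scratch[threadIdx.x] = threadIdx.x;
    __syncthreads();                  // fine even with a partially active warp
    out[threadIdx.x] = scratch[47 - threadIdx.x];
}

// launched as: odd_sized_block<<<1, 48>>>(d_out);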

Examples

Early exit version 1:

__global__ void kernel(...)
{
    if (tidx >= N)
        return;      // OK for <32 threads to hit this, but if ALL
                     // threads in a warp hit this, THEN you are deadlocked
                     // (assuming there are other warps that sync)

    __syncthreads(); // If at least one thread on this path reaches this, the 
                     // arrival count for this barrier is incremented by 
                     // the number of threads in a warp, NOT the number of 
                     // threads that reach this in the current warp.
}

This will not deadlock if at least one thread per warp hits the sync, but a possible issue is order of serialization of the execution of divergent code paths. You can change around the above kernel to effectively swap the branches.

Early exit version 2:

__global__ void kernel(...)
{
    if (tidx < N) {
        // do stuff

        __syncthreads();
    }
    // else return;
}

Still no deadlock if you have at least one thread in the warp hit the barrier, but is the order of branch execution important in this case? I don't think so, but it's probably a bad idea to require a particular execution order.

The paper demonstrates this with a more involved example than a trivial early exit, one that also reminds us to be cautious around warp divergence. Here the first half of the warp (thread ids tid in [0,15]) writes to some shared memory and executes __syncthreads(), while the other half (thread ids tid in [16,31]) also executes __syncthreads() but now reads from the shared memory locations written by the first half of the warp. Ignoring the shared memory test at first, you might expect a deadlock at either barrier.

// incorrect code to demonstrate behavior of __syncthreads
if (tid < 16) {
  shared_array[tid] = tid;
  __syncthreads();
}
else {
  __syncthreads();
  output[tid] = shared_array[tid % 16];
}

There is no deadlock, indicating that __syncthreads() does not synchronize diverged threads within a warp. Divergent code paths are serialized in a warp and it only takes one thread in a code path to make the call to __syncthreads() work at the per-warp level.

However, the shared memory bit shows where some unpredictable behavior can enter into this. The second half of the warp does not get the updated values from the first half because branch divergence serialized execution of the warp and the else block was executed first. So the function doesn't work right, but it also shows that __syncthreads() does not synchronize divergent threads in a warp.
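
For contrast, a corrected sketch (following the unconditional-barrier pattern from the accepted answer, and reusing the names from the snippet above) hoists the barrier out of the divergent branch:

// corrected sketch: the barrier is reached unconditionally by the whole block
if (tid < 16) {
  shared_array[tid] = tid;
}
__syncthreads();                      // no divergence around the barrier
if (tid >= 16) {
  output[tid] = shared_array[tid % 16];
}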

Summary

__syncthreads() does not wait for all threads in a warp, and the arrival of a single thread in a warp effectively counts the entire warp as having reached the barrier (on pre-Volta architectures).

It can be dangerous to use __syncthreads() in conditional code because of how divergent thread execution is serialized.

Use the intrinsic in conditional code only if you understand how it works and how branch divergence (which occurs within a warp) is handled.

Note that I didn't say to go ahead and use __syncthreads() in a way inconsistent with how it is documented.

娇纵 2024-11-26 10:17:21

In short: yes it's safe.

The accepted answer may well have been correct when written, but at least since Volta, it is wrong. CUDA docs make clear that the __syncthreads call must be reached by all non-exited threads, which means one can exit early and not cause deadlock.
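
As an illustration of why the pattern from the question is fine on Volta and later hardware, here is a minimal sketch (the kernel and variable names are invented for the example; it assumes a single block with blockDim.x >= size and enough dynamic shared memory for the tile):

// Early-exit pattern from the question, made concrete. On Volta+ the barrier
// waits only for threads that have not exited, so the returns do not deadlock it.
__global__ void neighbor_sum(const float* in, float* out, size_t size) {
    extern __shared__ float tile[];

    // Drop excess threads if the launch used more threads than elements.
    if (threadIdx.x >= size) {
        return;
    }

    tile[threadIdx.x] = in[threadIdx.x];

    __syncthreads();              // reached by all non-exited threads

    // Drop one more thread: the last element has no right-hand neighbor.
    if (threadIdx.x + 1 == size) {
        return;
    }

    out[threadIdx.x] = tile[threadIdx.x] + tile[threadIdx.x + 1];
}

// hypothetical launch: neighbor_sum<<<1, 256, 256 * sizeof(float)>>>(d_in, d_out, size);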
