CUDA: synchronizing threads

Posted 2024-08-09 03:49:43

Almost everywhere I read about programming with CUDA, there is mention of how important it is that all of the threads in a warp do the same thing.
In my code I have a situation where I can't avoid a certain condition. It looks like this:

// some math code, calculating d1, d2
if (d1 < 0.5)
{
    buffer[x1] += 1;  // buffer is in the global memory
}
if (d2 < 0.5)
{
    buffer[x2] += 1;
}
// some more math code.

Some of the threads might enter one of the conditions, some might enter both, and others might not enter either.

Now, in order to make all the threads get back to "doing the same thing" again after the conditions, should I synchronize them after the conditions using __syncthreads()? Or does this somehow happen automagically?
Can two threads end up not doing the same thing because one of them is one operation behind, thus ruining it for everyone? Or is there some behind-the-scenes effort to get them doing the same thing again after a branch?

Comments (4)

单挑你×的.吻 2024-08-16 03:49:43

Within a warp, no threads will "get ahead" of any others. If there is a conditional branch and it is taken by some threads in the warp but not others (a.k.a. warp "divergence"), the other threads will just idle until the branch is complete and they all "converge" back together on a common instruction. So if you only need within-warp synchronization of threads, that happens "automagically."

But different warps are not synchronized this way. So if your algorithm requires that certain operations be complete across many warps then you'll need to use explicit synchronization calls (see the CUDA Programming Guide, Section 5.4).
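
For synchronization within a block, that explicit call is __syncthreads(). A minimal sketch of the pattern, with a hypothetical kernel and array names invented for illustration (not taken from the question):

// Hypothetical example: every thread stages a value in shared memory, then
// reads a value written by a thread that may live in a *different* warp of
// the same block. The __syncthreads() barrier is what makes that safe.
__global__ void neighbourSum(const float *in, float *out, int n)
{
    __shared__ float tile[256];                    // assumes blockDim.x == 256
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;    // write phase

    __syncthreads();   // block-wide barrier: all writes to tile are now visible

    if (i < n) {
        int next = (threadIdx.x + 1) % blockDim.x; // element owned by another thread/warp
        out[i] = tile[threadIdx.x] + tile[next];   // read phase
    }
}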


EDIT: reorganized the next few paragraphs to clarify some things.

There are really two different issues here: Instruction synchronization and memory visibility.

  • __syncthreads() enforces instruction synchronization and ensures memory visibility, but only within a block, not across blocks (CUDA Programming Guide, Appendix B.6). It is useful for write-then-read on shared memory, but is not appropriate for synchronizing global memory access.

  • __threadfence() ensures global memory visibility but doesn't do any instruction synchronization, so in my experience it is of limited use (but see sample code in Appendix B.5).

  • Global instruction synchronization is not possible within a kernel. If you need f() done on all threads before calling g() on any thread, split f() and g() into two different kernels and call them serially from the host.

  • If you just need to increment shared or global counters, consider using the atomic increment function atomicInc() (Appendix B.10). In the case of your code above, if x1 and x2 are not globally unique (across all threads in your grid), non-atomic increments will result in a race condition, similar to the last paragraph of Appendix B.2.4 (see the sketch below).
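
To make that last point concrete, here is a sketch of the question's increments done atomically. It uses atomicAdd(), a close relative of atomicInc(), and it assumes buffer is an int array and that d1, d2, x1, x2 come from per-thread input arrays; the question's actual "math code" is elided:

// Sketch only: the increments from the question done as atomic
// read-modify-writes, so two threads hitting the same x1 or x2 cannot
// lose an update. The types and input arrays are assumptions.
__global__ void accumulate(int *buffer,
                           const float *d1, const float *d2,
                           const int *x1, const int *x2, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;

    if (d1[i] < 0.5f)
        atomicAdd(&buffer[x1[i]], 1);   // atomic increment in global memory
    if (d2[i] < 0.5f)
        atomicAdd(&buffer[x2[i]], 1);
}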

Finally, keep in mind that any operations on global memory, and synchronization functions in particular (including atomics), are bad for performance.

Without knowing the problem you're solving it is hard to speculate, but perhaps you can redesign your algorithm to use shared memory instead of global memory in some places. This will reduce the need for synchronization and give you a performance boost.
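
One common shape for such a redesign, sketched under the assumption that buffer is a counter array small enough to fit in shared memory (which may not hold for the real problem): each block accumulates into a private shared-memory copy and merges it into global memory once at the end.

#define NBINS 256   // assumed size of the counter array; must fit in shared memory

__global__ void accumulateShared(int *buffer,
                                 const float *d1, const float *d2,
                                 const int *x1, const int *x2, int n)
{
    __shared__ int local[NBINS];

    // zero this block's private counters
    for (int b = threadIdx.x; b < NBINS; b += blockDim.x)
        local[b] = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (d1[i] < 0.5f) atomicAdd(&local[x1[i]], 1);   // cheap shared-memory atomics
        if (d2[i] < 0.5f) atomicAdd(&local[x2[i]], 1);
    }
    __syncthreads();

    // one global atomic per counter per block, instead of up to two per thread
    for (int b = threadIdx.x; b < NBINS; b += blockDim.x)
        if (local[b] != 0)
            atomicAdd(&buffer[b], local[b]);
}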

何时共饮酒 2024-08-16 03:49:43

From section 6.1 of the CUDA Best Practices Guide:

Any flow control instruction (if, switch, do, for, while) can significantly affect the instruction throughput by causing threads of the same warp to diverge; that is, to follow different execution paths. If this happens, the different execution paths must be serialized, increasing the total number of instructions executed for this warp. When all the different execution paths have completed, the threads converge back to the same execution path.

So, you don't need to do anything special.

怪我太投入 2024-08-16 03:49:43

The answer to your question is no. You don't need to do anything special.
Anyway, you can avoid the branches altogether; instead of your code you can do something like this:

buffer[x1] += (d1 < 0.5);
buffer[x2] += (d2 < 0.5);

You should check if you can use shared memory and access global memory in a coalesced pattern. Also be sure that you DON'T want to write to the same index in more than 1 thread.
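
For reference, "coalesced" just means that consecutive threads of a warp touch consecutive addresses, so the hardware can combine their accesses into a few wide transactions. A small illustrative sketch (kernel and array names invented):

// Coalesced: thread i touches element i, so a warp's 32 accesses fall in a
// few contiguous memory segments.
__global__ void scaleCoalesced(float *data, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = s * data[i];
}

// Not coalesced: consecutive threads touch addresses 'stride' elements apart,
// so the same warp needs many separate memory transactions.
__global__ void scaleStrided(float *data, float s, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        data[i] = s * data[i];
}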

亚希 2024-08-16 03:49:43

In Gabriel's response:

"Global instruction synchronization is not possible within a kernel. If you need f() done on all threads before calling g() on any thread, split f() and g() into two different kernels and call them serially from the host."

What if the reason you need f() and g() in the same thread is because you're using register memory, and you want register or shared data from f to get to g?
That is, for my problem, the whole reason for synchronizing across blocks is that data from f is needed in g - and breaking out into a separate kernel would require a large amount of additional global memory to transfer register data from f to g, which I'd like to avoid.
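
For context, the two-kernel split being discussed looks roughly like the sketch below (f_kernel, g_kernel and the staging array are invented placeholders). The global tmp buffer is exactly the extra memory this comment wants to avoid, since register contents cannot survive across kernel launches:

// Sketch of the split: f writes its per-thread result to a global staging
// buffer, the kernel boundary acts as the grid-wide synchronization point,
// and g reads the staged results back. The bodies stand in for the real f/g.
__global__ void f_kernel(const float *in, float *tmp, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        tmp[i] = in[i] * in[i];                  // placeholder for f()
}

__global__ void g_kernel(const float *tmp, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tmp[i] + tmp[(i + 1) % n];      // g() may read other threads' f() results
}

void run(const float *d_in, float *d_out, float *d_tmp, int n)
{
    int block = 256, grid = (n + block - 1) / block;
    f_kernel<<<grid, block>>>(d_in, d_tmp, n);   // all of f() completes...
    g_kernel<<<grid, block>>>(d_tmp, d_out, n);  // ...before any of g() starts (same stream)
}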
