Busy spinning in CUDA

Posted on 2024-12-09 19:47:28

How can I implement a busy spin mechanism of the form

while(variable == 0);

where variable is set to 1 by some other CUDA thread after some event has occurred?

I tried writing it exactly as above, but the loop seems to be ignored: the calling thread runs straight past it without waiting. I'm absolutely sure the value is 0 at that point.
Also, if I write:

while(variable == 0) __threadfence();

so as not to risk the variable being cached, the thread blocks indefinitely, even though the variable is eventually set to 1.
This is very strange behavior to me, since the same code on the CPU behaves correctly.

Edit: Oddly, this seems to work correctly if I have blocks of 1 thread each, but not if I have several threads within one block. So threads from one block can see writes done by threads from other blocks, but not writes done by threads from the same block. Strange...

Comments (1)

救星 2024-12-16 19:47:28

Busy-spinning requires a lot of attention and you have to be really careful about it!

Keep in mind that the 32 threads forming a warp work in perfect sync. When the warp reaches a branch, the threads that do not take it are disabled until the threads that do take it have left the branch.
That is why busy-spinning within a warp can lead to a deadlock: 31 threads keep spinning in the loop, waiting forever for the single, disabled thread to do its work.
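
One way to stay clear of that situation is to let a single thread per block do the spinning and have the rest of the block wait at a barrier. The following is only a minimal sketch of that idea, not a definitive implementation: the names block_wait, wait_demo and run_block_wait_demo are made up for illustration, and it assumes that the thread setting the flag lives in a different block and that both blocks are resident on the GPU at the same time.

```cuda
#include <cuda_runtime.h>

// All names here (block_wait, wait_demo, run_block_wait_demo) are hypothetical.

// Only thread 0 spins; every other thread of the block waits at the barrier
// instead of inside the loop, so no thread spins on a thread of its own warp.
__device__ void block_wait(volatile int *flag)
{
    if (threadIdx.x == 0) {
        while (*flag == 0) { }   // released by a thread in another block
    }
    __syncthreads();             // the rest of the block waits here
}

__global__ void wait_demo(volatile int *flag, int *out)
{
    if (blockIdx.x == 0) {
        if (threadIdx.x == 0) *flag = 1;   // the "event"
    } else {
        block_wait(flag);
        out[threadIdx.x] = 1;              // reached only after the flag is set
    }
}

// Host helper: returns how many threads of block 1 got past the wait.
int run_block_wait_demo()
{
    const int threads = 64;
    int *d_flag = nullptr, *d_out = nullptr;
    cudaMalloc(&d_flag, sizeof(int));
    cudaMalloc(&d_out, threads * sizeof(int));
    cudaMemset(d_flag, 0, sizeof(int));
    cudaMemset(d_out, 0, threads * sizeof(int));

    // Assumption: two blocks of 64 threads are resident simultaneously.
    wait_demo<<<2, threads>>>(d_flag, d_out);

    int h_out[threads];
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_flag);
    cudaFree(d_out);

    int passed = 0;
    for (int i = 0; i < threads; ++i) passed += h_out[i];
    return passed;
}
```

The important design choice is that the spinning thread never waits on a thread of its own warp; if the setter were in the same warp as the spinner, the deadlock described above could still occur.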

Secondly, if you try to synchronise between blocks, you must be sure that both blocks are actually running at the same time. In theory, you don't know how many blocks are resident at once; in practice, you can read the specs of your GPU and launch only as many blocks as it can hold simultaneously (there are also some bugs in the driver and/or hardware, which can cause problems too).
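
Instead of reading the limit off a spec sheet, the runtime can be asked for an estimate. The sketch below uses cudaGetDeviceProperties and cudaOccupancyMaxActiveBlocksPerMultiprocessor from the CUDA runtime API; spin_kernel and max_coresident_blocks are hypothetical names, and the result is only an upper bound that assumes no dynamic shared memory and that nothing else is occupying the device.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel standing in for whatever kernel does the spinning.
__global__ void spin_kernel(volatile int *flag)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) *flag = 1;
}

// Upper bound on how many blocks of spin_kernel can be resident at once on
// the current device, for the given block size and no dynamic shared memory.
int max_coresident_blocks(int threadsPerBlock)
{
    int device = 0;
    cudaGetDevice(&device);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);

    int blocksPerSm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSm, spin_kernel, threadsPerBlock, 0);

    return blocksPerSm * prop.multiProcessorCount;
}
```

Launching no more than max_coresident_blocks(threadsPerBlock) blocks makes it plausible, though not guaranteed, that every block a spinning thread waits on has actually started.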

Thirdly, you have to remember that the CUDA compiler tries to optimise. You have to declare your shared or global variable as 'volatile' to ensure that it is actually re-read from memory on every iteration of the loop.
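
Putting the three points together, a flag hand-off between two blocks might look like the sketch below. It is an illustration under stated assumptions rather than a definitive implementation: the names producer_consumer and run_flag_demo are hypothetical, each block has a single thread so nothing spins on a warp-mate, and the two blocks are assumed to be resident at the same time.

```cuda
#include <cuda_runtime.h>

// Hypothetical names throughout: producer_consumer, run_flag_demo.
__global__ void producer_consumer(volatile int *flag, volatile int *payload,
                                  int *result)
{
    if (blockIdx.x == 0) {
        *payload = 42;      // write the data first
        __threadfence();    // order the payload write before the flag write
        *flag = 1;          // signal the waiting block
    } else {
        while (*flag == 0) { }  // volatile: the load is repeated every iteration
        __threadfence();        // order the flag read before the payload read
        *result = *payload;
    }
}

// Host helper: returns the value the waiting block read after the spin.
int run_flag_demo()
{
    int *d_flag = nullptr, *d_payload = nullptr, *d_result = nullptr;
    cudaMalloc(&d_flag, sizeof(int));
    cudaMalloc(&d_payload, sizeof(int));
    cudaMalloc(&d_result, sizeof(int));
    cudaMemset(d_flag, 0, sizeof(int));
    cudaMemset(d_payload, 0, sizeof(int));
    cudaMemset(d_result, 0, sizeof(int));

    // Two blocks of one thread each: the spinner and the setter are never
    // in the same warp.
    producer_consumer<<<2, 1>>>(d_flag, d_payload, d_result);

    int h_result = -1;
    cudaMemcpy(&h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_flag);
    cudaFree(d_payload);
    cudaFree(d_result);
    return h_result;
}
```

Without the volatile qualifier the compiler is free to load the flag once and keep it in a register, which would turn the loop into either no wait at all or an endless one.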
