CUDA: stop all other threads
I have a problem that seems solvable by enumerating all possible solutions and then picking the best one. To do so, I devised a backtracking algorithm that enumerates candidates and stores the best solution found so far. It works fine.
Now, I want to port this algorithm to CUDA. To that end, I created a procedure that generates a number of distinct base cases, which should then be processed in parallel on the GPU. If one of the CUDA threads finds an optimal solution, all the other threads can, of course, stop their work.
So, I want something like the following: the thread that finds the optimal solution should stop all running CUDA threads of my program, thus finishing the computation.
After some quick searching, I found that threads can only communicate if they are in the same block. (So I suppose it's impossible to stop threads in other blocks.)
The only method I could think of is to have a dedicated flag optimum_found, which is checked at the beginning of every kernel. If an optimal solution is found, this flag is set to 1, so all future threads know that they do not have to work. But of course, threads already running do not notice this flag if they do not check it at every iteration.
So, is there a possibility to stop all remaining CUDA threads?
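A minimal sketch of the flag idea described above: a flag in global memory that each thread polls inside its search loop and sets atomically once it finds the optimum. The names MAX_STEPS and search_step are placeholders for the real backtracking code, not part of the question.

```cuda
#include <cuda_runtime.h>

#define MAX_STEPS 100000          // placeholder iteration bound

__device__ int optimum_found = 0; // 0 = keep searching, 1 = stop

// Placeholder: one step of the backtracking search for this thread;
// returns true if this step produced an optimal solution.
__device__ bool search_step(int tid, int step) { return false; }

__global__ void search_kernel(void)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    for (int step = 0; step < MAX_STEPS; ++step) {
        // Volatile read so the flag is re-fetched rather than cached.
        if (*(volatile int *)&optimum_found)
            return;               // another thread already won

        if (search_step(tid, step)) {
            atomicExch(&optimum_found, 1); // tell everyone else to stop
            return;
        }
    }
}
```

How promptly other threads stop depends on how often they reach the flag check; checking once per iteration trades some branching overhead for a quicker shutdown.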
3 Answers
I think that your method of having a dedicated flag could work, provided that it is a memory location in global memory. That way you can check it, as you said, at the beginning of each kernel call.

Kernel calls should generally be relatively short anyway, so letting the other threads in a batch finish even though one of them has already found the optimal solution shouldn't hurt your performance too much.

That said, I am fairly sure there is no CUDA call that can kill off other actively executing threads.
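A host-side sketch of this batched approach, assuming a __device__ flag like the one the question describes: the work is split into short kernel launches, and the flag is copied back between launches so the host stops issuing new batches once the optimum is found. The kernel body and batch count are placeholders.

```cuda
#include <cuda_runtime.h>

__device__ int optimum_found = 0;

__global__ void search_batch(int batch) { /* ... search work ... */ }

int main(void)
{
    const int num_batches = 64;   // placeholder batch count
    int host_flag = 0;

    for (int b = 0; b < num_batches && !host_flag; ++b) {
        search_batch<<<128, 256>>>(b);
        cudaDeviceSynchronize();
        // Copy the device flag back; stop launching once it is set.
        cudaMemcpyFromSymbol(&host_flag, optimum_found, sizeof(int));
    }
    return 0;
}
```

With short batches, the worst case is that one extra batch runs to completion after the optimum is found, which bounds the wasted work.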
I think Ian has the right idea here. Optimum performance comes from minimal memory transfers and minimal branching. Writing to global memory and checking flags (branching) goes against the CUDA Best Practices Guide and will reduce your speedup.
You might want to look at callbacks. The main CPU thread can make sure all batches run in the right order. CPU callback threads (read: postprocessing) can handle the extra bookkeeping, call the relevant API functions, and dispose of the per-batch data... This feature can be found in the CUDA samples and works on compute capability 2.x. Hope this helps.
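One way the callback idea could look, as a sketch: after each batch's kernel and an async copy of a result flag have been enqueued on a stream, cudaStreamAddCallback registers a host function that runs when that batch finishes, and the enqueueing thread stops issuing further batches once the callback signals success. The BatchState struct and the usage comment are illustrative, not from the answer.

```cuda
#include <cuda_runtime.h>

struct BatchState {
    int host_flag;                 // copy of the device-side result flag
    volatile bool stop;            // read by the enqueueing CPU thread
};

// Runs on a CPU thread once all prior work on the stream has finished.
static void CUDART_CB on_batch_done(cudaStream_t stream,
                                    cudaError_t status, void *userData)
{
    BatchState *s = (BatchState *)userData;
    if (status == cudaSuccess && s->host_flag)
        s->stop = true;            // signal: do not enqueue more batches
}

// Usage (inside the launch loop, after the kernel and an async copy of
// the device flag into state.host_flag have been enqueued on `stream`):
//     cudaStreamAddCallback(stream, on_batch_done, &state, 0);
```

Note that a callback cannot interrupt work already running on the GPU; it only lets the host react as soon as a batch completes instead of blocking on a synchronize.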