Inter-block barrier on CUDA
I want to implement an inter-block barrier on CUDA, but I'm running into a serious problem.
I cannot figure out why it does not work.
#include <iostream>
#include <cstdlib>
#include <ctime>

#define SIZE 10000000
#define BLOCKS 100

using namespace std;

struct Barrier {
    int *count;

    __device__ void wait() {
        atomicSub(count, 1);
        while(*count)
            ;
    }

    Barrier() {
        int blocks = BLOCKS;
        cudaMalloc((void**) &count, sizeof(int));
        cudaMemcpy(count, &blocks, sizeof(int), cudaMemcpyHostToDevice);
    }

    ~Barrier() {
        cudaFree(count);
    }
};

__global__ void sum(int* vec, int* cache, int *sum, Barrier barrier)
{
    int tid = blockIdx.x;
    int temp = 0;
    while(tid < SIZE) {
        temp += vec[tid];
        tid += gridDim.x;
    }
    cache[blockIdx.x] = temp;

    barrier.wait();

    if(blockIdx.x == 0) {
        for(int i = 0; i < BLOCKS; ++i)
            *sum += cache[i];
    }
}

int main()
{
    int* vec_host = (int *) malloc(SIZE * sizeof(int));
    for(int i = 0; i < SIZE; ++i)
        vec_host[i] = 1;

    int *vec_dev;
    int *sum_dev;
    int *cache;
    int sum_gpu = 0;

    cudaMalloc((void**) &vec_dev, SIZE * sizeof(int));
    cudaMemcpy(vec_dev, vec_host, SIZE * sizeof(int), cudaMemcpyHostToDevice);
    cudaMalloc((void**) &sum_dev, sizeof(int));
    cudaMemcpy(sum_dev, &sum_gpu, sizeof(int), cudaMemcpyHostToDevice);
    cudaMalloc((void**) &cache, BLOCKS * sizeof(int));
    cudaMemset(cache, 0, BLOCKS * sizeof(int));

    Barrier barrier;
    sum<<<BLOCKS, 1>>>(vec_dev, cache, sum_dev, barrier);
    cudaMemcpy(&sum_gpu, sum_dev, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(vec_dev);
    cudaFree(sum_dev);
    cudaFree(cache);
    free(vec_host);

    return 0;
}
In fact, even if I rewrite wait() as follows:

__device__ void wait() {
    while(*count != 234124)
        ;
}

the program exits normally. But I would expect an infinite loop in this case.
Answers (3)
Unfortunately, what you want to achieve (inter-block communication/synchronization) isn't strictly possible in CUDA. The CUDA programming guide states that "thread blocks are required to execute independently: It must be possible to execute them in any order, in parallel or in series." The reason for this restriction is to allow flexibility in the thread block scheduler, and to allow the code to agnostically scale with the number of cores. The only supported inter-block synchronization method is to launch another kernel: kernel launches (within the same stream) are implicit synchronization points.
Your code violates the block independence rule because it implicitly assumes that your kernel's thread blocks execute concurrently (cf. in parallel). But there's no guarantee that they do. To see why this matters to your code, let's consider a hypothetical GPU with only one core. We'll also assume that you only want to launch two thread blocks. Your spinloop kernel will actually deadlock in this situation. If thread block zero is scheduled on the core first, it will loop forever when it gets to the barrier, because thread block one never has a chance to update the counter. Because thread block zero is never swapped out (thread blocks execute to their completion) it starves thread block one of the core while it spins.
Some folks have tried schemes such as yours and have seen success because the scheduler happened to serendipitously schedule blocks in such a way that the assumptions worked out. For example, there was a time when launching as many thread blocks as a GPU has SMs meant that the blocks were truly executed concurrently. But they were disappointed when a change to the driver or CUDA runtime or GPU invalidated that assumption, breaking their code.
For your application, try to find a solution that doesn't depend on inter-block synchronization, because (barring a significant change to the CUDA programming model) it just isn't possible.
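For illustration, here is a minimal sketch of the kernel-launch approach applied to your summation; the kernel names and host calls are hypothetical, and it assumes the same device buffers you already allocate in main():

// Sketch only: the in-kernel barrier is replaced by a second kernel launch;
// the launch boundary (within the same stream) is the synchronization point.
__global__ void partialSums(const int* vec, int* cache, int size)
{
    int tid = blockIdx.x;
    int temp = 0;
    while (tid < size) {
        temp += vec[tid];
        tid += gridDim.x;
    }
    cache[blockIdx.x] = temp;   // each block writes its own partial sum
}

__global__ void finalSum(const int* cache, int* sum, int blocks)
{
    int total = 0;
    for (int i = 0; i < blocks; ++i)
        total += cache[i];
    *sum = total;
}

// Host side, replacing the single launch in main():
//   partialSums<<<BLOCKS, 1>>>(vec_dev, cache, SIZE);
//   finalSum<<<1, 1>>>(cache, sum_dev, BLOCKS);   // runs only after partialSums completes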
Block-to-block synchronization is possible. See this paper.
The paper doesn't go into great detail on how it works, but it relies on __syncthreads() to create the pause barrier for the current block while waiting for the other blocks to reach the sync point.
One item the paper doesn't note is that synchronization is only possible if the number of blocks is small enough, or the number of SMs is large enough, for the task at hand. That is, if you have 4 SMs and try to sync 5 blocks, the kernel will deadlock.
With their approach I've been able to spread a long serial task across many blocks, easily saving 30% of the time compared to a single-block approach. In other words, the block sync worked for me.
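To make the idea concrete, here is a rough, single-use sketch of the kind of software barrier such papers describe (the function and variable names are mine, not from the paper). As noted above, it only avoids deadlock if every block is resident on the GPU at the same time:

// One-shot grid-wide barrier: 'arrived' is a device counter the host
// initializes to 0 before the kernel launch.
__device__ void globalBarrier(volatile int* arrived, int numBlocks)
{
    __syncthreads();                  // whole block has reached the barrier
    if (threadIdx.x == 0) {
        __threadfence();              // make this block's prior writes visible grid-wide
        atomicAdd((int*)arrived, 1);  // signal arrival
        while (*arrived < numBlocks)  // spin until every block has arrived
            ;
    }
    __syncthreads();                  // release the rest of the block
}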
Looks like a compiler optimization issue. I'm not good at reading PTX code, but it looks like the compiler has omitted the while loop entirely (even when compiling with -O0).
In CPU code, such behavior is prevented by declaring the variable with the volatile qualifier. But even if we declare count as int __device__ count (and change the code appropriately), adding the volatile specifier just breaks compilation, with errors like: argument of type "volatile int *" is incompatible with parameter of type "void *".
I suggest looking at the threadFenceReduction example from the CUDA SDK. They do pretty much the same thing you do, but the block that performs the final summation is chosen at runtime rather than predefined, and the while loop is eliminated, because a spin-lock on a global variable would be very slow.
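For reference, a rough sketch of that threadFenceReduction pattern applied to your kernel (still one thread per block, reusing your SIZE and BLOCKS definitions; retirementCount and the other names here are mine, not from the SDK sample):

__device__ unsigned int retirementCount = 0;   // must start at 0

__global__ void sum(const int* vec, volatile int* cache, int* sum)
{
    int tid = blockIdx.x;
    int temp = 0;
    while (tid < SIZE) {
        temp += vec[tid];
        tid += gridDim.x;
    }
    cache[blockIdx.x] = temp;       // publish this block's partial sum

    __threadfence();                // make the write visible to other blocks

    // atomicInc returns the old value; the block that sees BLOCKS - 1
    // is the last one to finish, so it performs the final summation.
    unsigned int ticket = atomicInc(&retirementCount, BLOCKS);
    if (ticket == BLOCKS - 1) {
        int total = 0;
        for (int i = 0; i < BLOCKS; ++i)
            total += cache[i];
        *sum = total;
    }
}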