关于CUDA中从block到SM的分配细节的问题

发布于 2024-12-01 04:05:50 字数 887 浏览 1 评论 0原文

我以计算能力1.3的硬件为例。

30 个 SM 可用。那么最多可以同时运行240个块（考虑到寄存器和共享内存的限制，对块数量的限制可能要低得多）。超过 240 的块必须等待可用的硬件资源。

我的问题是，超过 240 的区块何时会分配给 SM。前 240 个区块中的部分区块完成后？或者前 240 个区块全部何时完成？

我写了这样一段代码。

#include<stdio.h>
#include<string.h>
#include<cuda_runtime.h>
#include<cutil_inline.h>

const int BLOCKNUM = 1024;
const int N=240;
__global__ void kernel ( volatile int* mark ) {
    if ( blockIdx.x == 0 ) while ( mark[N] == 0 );
    if ( threadIdx.x == 0 ) mark[blockIdx.x] = 1;
}

int main() {
    int * mark;
    cudaMalloc ( ( void** ) &mark, sizeof ( int ) *BLOCKNUM );
    cudaMemset ( mark, 0, sizeof ( int ) *BLOCKNUM );
    kernel <<< BLOCKNUM, 1>>> ( mark );
    cudaFree ( mark );
    return 0;
}

此代码会导致死锁并且无法终止。但如果我将 N 从 240 更改为 239，代码就能够终止。所以我想了解一些关于块调度的细节。

原文

Let me take the hardware with computation ability 1.3 as an example.

30 SMs are available. Then at most 240 blocks are able to be running at the same time(Considering the limit of register and shared memory, the restriction to the number of block may be much lower). Those blocks beyond 240 have to wait for available hardware resources.

My question is when those blocks beyond 240 will be assigned to SMs. Once some blocks of the first 240 are completed? Or when all of the first 240 blocks are finished?

I wrote such a piece of code.

#include<stdio.h>
#include<string.h>
#include<cuda_runtime.h>
#include<cutil_inline.h>

const int BLOCKNUM = 1024;
const int N=240;
__global__ void kernel ( volatile int* mark ) {
    if ( blockIdx.x == 0 ) while ( mark[N] == 0 );
    if ( threadIdx.x == 0 ) mark[blockIdx.x] = 1;
}

int main() {
    int * mark;
    cudaMalloc ( ( void** ) &mark, sizeof ( int ) *BLOCKNUM );
    cudaMemset ( mark, 0, sizeof ( int ) *BLOCKNUM );
    kernel <<< BLOCKNUM, 1>>> ( mark );
    cudaFree ( mark );
    return 0;
}

This code causes a deadlock and fails to terminate. But if I change N from 240 to 239, the code is able to terminate. So I want to know some details about the scheduling of blocks.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

久隐师 2024-12-08 04:05:50

在 GT200 上，通过微基准测试证明，只要 SM 退役其正在运行的所有当前活动块，就会调度新块。所以答案是当一些块完成时，调度粒度是SM级别。人们似乎一致认为，Fermi GPU 比前几代硬件具有更精细的调度粒度。

回复收藏 0 原文

哆兒滾 2024-12-08 04:05:50

我找不到任何关于计算能力的参考文献< 1.3.

Fermi 架构引入了一种新的块调度程序，称为 GigaThread 引擎。
GigaThread 可以在完成执行后立即替换 SM 上的块，并且还可以实现并发内核执行。

回复收藏 0 原文

内心旳酸楚 2024-12-08 04:05:50

虽然对此没有官方答案，但您可以通过原子操作来测量块何时开始工作以及何时结束。

尝试使用以下代码：

#include <stdio.h>

const int maxBlocks=60; //Number of blocks of size 512 threads on current device required to achieve full occupancy

__global__ void emptyKernel() {}


__global__ void myKernel(int *control, int *output) {
        if (threadIdx.x==1) {
                //register that we enter
                int enter=atomicAdd(control,1);
                output[blockIdx.x]=enter;

                //some intensive and long task
                int &var=output[blockIdx.x+gridDim.x]; //var references global memory
                var=1;
                for (int i=0; i<12345678; ++i) {
                        var+=1+tanhf(var);
                }

                //register that we quit
                var=atomicAdd(control,1);
        }
}


int main() {

        int *gpuControl;
        cudaMalloc((void**)&gpuControl, sizeof(int));
        int cpuControl=0;
        cudaMemcpy(gpuControl,&cpuControl,sizeof(int),cudaMemcpyHostToDevice);


        int *gpuOutput;
        cudaMalloc((void**)&gpuOutput, sizeof(int)*maxBlocks*2);
        int cpuOutput[maxBlocks*2];

        for (int i=0; i<maxBlocks*2; ++i) //clear the host array just to be on the safe side
                cpuOutput[i]=-1;

        // play with these values
        const int thr=479;
        const int p=13;
        const int q=maxBlocks;

        //I found that this may actually affect the scheduler! Try with and without this call.
        emptyKernel<<<p,thr>>>();

        cudaEvent_t timerStart;
        cudaEvent_t timerStop;
        cudaEventCreate(&timerStart);
        cudaEventCreate(&timerStop);

        cudaThreadSynchronize();

        cudaEventRecord(timerStart,0);

        myKernel<<<q,512>>>(gpuControl, gpuOutput);

        cudaEventRecord(timerStop,0);
        cudaEventSynchronize(timerStop);

        cudaMemcpy(cpuOutput,gpuOutput,sizeof(int)*maxBlocks*2,cudaMemcpyDeviceToHost);

        cudaThreadSynchronize();
        float thisTime;
        cudaEventElapsedTime(&thisTime,timerStart,timerStop);

        cudaEventDestroy(timerStart);
        cudaEventDestroy(timerStop);
        printf("Elapsed time: %f\n",thisTime);

        for (int i=0; i<q; ++i)
                printf("%d: %d-%d\n",i,cpuOutput[i],cpuOutput[i+q]);
}

输出中得到的是块 ID，后跟输入“时间”和退出“时间”。通过这种方式，您可以了解这些事件发生的顺序。

While there is no official answer to this, you can measure through atomic operations when your blocks begin your work and when they end.

Try playing with the following code:

#include <stdio.h>

const int maxBlocks=60; //Number of blocks of size 512 threads on current device required to achieve full occupancy

__global__ void emptyKernel() {}


__global__ void myKernel(int *control, int *output) {
        if (threadIdx.x==1) {
                //register that we enter
                int enter=atomicAdd(control,1);
                output[blockIdx.x]=enter;

                //some intensive and long task
                int &var=output[blockIdx.x+gridDim.x]; //var references global memory
                var=1;
                for (int i=0; i<12345678; ++i) {
                        var+=1+tanhf(var);
                }

                //register that we quit
                var=atomicAdd(control,1);
        }
}


int main() {

        int *gpuControl;
        cudaMalloc((void**)&gpuControl, sizeof(int));
        int cpuControl=0;
        cudaMemcpy(gpuControl,&cpuControl,sizeof(int),cudaMemcpyHostToDevice);


        int *gpuOutput;
        cudaMalloc((void**)&gpuOutput, sizeof(int)*maxBlocks*2);
        int cpuOutput[maxBlocks*2];

        for (int i=0; i<maxBlocks*2; ++i) //clear the host array just to be on the safe side
                cpuOutput[i]=-1;

        // play with these values
        const int thr=479;
        const int p=13;
        const int q=maxBlocks;

        //I found that this may actually affect the scheduler! Try with and without this call.
        emptyKernel<<<p,thr>>>();

        cudaEvent_t timerStart;
        cudaEvent_t timerStop;
        cudaEventCreate(&timerStart);
        cudaEventCreate(&timerStop);

        cudaThreadSynchronize();

        cudaEventRecord(timerStart,0);

        myKernel<<<q,512>>>(gpuControl, gpuOutput);

        cudaEventRecord(timerStop,0);
        cudaEventSynchronize(timerStop);

        cudaMemcpy(cpuOutput,gpuOutput,sizeof(int)*maxBlocks*2,cudaMemcpyDeviceToHost);

        cudaThreadSynchronize();
        float thisTime;
        cudaEventElapsedTime(&thisTime,timerStart,timerStop);

        cudaEventDestroy(timerStart);
        cudaEventDestroy(timerStop);
        printf("Elapsed time: %f\n",thisTime);

        for (int i=0; i<q; ++i)
                printf("%d: %d-%d\n",i,cpuOutput[i],cpuOutput[i+q]);
}

What you get in the output is the block ID, followed by the enter "time" and exit "time". This way you can learn in which order those events occured.

回复收藏 0 原文

殊姿 2024-12-08 04:05:50

在费米上，我确信只要有空间，就会在 SM 上安排一个区块。即，每当SM执行完一个块时，如果还有剩余块，它将执行另一个块。（但是，实际顺序并不是确定的）。

在旧版本中，我不知道。但您可以使用内置的clock()函数来验证它。

例如，我使用了以下 OpenCL 内核代码（您可以轻松地将其转换为 CUDA）：

   __kernel void test(uint* start, uint* end, float* buffer);
   {
       int id = get_global_id(0);
       start[id] = clock();
       __do_something_here;
       end[id] = clock();
   }

然后将其输出到文件并构建图形。你会看到它是多么直观。

On Fermi, I'm sure that a block is scheduled on a SM as soon there is room for it. I.e., whenever, a SM finishes executing one block, it will execute another block if there is any block left. (However, the actual order is not deterministic).

In older versions, I don't know. But you can verify it by using the build-in clock() function.

For example, I used the following OpenCL kernel code (you can easily convert it to CUDA):

   __kernel void test(uint* start, uint* end, float* buffer);
   {
       int id = get_global_id(0);
       start[id] = clock();
       __do_something_here;
       end[id] = clock();
   }

Then output it to a file and build a graph. You will see how visual it is.

回复收藏 0 原文

~没有更多了~