CUDA register pressure

Posted 2024-10-02 15:56:48

I have a kernel that does a linear least-squares fit. It turns out the threads are using too many registers, so the occupancy is low. Here is the kernel:

__global__
void strainAxialKernel(
    float* d_dis,
    float* d_str
){
    int i = threadIdx.x;
    float a = 0;
    float c = 0;
    float e = 0;
    float f = 0;
    int shift = (int)((float)(i*NEIGHBOURS)/(float)WINDOW_PER_LINE);
    int j;
    __shared__ float dis[WINDOW_PER_LINE];
    __shared__ float str[WINDOW_PER_LINE];

    // fetch data from global memory
    dis[i] = d_dis[blockIdx.x*WINDOW_PER_LINE+i];
    __syncthreads();

    // least square fit
    for (j=-shift; j<NEIGHBOURS-shift; j++)                                     
    {                                                                           
        a += j;                                                                 
        c += j*j;                                                               
        e += dis[i+j];                                                          
        f += (float(j))*dis[i+j];                                               
    }                                                                       
    str[i] = AMP*(a*e-NEIGHBOURS*f)/(a*a-NEIGHBOURS*c)/(float)BLOCK_SPACING;    

    // compensate attenuation
    if (COMPEN_EXP>0 && COMPEN_BASE>0)                                          
    {                                                                           
        str[i]                                                                  
        = (float)(str[i]*pow((float)i/(float)COMPEN_BASE+1.0f,COMPEN_EXP));     
    }   

    // write back to global memory
    if (!SIGN_PRESERVE && str[i]<0)                                             
    {                                                                           
        d_str[blockIdx.x*WINDOW_PER_LINE+i] = -str[i];                          
    }                                                                           
    else                                                                        
    {                                                                           
        d_str[blockIdx.x*WINDOW_PER_LINE+i] = str[i];                           
    }
}

I have 32x404 blocks with 96 threads in each block. On the GTS 250, each SM should be able to handle 8 blocks. Yet the Visual Profiler shows 11 registers per thread, and as a result the occupancy is 0.625 (5 blocks per SM). BTW, the shared memory used by each block is 792 B, so registers are the problem.
The performance is not the end of the world; I am just curious whether there is any way to get around this. Thanks.
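
For what it's worth, the 0.625 figure can be reproduced by hand. The sketch below plugs in the per-SM limits of compute capability 1.1 and the register-allocation rounding rules from the old CUDA occupancy calculator spreadsheet; those rounding rules are an assumption on my part, so treat it as an estimate rather than a statement of exactly how the hardware allocates registers.

#include <stdio.h>

int main(void)
{
    /* Per-SM limits for compute capability 1.1 (GTS 250). */
    const int regs_per_sm       = 8192;
    const int max_threads_sm    = 768;
    const int max_blocks_sm     = 8;

    const int threads_per_block = 96;
    const int regs_per_thread   = 11;

    /* Assumed allocation rule: registers are granted per block, with the warp
       count rounded up to an even number and the total rounded up to a
       multiple of 256. */
    int warps = (threads_per_block + 31) / 32;                /* 3 warps   */
    if (warps % 2) warps++;                                   /* rounded to 4 */
    int regs_per_block = warps * 32 * regs_per_thread;        /* 1408      */
    regs_per_block = ((regs_per_block + 255) / 256) * 256;    /* 1536      */

    int by_regs    = regs_per_sm / regs_per_block;            /* 5 */
    int by_threads = max_threads_sm / threads_per_block;      /* 8 */
    int blocks = by_regs < by_threads ? by_regs : by_threads;
    if (blocks > max_blocks_sm) blocks = max_blocks_sm;

    /* Prints: blocks/SM = 5, occupancy = 62.5% */
    printf("blocks/SM = %d, occupancy = %.1f%%\n",
           blocks, 100.0 * blocks * threads_per_block / max_threads_sm);
    return 0;
}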

Comments (3)

黯然#的苍凉 2024-10-09 15:56:49

You can use launch bounds to instruct the compiler to generate a register mapping for a maximum number of threads and a minimum number of blocks per multiprocessor. This can reduce register counts so that you can achieve the desired occupancy.

For your case, Nvidia's occupancy calculator shows a theoretical peak occupancy of 63%, which seems to be what you're achieving. This is due to your register count, as you mention, but it is also due to the number of threads per block. Increasing the number of threads per block to 128 and decreasing the register count to 10 yields 100% theoretical peak occupancy.
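
A rough check of that claim, assuming the same per-block allocation granularity as on compute capability 1.1 (warp count rounded up to an even number, total rounded up to 256-register units):

128 threads/block = 4 warps -> 4 * 32 * 10 = 1280 registers per block (already a multiple of 256)
floor(8192 / 1280) = 6 blocks per SM -> 6 * 128 = 768 threads, the full 768-thread limit, i.e. 100% occupancy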

To control the launch bounds for your kernel:

__global__ void
__launch_bounds__(128, 6)   // at most 128 threads per block, at least 6 resident blocks per SM
MyKernel(...)
{
    ...
}

Then just launch with a block size of 128 threads and enjoy your occupancy. The compiler should generate your kernel such that it uses 10 or fewer registers.
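
For reference, here is a sketch of what this could look like applied to the kernel from the question. It assumes WINDOW_PER_LINE is raised to 128 so the one-thread-per-window indexing still holds; if the block has to stay at 96 threads, something like __launch_bounds__(96, 8) would be the equivalent hint.

__global__ void
__launch_bounds__(128, 6)    // at most 128 threads per block, at least 6 resident blocks per SM
strainAxialKernel(
    float* d_dis,
    float* d_str
){
    // body unchanged from the question
    ...
}

// host side, launched with the matching block size (grid size as in the original launch):
// strainAxialKernel<<<gridSize, 128>>>(d_dis, d_str);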

蓬勃野心 2024-10-09 15:56:48

There is always a trade-off between the fast but limited registers/shared memory and the slow but large global memory. There's no way to "get around" that trade-off. If you reduce register usage by using global memory, you should get higher occupancy but slower memory access.

That said, here are some ideas to use fewer registers:

  1. Can shift be precomputed and stored in constant memory? Then each thread just needs to look up shift[i] (a sketch follows this list).
  2. Do a and c have to be floats?
  3. Or, can a and c be pulled out of the loop and computed once, so they disappear from the per-iteration work entirely?
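
For idea 1, a minimal sketch; the table name d_shift_table and the helper uploadShiftTable are made-up names for illustration:

// Device-side lookup table holding one precomputed shift per thread.
__constant__ int d_shift_table[WINDOW_PER_LINE];

// Host-side setup, done once before the kernel is launched.
void uploadShiftTable(void)
{
    int h_shift[WINDOW_PER_LINE];
    for (int i = 0; i < WINDOW_PER_LINE; i++)
        h_shift[i] = (int)((float)(i * NEIGHBOURS) / (float)WINDOW_PER_LINE);
    cudaMemcpyToSymbol(d_shift_table, h_shift, sizeof(h_shift));
}

// Inside the kernel, the per-thread computation becomes a lookup:
//     int shift = d_shift_table[threadIdx.x];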

a is the sum of a simple arithmetic sequence (the loop runs j from -shift up to NEIGHBOURS-shift-1, so there are NEIGHBOURS terms), so it reduces to a closed form, something like this:

a = NEIGHBOURS * ((NEIGHBOURS-1-shift) + (-shift)) / 2

or

a = NEIGHBOURS * (NEIGHBOURS - 1 - 2*shift) / 2

so instead, do something like the following (you can probably reduce these expressions further):

str[i] = AMP*((NEIGHBOURS*(NEIGHBOURS - 1 - 2*shift)/2)*e - NEIGHBOURS*f);
str[i] /= ((NEIGHBOURS*(NEIGHBOURS - 1 - 2*shift)/2)*(NEIGHBOURS*(NEIGHBOURS - 1 - 2*shift)/2) - NEIGHBOURS*c);
str[i] /= (float)BLOCK_SPACING;

很酷又爱笑 2024-10-09 15:56:48

Occupancy is NOT a problem.

The SM in the GTS 250 (compute capability 1.1) may be able to hold 8 blocks (8x96 threads) simultaneously in its registers, but it only has 8 execution units, meaning that only 8 out of those 8x96 (or, in your case, 5x96) resident threads advance at any given moment. There is very little value in trying to squeeze more blocks onto an already overloaded SM.

In fact, you could try playing with the -maxrregcount option to INCREASE the number of registers; that could have a positive effect on performance.
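
In case it helps, that cap is set per compilation unit with nvcc's -maxrregcount flag; experimenting with different values (or removing the flag entirely) shows the register/occupancy trade-off. The file name and the value 16 below are just placeholders:

nvcc -arch=sm_11 -maxrregcount=16 strain.cu -o strain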
