减少 CUDA 内核中使用的寄存器数量

发布于 2024-08-21 22:03:03 字数 221 浏览 7 评论 0原文

我有一个使用 17 个寄存器的内核,将其减少到 16 个将使我获得 100% 的占用率。我的问题是:是否有方法可以用来减少使用的寄存器数量,而不是以不同的方式完全重写我的算法。我一直认为编译器比我聪明得多,因此,例如,为了清楚起见,我经常使用额外的变量。我这个想法有错吗?

请注意:我确实知道 --max_registers (或任何语法)标志,但使用本地内存比降低 25% 的占用率更有害(我应该对此进行测试)

I have a kernel which uses 17 registers, reducing it to 16 would bring me 100% occupancy. My question is: are there methods that can be used to reduce the number or registers used, excluding completely rewriting my algorithms in a different manner. I have always kind of assumed the compiler is a lot smarter than I am, so for example I often use extra variables for clarity's sake alone. Am I wrong in this thinking?

Please note: I do know about the --max_registers (or whatever the syntax is) flag, but the use of local memory would be more detrimental than a 25% lower occupancy (I should test this)

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(5

绮烟 2024-08-28 22:03:03

真的很难说,在我看来 nvcc 编译器不是很聪明。
您可以尝试明显的事情,例如使用 Short 而不是 int、通过引用传递和使用变量(例如&variable)、展开循环、使用模板(如 C++ 中)。如果您有按顺序应用除法、超越函数,请尝试将它们作为循环。尝试摆脱条件,可能用冗余计算替换它们。

如果你发布一些代码,也许你会得到具体的答案。

It's really hard to say, nvcc compiler is not very smart in my opinion.
You can try obvious things, for example using short instead of int, passing and using variables by reference (e.g.&variable), unrolling loops, using templates (as in C++). If you have divisions, transcendental functions, been applied in sequence, try to make them as a loop. Try to get rid of conditionals, possibly replacing them with redundant computations.

If you post some code, maybe you will get specific answers.

愚人国度 2024-08-28 22:03:03

入住率可能会有点误导,100% 入住率不应该是您的主要目标。如果您可以获得对全局内存的完全合并访问,那么在高端 GPU 上,50% 的占用率将足以隐藏全局内存的延迟(对于浮点数,对于双精度数甚至更低)。查看去年 GTC 的高级 CUDA C 演示,了解有关以下方面的更多信息这个话题。

在您的情况下,您应该测量 maxrregcount 设置为 16 和不设置 maxrregcount 的性能。由于拥有足够的线程,本地内存的延迟应该被隐藏,假设您不随机访问本地数组(这将导致非-合并访问)。

要回答您有关减少寄存器的具体问题,请发布代码以获取更详细的答案!了解编译器的一般工作原理可能会有所帮助,但请记住,nvcc 是一个具有较大参数空间的优化编译器,因此最小化寄存器数量必须与整体性能相平衡。

Occupancy can be a little misleading and 100% occupancy should not be your primary target. If you can get fully coalesced accesses to global memory then on a high end GPU 50% occupancy will be sufficient to hide the latency to global memory (for floats, even lower for doubles). Check out the Advanced CUDA C presentation from GTC last year for more information on this topic.

In your case, you should measure performance both with and without maxrregcount set to 16. The latency to local memory should be hidden as a result of having sufficient threads, assuming you don't random access into local arrays (which would result in non-coalesced accesses).

To answer you specific question about reducing registers, post the code for more detailed answers! Understanding how compilers work in general may help, but remember that nvcc is an optimising compiler with a large parameter space, so minimising register count has to be balanced with overall performance.

極樂鬼 2024-08-28 22:03:03

利用共享内存作为缓存可能会减少寄存器的使用,并防止寄存器溢出到本地内存...

认为内核计算了一些值,并且这些计算出的值被所有线程使用,

__global__ void kernel(...) {
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    int id0 = blockDim.x * blockIdx.x;

    int reg = id0 * ...;
    int reg0 = reg * a / x + y;


    ...

    int val =  reg + reg0 + 2 * idx;

    output[idx] = val > 10;
}

因此,不要将 reg 和 reg0 保留为寄存器,为了使它们有可能溢出到本地内存(全局内存),我们可以使用共享内存。

__global__ void kernel(...) {
    __shared__ int cache[10];

    int idx = threadIdx.x + blockDim.x * blockIdx.x;

    if (threadIdx.x == 0) {
      int id0 = blockDim.x * blockIdx.x;

      cache[0] = id0 * ...;
      cache[1] = cache[0] * a / x + y;
    }
    __syncthreads();


    ...

    int val =  cache[0] + cache[1] + 2 * idx;

    output[idx] = val > 10;
}

请查看这篇论文以获取更多信息。

Utilizing shared memory as cache may lead less register usage and prevent register spilling to local memory...

Think that the kernel calculates some values and these calculated values are used by all of the threads,

__global__ void kernel(...) {
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    int id0 = blockDim.x * blockIdx.x;

    int reg = id0 * ...;
    int reg0 = reg * a / x + y;


    ...

    int val =  reg + reg0 + 2 * idx;

    output[idx] = val > 10;
}

So, instead of keeping reg and reg0 as registers and making them possibily spill out to local memory (global memory), we may use shared memory.

__global__ void kernel(...) {
    __shared__ int cache[10];

    int idx = threadIdx.x + blockDim.x * blockIdx.x;

    if (threadIdx.x == 0) {
      int id0 = blockDim.x * blockIdx.x;

      cache[0] = id0 * ...;
      cache[1] = cache[0] * a / x + y;
    }
    __syncthreads();


    ...

    int val =  cache[0] + cache[1] + 2 * idx;

    output[idx] = val > 10;
}

Take a look at this paper for further information..

自由如风 2024-08-28 22:03:03

一般来说,这不是最小化套准压力的好方法。编译器很好地优化了总体预计的内核性能,并且它考虑了很多因素,包括寄存器。

减少寄存器导致速度变慢时如何工作

最有可能的是编译器不得不将不足的寄存器数据溢出到“本地”内存中,这本质上与全局内存相同,因此非常慢

为了优化目的,我建议使用像 const 这样的关键字、 volatile 等必要时,帮助编译器进行优化阶段。

无论如何,并不是这些像寄存器这样的小问题经常导致 CUDA 内核运行缓慢。我建议优化全局内存、访问模式、纹理内存中的缓存(如果可能)以及 PCIe 上的事务。

It is not generally a good approach to minimize register pressure. The compiler does a good job optimizing the overall projected kernel performance, and it takes into account lots of factors, incliding register.

How does it work when reducing registers caused slower speed

Most probably the compiler had to spill insufficient register data into "local" memory, which is essentially the same as global memory, and thus very slow

For optimization purposes I would recommend to use keywords like const, volatile and so on where necessary, to help the compiler on the optimization phase.

Anyway, it is not these tiny issues like registers which often make CUDA kernels run slow. I'd recommend to optimize work with global memory, the access pattern, caching in texture memory if possible, transactions over the PCIe.

十六岁半 2024-08-28 22:03:03

降低寄存器使用量时指令数增加有一个简单的解释。编译器可能使用寄存器来存储代码中多次使用的某些操作的结果,以避免重新计算这些值,当被迫使用较少的寄存器时,编译器决定重新计算将存储在寄存器中的那些值否则。

The instruction count increase when lowering the register usage have a simple explanation. The compiler could be using registers to store the results of some operations that are used more than once through your code in order to avoid recalculating those values, when forced to use less registers, the compiler decides to recalculate those values that would be stored in registers otherwise.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文