CUDA SM register limit
I know that the number of blocks running on one SM is limited by the block count, threads per block, shared memory, and registers. Is there any strategy for avoiding too many registers? I just don't want to use so many of them that they end up limiting the number of blocks I can run on one SM.
2 Answers
Compiling with nvcc -Xptxas -v will print out the diagnostic information Edric mentioned. Additionally, you can force the compiler to conserve registers using the __launch_bounds__ qualifier. For example, declaring a kernel with __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor) guarantees that at least minBlocksPerMultiprocessor blocks of size maxThreadsPerBlock will fit on a single SM. See Section B.16 of the CUDA Programming Guide for a complete explanation of __launch_bounds__.
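As a rough sketch of what that looks like in practice (the kernel name and the bound values below are made up for illustration, not taken from the answer):

```cuda
// Illustrative only: kernel name and bound values are placeholders.
#define MAX_THREADS_PER_BLOCK 256
#define MIN_BLOCKS_PER_MP     4

// Ask the compiler to limit register usage so that at least
// MIN_BLOCKS_PER_MP blocks of MAX_THREADS_PER_BLOCK threads
// can be resident on one SM.
__global__ void
__launch_bounds__(MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_MP)
scaleKernel(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}
```

Compiling with nvcc -Xptxas -v then shows the per-kernel register count ptxas settled on, along with any values it had to spill to local memory to meet the bound.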
One of the main drivers of the number of registers is the amount of local data you declare in your kernel. However, the PTX assembler can do quite a good job of re-using registers, so it's not always easy to work out from the PTX code how many will be used - you need to run ptxas to get the real answer.
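For instance (a minimal sketch; the kernel name, file name, and sizes are illustrative, not from the answer), per-thread local data like the small array below is exactly the kind of thing that drives register pressure, and verbose ptxas output is how you find out what it actually cost:

```cuda
// Sketch only: per-thread local data the compiler will try to keep in
// registers, spilling to local memory if it runs out.
__global__ void sumNeighbours(const float *in, float *out, int n)
{
    float window[8];
    int base = blockIdx.x * blockDim.x + threadIdx.x;

    for (int i = 0; i < 8; ++i)
        window[i] = (base + i < n) ? in[base + i] : 0.0f;

    float sum = 0.0f;
    for (int i = 0; i < 8; ++i)
        sum += window[i];

    if (base < n)
        out[base] = sum;
}

// Compile with verbose ptxas output to see the real register count:
//   nvcc -c -Xptxas -v sum_neighbours.cu
// ptxas reports per-kernel resource usage (registers, shared memory, spills);
// that report, not the PTX listing, is the authoritative number.
```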