如何确定CUDA的哪些行使用最多的寄存器？

发布于 2024-11-30 18:08:11 字数 666 浏览 0 评论 0原文

我有一个有点复杂的内核，具有以下统计数据：

ptxas info    : Compiling entry function 'my_kernel' for 'sm_21'
ptxas info    : Function properties for my_kernel
    32 bytes stack frame, 64 bytes spill stores, 40 bytes spill loads
ptxas info    : Used 62 registers, 120 bytes cmem[0], 128 bytes cmem[2], 8 bytes cmem[14], 4 bytes cmem[16]

我不清楚内核的哪一部分在寄存器使用方面是“高水位线”。内核的本质是这样的，即删除常量值的各个部分会导致优化器对后面的部分进行常量折叠，等等（至少看起来是这样，因为当我这样做时得到的数字并没有赚到太多钱）感觉）。

CUDA 分析器同样无济于事 AFAICT，只是告诉我我有寄存器压力。

有没有办法获得有关寄存器使用的更多信息？我更喜欢某种工具，但我也有兴趣了解直接挖掘已编译的二进制文件（如果需要的话）。

编辑：我当然有可能采用这种自下而上的方法（即进行实验性代码更改、检查对寄存器使用的影响等），但我宁愿从自上而下开始，或者至少获得一些关于何处的指导开始自下而上的调查。

原文

I have a somewhat complex kernel with the following stats:

ptxas info    : Compiling entry function 'my_kernel' for 'sm_21'
ptxas info    : Function properties for my_kernel
    32 bytes stack frame, 64 bytes spill stores, 40 bytes spill loads
ptxas info    : Used 62 registers, 120 bytes cmem[0], 128 bytes cmem[2], 8 bytes cmem[14], 4 bytes cmem[16]

It's not clear to me which part of the kernel is the "high water mark" in terms of register usage. The nature of the kernel is such that stubbing out various parts for constant values causes the optimizer to constant-fold later parts, etc. (at least that's how it seems, since the numbers I get back when I do so don't make much sense).

The CUDA profiler is similarly unhelpful AFAICT, simply telling me that I have register pressure.

Is there a way to get more information about register usage? I'd prefer a tool of some kind, but I'd also be interested in hearing about digging into the compiled binary directly, if that's what it takes.

Edit: It is certainly possible for me to approach this bottom-up (ie. making experimental code changes, checking the impact on register usage, etc.) but I'd rather start top-down, or at least get some guidance on where to begin bottom-up investigation.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

巾帼英雄 2024-12-07 18:08:11

您可以通过编译到带注释的 PTX 来感受编译器输出的复杂性，如下所示：

nvcc -ptx -Xopencc="-LIST:source=on" branching.cu

它将发出一个 PTX 汇编程序文件，其中包含原始 C 代码作为注释：

        .entry _Z11branchTest0PfS_S_ (
                .param .u64 __cudaparm__Z11branchTest0PfS_S__a,
                .param .u64 __cudaparm__Z11branchTest0PfS_S__b,
                .param .u64 __cudaparm__Z11branchTest0PfS_S__d)
        {
        .reg .u16 %rh<4>;
        .reg .u32 %r<5>;
        .reg .u64 %rd<10>;
        .reg .f32 %f<5>;
        .loc    16      1       0
 //   1  __global__ void branchTest0(float *a, float *b, float *d)
$LDWbegin__Z11branchTest0PfS_S_:
        .loc    16      7       0
 //   3         unsigned int tidx = threadIdx.x + blockDim.x*blockIdx.x;
 //   4         float aval = a[tidx], bval = b[tidx];
 //   5         float z0 = (aval > bval) ? aval : bval;
 //   6  
 //   7         d[tidx] = z0;
        mov.u16         %rh1, %ctaid.x;
        mov.u16         %rh2, %ntid.x;
        mul.wide.u16    %r1, %rh1, %rh2;
        cvt.u32.u16     %r2, %tid.x;
        add.u32         %r3, %r2, %r1;
        cvt.u64.u32     %rd1, %r3;
        mul.wide.u32    %rd2, %r3, 4;
        ld.param.u64    %rd3, [__cudaparm__Z11branchTest0PfS_S__a];
        add.u64         %rd4, %rd3, %rd2;
        ld.global.f32   %f1, [%rd4+0];
        ld.param.u64    %rd5, [__cudaparm__Z11branchTest0PfS_S__b];
        add.u64         %rd6, %rd5, %rd2;
        ld.global.f32   %f2, [%rd6+0];
        max.f32         %f3, %f1, %f2;
        ld.param.u64    %rd7, [__cudaparm__Z11branchTest0PfS_S__d];
        add.u64         %rd8, %rd7, %rd2;
        st.global.f32   [%rd8+0], %f3;
        .loc    16      8       0
 //   8  }
        exit;
$LDWend__Z11branchTest0PfS_S_:
        } // _Z11branchTest0PfS_S_

请注意，这不会直接告诉您有关寄存器的任何信息用法，因为 PTX 使用静态单赋值，但它向您显示汇编器作为输入给出的内容以及它与原始代码的关系。使用 CUDA 4.0 工具包，您可以将 C 编译为 Fermi 架构的 cubin 文件：

$ nvcc -cubin -arch=sm_20 -Xptxas="-v" branching.cu
ptxas info    : Compiling entry function '_Z11branchTest1PfS_S_' for 'sm_20'
ptxas info    : Function properties for _Z11branchTest1PfS_S_
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

并使用 cuobjdump 实用程序反汇编汇编器生成的机器代码。

$ cuobjdump -sass branching.cubin 

code for sm_20
    Function : _Z11branchTest0PfS_S_
/*0000*/     /*0x00005de428004404*/     MOV R1, c [0x1] [0x100];
/*0008*/     /*0x94001c042c000000*/     S2R R0, SR_CTAid_X;
/*0010*/     /*0x84009c042c000000*/     S2R R2, SR_Tid_X;
/*0018*/     /*0x10015de218000000*/     MOV32I R5, 0x4;
/*0020*/     /*0x2000dc0320044000*/     IMAD.U32.U32 R3, R0, c [0x0] [0x8], R2;
/*0028*/     /*0x10311c435000c000*/     IMUL.U32.U32.HI R4, R3, 0x4;
/*0030*/     /*0x80319c03200b8000*/     IMAD.U32.U32 R6.CC, R3, R5, c [0x0] [0x20];
/*0038*/     /*0x9041dc4348004000*/     IADD.X R7, R4, c [0x0] [0x24];
/*0040*/     /*0xa0321c03200b8000*/     IMAD.U32.U32 R8.CC, R3, R5, c [0x0] [0x28];
/*0048*/     /*0x00609c8584000000*/     LD.E R2, [R6];
/*0050*/     /*0xb0425c4348004000*/     IADD.X R9, R4, c [0x0] [0x2c];
/*0058*/     /*0xc0329c03200b8000*/     IMAD.U32.U32 R10.CC, R3, R5, c [0x0] [0x30];
/*0060*/     /*0x00801c8584000000*/     LD.E R0, [R8];
/*0068*/     /*0xd042dc4348004000*/     IADD.X R11, R4, c [0x0] [0x34];
/*0070*/     /*0x00201c00081e0000*/     FMNMX R0, R2, R0, !pt;
/*0078*/     /*0x00a01c8594000000*/     ST.E [R10], R0;
/*0080*/     /*0x00001de780000000*/     EXIT;
    ......................................

通常可以从汇编程序追溯到 PTX，并至少粗略地了解“贪婪”代码部分在哪里。话虽如此，管理寄存器压力是目前 CUDA 编程更困难的方面之一。如果/当 NVIDIA 记录他们的设备代码 ELF 格式时，我认为一个合适的代码分析工具对某人来说将是一个很棒的项目。

You can get a feel for the complexity of the compiler output by compiling to annotated PTX like this:

nvcc -ptx -Xopencc="-LIST:source=on" branching.cu

which will issue a PTX assembler file with the original C code inside it as comments:

        .entry _Z11branchTest0PfS_S_ (
                .param .u64 __cudaparm__Z11branchTest0PfS_S__a,
                .param .u64 __cudaparm__Z11branchTest0PfS_S__b,
                .param .u64 __cudaparm__Z11branchTest0PfS_S__d)
        {
        .reg .u16 %rh<4>;
        .reg .u32 %r<5>;
        .reg .u64 %rd<10>;
        .reg .f32 %f<5>;
        .loc    16      1       0
 //   1  __global__ void branchTest0(float *a, float *b, float *d)
$LDWbegin__Z11branchTest0PfS_S_:
        .loc    16      7       0
 //   3         unsigned int tidx = threadIdx.x + blockDim.x*blockIdx.x;
 //   4         float aval = a[tidx], bval = b[tidx];
 //   5         float z0 = (aval > bval) ? aval : bval;
 //   6  
 //   7         d[tidx] = z0;
        mov.u16         %rh1, %ctaid.x;
        mov.u16         %rh2, %ntid.x;
        mul.wide.u16    %r1, %rh1, %rh2;
        cvt.u32.u16     %r2, %tid.x;
        add.u32         %r3, %r2, %r1;
        cvt.u64.u32     %rd1, %r3;
        mul.wide.u32    %rd2, %r3, 4;
        ld.param.u64    %rd3, [__cudaparm__Z11branchTest0PfS_S__a];
        add.u64         %rd4, %rd3, %rd2;
        ld.global.f32   %f1, [%rd4+0];
        ld.param.u64    %rd5, [__cudaparm__Z11branchTest0PfS_S__b];
        add.u64         %rd6, %rd5, %rd2;
        ld.global.f32   %f2, [%rd6+0];
        max.f32         %f3, %f1, %f2;
        ld.param.u64    %rd7, [__cudaparm__Z11branchTest0PfS_S__d];
        add.u64         %rd8, %rd7, %rd2;
        st.global.f32   [%rd8+0], %f3;
        .loc    16      8       0
 //   8  }
        exit;
$LDWend__Z11branchTest0PfS_S_:
        } // _Z11branchTest0PfS_S_

Note that this doesn't directly tell you anything about the register usage, because PTX uses static-single assignment, but it shows you what the assembler is given as an input and how it relates to your original code. With the CUDA 4.0 toolkit, you can then compile the C to a cubin file for the Fermi architecture:

$ nvcc -cubin -arch=sm_20 -Xptxas="-v" branching.cu
ptxas info    : Compiling entry function '_Z11branchTest1PfS_S_' for 'sm_20'
ptxas info    : Function properties for _Z11branchTest1PfS_S_
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

and use the cuobjdump utility to disassemble the machine code the assembler produces.

$ cuobjdump -sass branching.cubin 

code for sm_20
    Function : _Z11branchTest0PfS_S_
/*0000*/     /*0x00005de428004404*/     MOV R1, c [0x1] [0x100];
/*0008*/     /*0x94001c042c000000*/     S2R R0, SR_CTAid_X;
/*0010*/     /*0x84009c042c000000*/     S2R R2, SR_Tid_X;
/*0018*/     /*0x10015de218000000*/     MOV32I R5, 0x4;
/*0020*/     /*0x2000dc0320044000*/     IMAD.U32.U32 R3, R0, c [0x0] [0x8], R2;
/*0028*/     /*0x10311c435000c000*/     IMUL.U32.U32.HI R4, R3, 0x4;
/*0030*/     /*0x80319c03200b8000*/     IMAD.U32.U32 R6.CC, R3, R5, c [0x0] [0x20];
/*0038*/     /*0x9041dc4348004000*/     IADD.X R7, R4, c [0x0] [0x24];
/*0040*/     /*0xa0321c03200b8000*/     IMAD.U32.U32 R8.CC, R3, R5, c [0x0] [0x28];
/*0048*/     /*0x00609c8584000000*/     LD.E R2, [R6];
/*0050*/     /*0xb0425c4348004000*/     IADD.X R9, R4, c [0x0] [0x2c];
/*0058*/     /*0xc0329c03200b8000*/     IMAD.U32.U32 R10.CC, R3, R5, c [0x0] [0x30];
/*0060*/     /*0x00801c8584000000*/     LD.E R0, [R8];
/*0068*/     /*0xd042dc4348004000*/     IADD.X R11, R4, c [0x0] [0x34];
/*0070*/     /*0x00201c00081e0000*/     FMNMX R0, R2, R0, !pt;
/*0078*/     /*0x00a01c8594000000*/     ST.E [R10], R0;
/*0080*/     /*0x00001de780000000*/     EXIT;
    ......................................

It is usually possible to trace back from assembler to PTX and get at least a rough idea where the "greedy" code sections are. Having said all that, managing register pressure is one of the more difficult aspects of CUDA programming at the moment. If/when NVIDIA ever document their ELF format for device code, I reckon a proper code analyzing tool would be a great project for someone.

回复收藏 0 原文

~没有更多了~