Strange results when profiling executed and issued instructions on a Fermi GPU (GTX 580)

Posted on 2024-11-19 01:43:34

The PTX version of my kernel looks like this:

.version 2.2
.target sm_20, texmode_independent

.entry histogram(
        .param .u32 .ptr .global .align 4 histogram_param_0,
        .param .u32 .ptr .global .align 4 histogram_param_1
)
{
        .reg .f32       %f<2>;
        .reg .s32       %r<12>;

_histogram:
        mov.u32         %r1, %tid.x;
        mov.u32         %r2, %envreg3;
        add.s32         %r3, %r1, %r2;
        mov.u32         %r4, %ctaid.x;
        mov.u32         %r5, %ntid.x;
        mad.lo.s32      %r6, %r4, %r5, %r3;
        shl.b32         %r7, %r6, 2;
        ld.param.u32    %r8, [histogram_param_0];
        add.s32         %r9, %r8, %r7;
        ld.param.u32    %r10, [histogram_param_1];
        ld.global.f32   %f1, [%r9];
        add.s32         %r11, %r10, %r7;
        st.global.f32   [%r11], %f1;
        ret;
}
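A rough way to count the instructions in the listing above is sketched below. It is only an illustration: the helper name `count_instructions` and the rule it uses (skip blank lines, directives that start with `.`, labels that end with `:`, and `ret`) are assumptions made for this example, not part of any NVIDIA tool. Note also that PTX is an intermediate representation, so the number of machine instructions the hardware actually runs may differ from this count.

```python
# Kernel body copied from the PTX listing (directives and label included).
PTX_BODY = """
        .reg .f32       %f<2>;
        .reg .s32       %r<12>;

_histogram:
        mov.u32         %r1, %tid.x;
        mov.u32         %r2, %envreg3;
        add.s32         %r3, %r1, %r2;
        mov.u32         %r4, %ctaid.x;
        mov.u32         %r5, %ntid.x;
        mad.lo.s32      %r6, %r4, %r5, %r3;
        shl.b32         %r7, %r6, 2;
        ld.param.u32    %r8, [histogram_param_0];
        add.s32         %r9, %r8, %r7;
        ld.param.u32    %r10, [histogram_param_1];
        ld.global.f32   %f1, [%r9];
        add.s32         %r11, %r10, %r7;
        st.global.f32   [%r11], %f1;
        ret;
"""


def count_instructions(body, skip=("ret",)):
    """Count PTX instruction lines, ignoring directives, labels and skipped opcodes."""
    count = 0
    for raw in body.splitlines():
        line = raw.strip()
        if not line or line.startswith("."):
            continue  # blank line or directive such as .reg
        if line.endswith(":"):
            continue  # label such as _histogram:
        opcode = line.split()[0].rstrip(";")
        if opcode in skip:
            continue  # the question excludes ret from the count
        count += 1
    return count


print(count_instructions(PTX_BODY))  # prints 13
```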

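By my count, the kernel has only 13 instructions (not including the ret instruction). I launch 5120 work items with a workgroup size of 64. The GPU has 16 SMs with 32 scalar processors each, so the code above should be executed 10 times on each SM (5120 / (16 * 32) = 10). I therefore expected the number of executed instructions to be 10 * 13 = 130. However, the profiler reports: issued instructions = 130, executed instructions = 100.

1. Why does the number of issued instructions differ from the number of executed instructions? There are no branches, so shouldn't they be equal?
2. Why is the number of executed instructions smaller than expected? Shouldn't every instruction in the PTX version be executed at least once?
3. Do cache misses (L1 and L2) have any impact on the number of issued instructions and the number of executed instructions?

Thanks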
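The expectation of 130 can be reproduced with the short sketch below. It assumes a warp size of 32, an even distribution of workgroups across the 16 SMs, and that the profiler counts one instruction per warp; these are assumptions behind the estimate, not facts confirmed by the profiler output. All variable names are made up for this example.

```python
# Launch configuration and hardware figures quoted in the question.
WORK_ITEMS = 5120
WORKGROUP_SIZE = 64
NUM_SMS = 16
WARP_SIZE = 32          # assumption: one warp = 32 work items
PTX_INSTRUCTIONS = 13   # count from the PTX listing, excluding ret

workgroups = WORK_ITEMS // WORKGROUP_SIZE            # 80 workgroups in total
warps_per_workgroup = WORKGROUP_SIZE // WARP_SIZE    # 2 warps per workgroup
workgroups_per_sm = workgroups // NUM_SMS            # 5, assuming an even split
warps_per_sm = workgroups_per_sm * warps_per_workgroup  # 10 warps on each SM

# Assumption: the counter increments once per warp per instruction.
expected_per_sm = warps_per_sm * PTX_INSTRUCTIONS

print(warps_per_sm, expected_per_sm)  # prints 10 130
```

This matches the reported issued count of 130, so the open question is only why the executed count is 100.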
