Strange profiling results for executed and issued instructions on a Fermi GPU (GTX 580)
My kernel compiles to the following PTX:
.version 2.2
.target sm_20, texmode_independent
.entry histogram(
.param .u32 .ptr .global .align 4 histogram_param_0,
.param .u32 .ptr .global .align 4 histogram_param_1
)
{
.reg .f32 %f<2>;
.reg .s32 %r<12>;
_histogram:
mov.u32 %r1, %tid.x;
mov.u32 %r2, %envreg3;
add.s32 %r3, %r1, %r2;
mov.u32 %r4, %ctaid.x;
mov.u32 %r5, %ntid.x;
mad.lo.s32 %r6, %r4, %r5, %r3;
shl.b32 %r7, %r6, 2;
ld.param.u32 %r8, [histogram_param_0];
add.s32 %r9, %r8, %r7;
ld.param.u32 %r10, [histogram_param_1];
ld.global.f32 %f1, [%r9];
add.s32 %r11, %r10, %r7;
st.global.f32 [%r11], %f1;
ret;
}
By my count, there are only 13 instructions in my kernel (not including the ret instruction). I launch 5120 work items with a workgroup size of 64. Because there are 16 SMs, each with 32 scalar processors, the above code should be executed 10 times on each SM. I therefore expected the number of executed instructions to be 10*13 = 130. But the profiler reports: issued instructions = 130, executed instructions = 100.
1. Why does the number of issued instructions differ from the number of executed instructions? There are no branches, so shouldn't they be equal?
2. Why is the number of executed instructions smaller than expected? Shouldn't every instruction in the PTX be executed at least once?
3. Do cache misses (L1 and L2) have any impact on the number of issued instructions and the number of executed instructions?
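The expected figure of 130 can be reproduced with a short Python sketch. It counts the PTX statements listed above and multiplies by the number of warps per SM. It assumes a warp size of 32, that the workgroups are spread evenly over the 16 SMs, and that the profiler counts instructions per warp on a single SM. These are the assumptions behind the question's reasoning, not verified profiler behavior, and the function names are purely illustrative.

```python
# Body of the kernel as listed in the question (directives and label omitted).
PTX_BODY = """
mov.u32 %r1, %tid.x;
mov.u32 %r2, %envreg3;
add.s32 %r3, %r1, %r2;
mov.u32 %r4, %ctaid.x;
mov.u32 %r5, %ntid.x;
mad.lo.s32 %r6, %r4, %r5, %r3;
shl.b32 %r7, %r6, 2;
ld.param.u32 %r8, [histogram_param_0];
add.s32 %r9, %r8, %r7;
ld.param.u32 %r10, [histogram_param_1];
ld.global.f32 %f1, [%r9];
add.s32 %r11, %r10, %r7;
st.global.f32 [%r11], %f1;
ret;
"""


def count_ptx_instructions(body):
    """Count PTX statements, skipping blanks, directives, labels, and ret."""
    count = 0
    for line in body.splitlines():
        stmt = line.strip()
        if not stmt or stmt.startswith(".") or stmt.endswith(":"):
            continue
        if stmt.startswith("ret"):
            continue
        count += 1
    return count


def warps_per_sm(work_items, workgroup_size, num_sms, warp_size=32):
    """Warps run by one SM, assuming workgroups divide evenly across SMs."""
    warps_per_group = -(-workgroup_size // warp_size)  # ceiling division
    groups = -(-work_items // workgroup_size)          # ceiling division
    return warps_per_group * (groups // num_sms)


ptx_count = count_ptx_instructions(PTX_BODY)
warps = warps_per_sm(work_items=5120, workgroup_size=64, num_sms=16)
expected = warps * ptx_count
print(ptx_count, warps, expected)
```

This only restates the arithmetic in the question; it says nothing about how many machine instructions the GPU actually runs.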
Thanks
Comments (2)
PTX is only an intermediate representation of the compiled code. It is not what the GPU actually executes. A further assembly step emits the machine code that the GPU runs; this can happen either at compile time or through JIT compilation in the driver. As a result, your instruction counts, and anything you infer from them, are invalid.
NVIDIA ships a tool called cuobjdump, which can disassemble the assembler output generated for Fermi cards and show the actual machine code run on the GPU.
Keep in mind that PTX is not exactly what is executed on the GPU. PTX is merely an intermediate representation. The real code is in .cubin files.
That's why making such precise calculations based on PTX source code is pointless.
You can use the cuobjdump --sass tool provided with CUDA 4.0 to extract the GPU assembly code from .cubin files into something more readable.