Should I look into PTX to optimize my kernels? If so, how?
Do you recommend reading your kernel's PTX code to find out how to optimize your kernels further?

One example: I read that one can tell from the PTX code whether the automatic loop unrolling worked. If it did not, one would have to unroll the loops manually in the kernel code.

- Are there other use cases for the PTX code?
- Do you look into your PTX code?
- Where can I find out how to read the PTX code CUDA generates for my kernels?
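As a sketch of the unrolling check described above (the file name, kernel name, and architecture flag are illustrative, not from the original post), you can ask `nvcc` to emit PTX and then look for whether the loop body was replicated or left as a branch back to a label:

```cuda
// unroll_test.cu -- a minimal, hypothetical kernel for inspecting loop unrolling.
// Emit the PTX with:   nvcc -arch=sm_20 -ptx unroll_test.cu -o unroll_test.ptx
// If the unroll succeeded, the PTX contains four straight-line load/multiply/store
// sequences; if it failed, you instead see a loop label and a backward branch.
__global__ void scale4(float *data, float factor)
{
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
#pragma unroll 4              // request that the compiler unroll the loop fully
    for (int i = 0; i < 4; ++i)
        data[base + i] *= factor;
}
```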
The first point to make about PTX is that it is only an intermediate representation of the code run on the GPU -- a virtual machine assembly language. PTX is assembled into target machine code either by `ptxas` at compile time, or by the driver at runtime. So when you are looking at PTX, you are looking at what the compiler emitted, but not at what the GPU will actually run. It is also possible to write your own PTX code, either from scratch (this is the only JIT compilation model supported in CUDA), or as part of inline-assembler sections in CUDA C code (the latter officially supported since CUDA 4.0, but "unofficially" supported for much longer than that). CUDA has always shipped with a complete guide to the PTX language in the toolkit, and it is fully documented. The Ocelot project has used this documentation to implement its own PTX cross-compiler, which allows CUDA code to run natively on other hardware -- initially x86 processors, but more recently AMD GPUs.

If you want to see what the GPU is actually running (as opposed to what the compiler is emitting), NVIDIA now supplies a binary disassembler tool called `cuobjdump` which can show the actual machine-code segments in code compiled for Fermi GPUs. There was an older, unofficial tool called `decuda` which worked for G80 and G90 GPUs.

Having said that, there is a lot to be learned from PTX output, particularly about how the compiler applies optimizations and which instructions it emits to implement certain C constructs. Every version of the NVIDIA CUDA toolkit comes with a guide to `nvcc` and documentation for the PTX language. There is plenty of information in both documents both to learn how to compile CUDA C/C++ kernel code to PTX, and to understand what the PTX instructions will do.
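As a small illustration of the inline-assembler route mentioned above (a minimal sketch; the function and kernel names are invented for the example), CUDA C lets you embed individual PTX instructions with the `asm()` construct:

```cuda
// Hypothetical example: issue the PTX population-count instruction directly
// from CUDA C via inline assembly (officially supported since CUDA 4.0).
__device__ unsigned int popc_ptx(unsigned int x)
{
    unsigned int result;
    // "=r" declares a 32-bit register output operand and "r" a 32-bit
    // register input operand, following NVIDIA's inline-PTX conventions.
    asm("popc.b32 %0, %1;" : "=r"(result) : "r"(x));
    return result;
}

__global__ void count_bits(const unsigned int *in, unsigned int *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = popc_ptx(in[i]);
}
```

Reading the PTX language guide alongside small experiments like this is a practical way to learn what the individual instructions do.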