How do CUDA devices handle immediate operands?
When CUDA code is compiled with immediate (integer) operands, are they held in the instruction stream, or are they placed into memory? Specifically, I'm thinking about 24-bit or 32-bit unsigned integer operands.
I haven't been able to find information about this in any of the CUDA documentation I've examined so far. References to any documents covering microarchitectural details like this would be perfect, as I don't currently have a good model of how CUDA works at this level.
Comments (2)
NVIDIA doesn't release any information about how the devices work at this level. There is a tool called decuda that can decompile cubins, so you can see the machine code. If I recall, immediates go into the instruction stream, at least as far as decuda is able to deduce. The problem with decuda is that it only works for CUDA 2.3 or lower. They changed the executable format to ELF in CUDA 3.0, and decuda hasn't been maintained in a long time.
The best official documentation is the PTX documentation, but that documents a virtual machine ISA, not the real device.
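One way to check this on a given toolchain is to compile a tiny kernel that uses a 24-bit and a 32-bit literal and then read the generated code. The sketch below is only an illustration: the file name imm_demo.cu, the function names, and the two constants are made up for this example, and sm_XX is a placeholder for whatever architecture you target. It assumes a toolkit recent enough to include cuobjdump, which can disassemble the ELF cubins that decuda cannot read.

```cuda
// imm_demo.cu (hypothetical file name, used only for this illustration)
//
// To inspect the generated code (sm_XX is a placeholder for your target):
//   nvcc -ptx imm_demo.cu                   # virtual ISA (PTX) listing
//   nvcc -cubin -arch=sm_XX imm_demo.cu     # device binary
//   cuobjdump -sass imm_demo.cubin          # machine code disassembly
// In the listings, check whether the two literals show up as operands of
// the instructions or as loads from constant memory.
#include <cstdio>
#include <cuda_runtime.h>

// One 24-bit literal (the mask) and one full 32-bit literal (the offset).
__host__ __device__ inline unsigned int mix(unsigned int x)
{
    return (x & 0x00FFFFFFu) + 0xDEADBEEFu;
}

__global__ void mix_kernel(const unsigned int* in, unsigned int* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = mix(in[i]);
}

int main()
{
    const int n = 4;
    unsigned int h_in[n]  = {0u, 1u, 0x01000000u, 0xFFFFFFFFu};
    unsigned int h_out[n] = {0u, 0u, 0u, 0u};
    unsigned int* d_in  = nullptr;
    unsigned int* d_out = nullptr;

    if (cudaMalloc(&d_in, sizeof(h_in)) != cudaSuccess ||
        cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    mix_kernel<<<1, 32>>>(d_in, d_out, n);
    if (cudaGetLastError() != cudaSuccess) {
        std::fprintf(stderr, "kernel launch failed\n");
        return 1;
    }
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);

    // Compare the device result with the same function evaluated on the host.
    for (int i = 0; i < n; ++i)
        std::printf("%08x -> %08x (host reference %08x)\n",
                    h_in[i], h_out[i], mix(h_in[i]));

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

The PTX listing only shows what the virtual ISA does with the literals; the SASS disassembly is the part that reflects the real device, and it can differ between architectures and compiler versions.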
If I recall correctly, integer division (for example) is very costly, while some floating-point operations (like sinf(..)) are implemented entirely in hardware and are therefore fast.
This talk gave me some insight: "CUDA Tricks for Computational Physics" http://physics.bu.edu/~kbarros/talks/
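To see how these operations are actually compiled, a small kernel like the one below can be built and disassembled. This is a rough sketch rather than a benchmark: the kernel name, the Result struct, and the divisor 7 are invented for illustration. It contrasts division by a literal (which compilers can usually turn into multiplies and shifts) with division by a runtime value, and the standard sinf with the reduced-accuracy __sinf intrinsic (which sinf is replaced by when compiling with -use_fast_math).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical result record, one per thread.
struct Result {
    unsigned int q_const;  // quotient with a literal divisor
    unsigned int q_var;    // quotient with a runtime divisor
    float s_lib;           // sinf: standard single-precision sine
    float s_fast;          // __sinf: fast, reduced-accuracy intrinsic
};

// Divisor is a literal, so the compiler knows it at compile time.
__host__ __device__ inline unsigned int div_const(unsigned int v)
{
    return v / 7u;
}

// Divisor is only known at run time, so a general division is needed.
__host__ __device__ inline unsigned int div_var(unsigned int v, unsigned int d)
{
    return v / d;
}

__global__ void ops_kernel(Result* out, unsigned int d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    unsigned int a = 100u * i + 3u;  // arbitrary per-thread input
    float x = 0.25f * i;

    out[i].q_const = div_const(a);
    out[i].q_var   = div_var(a, d);
    out[i].s_lib   = sinf(x);
    out[i].s_fast  = __sinf(x);
}

int main()
{
    const int n = 4;
    Result h_out[n];
    Result* d_out = nullptr;

    if (cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    // The divisor is passed as a kernel argument so it is not a
    // compile-time constant inside the kernel.
    ops_kernel<<<1, 32>>>(d_out, 7u, n);
    if (cudaGetLastError() != cudaSuccess) {
        std::fprintf(stderr, "kernel launch failed\n");
        return 1;
    }
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);

    for (int i = 0; i < n; ++i)
        std::printf("i=%d q_const=%u q_var=%u sinf=%f __sinf=%f\n",
                    i, h_out[i].q_const, h_out[i].q_var,
                    h_out[i].s_lib, h_out[i].s_fast);

    cudaFree(d_out);
    return 0;
}
```

Disassembling the compiled kernel shows how many instructions each of the four operations expands to on a given architecture, which is a more reliable guide than a general rule of thumb.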