CUDA: What could cause nvcc to take several minutes to compile?
I have some CUDA code that nvcc (well, technically ptxas) takes upwards of 10 minutes to compile. While it isn't small, it certainly isn't huge (~5000 lines).
The delay seems to come and go between CUDA version updates, but previously it only took a minute or so instead of 10.
When I used the -v option, it seemed to get stuck after displaying the following:
ptxas --key="09ae2a85bb2d44b6" -arch=sm_13 "/tmp/tmpxft_00002ab1_00000000-2_trip3dgpu_kernel.ptx" -o "/tmp/tmpxft_00002ab1_00000000-9_trip3dgpu_kernel.sm_13.cubin"
The kernel does have a fairly large parameter list, and a structure with a good number of pointers is passed around, but I know that at one point very nearly the same code compiled in only a couple of seconds.
I am running 64-bit Ubuntu 9.04, if it helps.
Any ideas?
Comments (3)
I had a similar problem: without optimization, compilation failed because it ran out of registers, and with optimization it took nearly half an hour. My kernel had expressions like
and when I rewrote them:
it significantly reduced compilation time and register usage.
You should note that there is a limit on the size of the parameter list that can be passed to a function, currently 256 bytes (see section B.1.4 of the CUDA Programming Guide). Has the function changed at all?
There is also a limit of 2 million PTX instructions per kernel, but you shouldn't be approaching that ;-)
What version of the toolkit are you using? The 3.0 beta, which is a major update, is available if you are a registered developer. If you still have the problem, you should contact NVIDIA; they will of course need to be able to reproduce it.
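One way to keep an eye on the 256-byte parameter limit mentioned above is to have the compiler check the size of the struct that is passed to the kernel. The following is only a minimal sketch: the struct KernelParams, its fields, and the helper fits_kernel_param_limit are hypothetical names made up for illustration, not the asker's code, and static_assert assumes a C++11-capable compiler, which the toolkit versions discussed in this thread predate.

```cpp
#include <cstddef>

// Hypothetical parameter struct; the field names are illustrative only.
struct KernelParams {
    float* density;
    float* velocity_x;
    float* velocity_y;
    float* velocity_z;
    int    nx, ny, nz;
    float  dt;
};

// Limit quoted in the answer above (256 bytes for the toolkits discussed there).
constexpr std::size_t kMaxKernelParamBytes = 256;

// Returns true when an object of type T fits within the quoted limit.
template <typename T>
constexpr bool fits_kernel_param_limit() {
    return sizeof(T) <= kMaxKernelParamBytes;
}

// Compilation fails here if the struct grows past the limit.
static_assert(fits_kernel_param_limit<KernelParams>(),
              "KernelParams exceeds the 256-byte kernel parameter limit");
```

If the struct does outgrow the limit, a common alternative is to copy it to device memory and pass a single pointer to the kernel instead.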
Setting -maxrregcount 64 on the compile line helps, since it causes the register allocator to spill to local memory (lmem) earlier.