Help! CUDA kernel will no longer launch after using too much memory
I'm writing a program that requires the following kernel launch:
dim3 blocks(16,16,16); //grid dimensions
dim3 threads(32,32); //block dimensions
get_gaussian_responses<<<blocks,threads>>>(pDeviceIntegral,itgStepSize,pScaleSpace);
I forgot to free the pScaleSpace array at the end of the program, and then ran the program through the CUDA profiler, which runs it 15 times in succession, using up a lot of memory / causing a lot of fragmentation. Now whenever I run the program, the kernel doesn't even launch. If I look at the list of function calls recorded by the profiler, the kernel is not there. I realize this is a pretty stupid error, but I don't know what I can do at this point to get the program to run again. I have restarted my computer, but that did not help. If I reduce the dimensions of the kernel, it runs fine, but the current dimensions are well within the allowed maximum for my card.
Max threads per block: 1024
Max grid dimensions: 65535,65535,65535
Any suggestions appreciated, thanks in advance!
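For reference, a minimal sketch of how the launch status can be checked and the buffer released (the error-handling lines and the final cudaFree calls are illustrative additions, not part of the original code):
get_gaussian_responses<<<blocks,threads>>>(pDeviceIntegral,itgStepSize,pScaleSpace);
cudaError_t err = cudaGetLastError(); //catches invalid launch configurations
if (err != cudaSuccess)
    printf("kernel launch failed: %s\n", cudaGetErrorString(err));
cudaThreadSynchronize(); //wait for the kernel to finish before cleanup
cudaFree(pScaleSpace); //the free that was originally missing
cudaFree(pDeviceIntegral);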
3 Answers
Try launching with a smaller number of threads. If that works, it means that each of your threads is doing a lot of work or using a lot of memory, so the maximum possible number of threads cannot practically be launched by CUDA on your hardware.
You may have to make your CUDA code more efficient to be able to launch more threads. You could try slicing your kernel into smaller pieces if it has complex logic inside it. Or get more powerful hardware.
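As a rough illustration (assuming the kernel's index arithmetic can be adapted to a different decomposition), the same total number of threads can be launched with smaller blocks, which also lowers the per-block resource requirements:
dim3 blocks(32,32,16); //grid enlarged to keep the total thread count unchanged
dim3 threads(16,16); //256 threads per block instead of 1024
get_gaussian_responses<<<blocks,threads>>>(pDeviceIntegral,itgStepSize,pScaleSpace);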
If you build the code with ptxas verbose output enabled (see the command below), the assembler will report the amount of local memory that the code requires. This can be a useful diagnostic for seeing what the memory footprint of the kernel is. There is also an API call, cudaThreadSetLimit, which can be used to control the amount of per-thread heap memory that a kernel will try to consume during execution.
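A typical command that produces this report (assuming the nvcc toolchain; the sm_20 target is only an example and should match your card) is:
nvcc -arch=sm_20 -Xptxas="-v" -o program program.cu
And a sketch of the heap-limit call, which must be made before the first kernel launch in the context (the 128 MB figure is an arbitrary example, not a recommendation):
cudaThreadSetLimit(cudaLimitMallocHeapSize, 128 * 1024 * 1024); //per-context device heap for in-kernel malloc/new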
Recent toolkits ship with a utility called cuda-memcheck, which provides valgrind-like analysis of kernel memory access, including buffer overflows and illegal memory usage. It might be that your code is overflowing some memory somewhere and overwriting other parts of GPU memory, leaving the card in a parlous state.
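Usage is just a matter of running the existing binary under the tool, e.g. (assuming the executable is called program):
cuda-memcheck ./program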
I got it! nVidia NSight 2.0 - which supposedly supports CUDA 4 - changed my CUDA_INC_PATH to use CUDA 3.2. No wonder it wouldn't let me allocate 1024 threads per block. All relief and jubilation aside, that is a really stupid and annoying bug considering I already had CUDA 4.0 RC2 installed.
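One quick way to confirm which toolkit a build is actually picking up is to check the compiler version and the include path directly (Windows command prompt syntax assumed here, since NSight is a Visual Studio tool):
nvcc --version
echo %CUDA_INC_PATH%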