Help! CUDA kernel will no longer launch after using too much memory
I'm writing a program that requires the following kernel launch:
dim3 blocks(16,16,16); //grid dimensions
dim3 threads(32,32); //block dimensions
get_gaussian_responses<<<blocks,threads>>>(pDeviceIntegral,itgStepSize,pScaleSpace);
I forgot to free the pScaleSpace array at the end of the program, and then ran the program through the CUDA profiler, which runs it 15 times in succession, using up a lot of memory / causing a lot of fragmentation. Now whenever I run the program, the kernel doesn't even launch. If I look at the list of function calls recorded by the profiler, the kernel is not there. I realize this is a pretty stupid error, but I don't know what I can do at this point to get the program to run again. I have restarted my computer, but that did not help. If I reduce the dimensions of the kernel, it runs fine, but the current dimensions are well within the allowed maximum for my card.
Max threads per block: 1024
Max grid dimensions: 65535,65535,65535
Any suggestions appreciated, thanks in advance!
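For reference, a minimal sketch of how the launch status can be checked and the buffer released (the error-handling lines and the final cudaFree calls are illustrative additions, not part of the original code):
get_gaussian_responses<<<blocks,threads>>>(pDeviceIntegral,itgStepSize,pScaleSpace);
cudaError_t err = cudaGetLastError(); //catches invalid launch configurations
if (err != cudaSuccess)
    printf("kernel launch failed: %s\n", cudaGetErrorString(err));
cudaThreadSynchronize(); //wait for the kernel to finish before cleanup
cudaFree(pScaleSpace); //the free that was originally missing
cudaFree(pDeviceIntegral);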
3 Answers
Try launching with a smaller number of threads. If that works, it means that each of your threads is doing a lot of work or using a lot of memory, so the maximum possible number of threads cannot practically be launched by CUDA on your hardware.
You may have to make your CUDA code more efficient to be able to launch more threads. You could try slicing your kernel into smaller pieces if it has complex logic inside it. Or get more powerful hardware.
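As a rough illustration (assuming the kernel's index arithmetic can be adapted to a different decomposition), the same total number of threads can be launched with smaller blocks, which also lowers the per-block resource requirements:
dim3 blocks(32,32,16); //grid enlarged to keep the total thread count unchanged
dim3 threads(16,16); //256 threads per block instead of 1024
get_gaussian_responses<<<blocks,threads>>>(pDeviceIntegral,itgStepSize,pScaleSpace);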
If you build the code with ptxas verbose output enabled (see the command below), the assembler will report the amount of local memory that the code requires. This can be a useful diagnostic for seeing what the memory footprint of the kernel is. There is also an API call, cudaThreadSetLimit, which can be used to control the amount of per-thread heap memory that a kernel will try to consume during execution.
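A typical command that produces this report (assuming the nvcc toolchain; the sm_20 target is only an example and should match your card) is:
nvcc -arch=sm_20 -Xptxas="-v" -o program program.cu
And a sketch of the heap-limit call, which must be made before the first kernel launch in the context (the 128 MB figure is an arbitrary example, not a recommendation):
cudaThreadSetLimit(cudaLimitMallocHeapSize, 128 * 1024 * 1024); //per-context device heap for in-kernel malloc/new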
Recent toolkits ship with a utility called cuda-memcheck, which provides valgrind-like analysis of kernel memory access, including buffer overflows and illegal memory usage. It might be that your code is overflowing some memory somewhere and overwriting other parts of GPU memory, leaving the card in a parlous state.
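Usage is just a matter of running the existing binary under the tool, e.g. (assuming the executable is called program):
cuda-memcheck ./program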
I got it! nVidia NSight 2.0 - which supposedly supports CUDA 4 - changed my CUDA_INC_PATH to use CUDA 3.2. No wonder it wouldn't let me allocate 1024 threads per block. All relief and jubilation aside, that is a really stupid and annoying bug considering I already had CUDA 4.0 RC2 installed.
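One quick way to confirm which toolkit a build is actually picking up is to check the compiler version and the include path directly (Windows command prompt syntax assumed here, since NSight is a Visual Studio tool):
nvcc --version
echo %CUDA_INC_PATH%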