CUDA card occasionally crashes with a 'launch failure' mid-run, plus snow

Posted 2024-11-02 09:31:59

I would like to take a picture of what's happening on my screen, but a screenshot won't capture it; the best description is snow.

One of my projects has a habit of randomly failing on a new iteration, and I always assumed it was a 'You're using too much memory fool!' error, so was happy to restart, deal with it, and try to fix the problem.

Then I started actually monitoring the global memory assigned; it stays constant at around 70% free throughout execution, until suddenly dying on a fresh malloc.

To make matters more worrying, these Guru Meditations have started to habitually appear in my dmesg; all (that I've noticed) with the same address.

NVRM: Xid (0000:01:00): 13, 0008 00000000 000050c0 00000368 00000000 00000080 

Any words from the wise on what the hell is going on? I'm still continuing investigation into issues with register and shared memory, but wanted to start this question for any ideas anyone else has.
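A minimal sketch of the kind of monitoring described above, assuming the standard CUDA runtime API (the 64 MiB allocation size is arbitrary, just for illustration): log free/total device memory, then check what error a fresh `cudaMalloc` actually reports.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    size_t free_bytes = 0, total_bytes = 0;
    cudaMemGetInfo(&free_bytes, &total_bytes);
    printf("free: %zu MiB / total: %zu MiB\n",
           free_bytes >> 20, total_bytes >> 20);

    void *buf = nullptr;
    cudaError_t err = cudaMalloc(&buf, 64 << 20);  // 64 MiB, arbitrary
    if (err != cudaSuccess) {
        // cudaErrorMemoryAllocation means a true OOM; anything else
        // (e.g. a prior "unspecified launch failure" left pending)
        // points at a crashed kernel, not memory exhaustion.
        printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaFree(buf);
    return 0;
}
```

If the malloc "dies" with an error other than `cudaErrorMemoryAllocation`, the allocation is just the first call to notice a fault left behind by an earlier kernel.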

Comments (2)

深蓝 2024-11-09 09:32:00

If none of your CUDA memory allocations fail, then your problem isn't that you are out of memory (if you were it could be due to fragmentation, not necessarily due to 100%+ consumption).

If you are getting an x-mas tree effect, then you probably have a kernel that is writing outside its allocated memory. Check the indices of the pixels/array cells you are accessing and the memory-offset calculation of their position in the output buffers.

You can also try using a 1D index when invoking the kernels, to keep the calculations simpler.
(You can model any multi-dimensional array as one long 1D array.)
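The advice above can be sketched as a bounds-checked, flat-indexed kernel. The names (`scale_pixels`, `width`/`height`, the scale factor) are hypothetical, just to illustrate the pattern:

```cuda
#include <cuda_runtime.h>

// Treat a width x height image as one flat array and guard against
// writing past the end of the allocation.
__global__ void scale_pixels(float *out, const float *in, int n, float k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                   // without this guard, the last block can
        out[i] = in[i] * k;      // write past the buffer when n is not a
                                 // multiple of blockDim.x
}

void launch(float *d_out, const float *d_in, int width, int height) {
    int n = width * height;      // flatten the 2D indexing to 1D
    int threads = 256;
    int blocks = (n + threads - 1) / threads;  // round up
    scale_pixels<<<blocks, threads>>>(d_out, d_in, n, 1.5f);
}
```

With one flat index there is only a single offset calculation to get wrong, instead of separate row/column/pitch arithmetic.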

尤怨 2024-11-09 09:32:00

Please wrap all calls to CUDA Runtime API with cudaSafeCall() and add a cudaCheckError() after all kernel invocations. These utility functions are exposed in cutil.h. This should help you catch any CUDA errors at the point they actually happen and their error message should help your investigation.
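`cutil.h` shipped with the old CUDA SDK samples and is not part of the toolkit proper; if it isn't available, macros to the same effect take only a few lines. This is a sketch, not the SDK's exact implementation:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with file/line context if a runtime API call returns an error.
#define cudaSafeCall(call)                                            \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// After a kernel launch: catch launch-configuration errors immediately,
// then synchronize so asynchronous faults (like the Xid 13 "launch
// failure" above) surface here instead of at the next unrelated call.
#define cudaCheckError()                          \
    do {                                          \
        cudaSafeCall(cudaGetLastError());         \
        cudaSafeCall(cudaDeviceSynchronize());    \
    } while (0)
```

Usage: wrap runtime calls as `cudaSafeCall(cudaMalloc(&p, bytes));` and put `cudaCheckError();` immediately after each `kernel<<<...>>>()` launch.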
