Diagnosing CUDA kernel problems

Posted 2024-11-01 22:47:15


CUDA has documentation and guides all over the place, but I haven't been able to find any instruction on how to diagnose kernels that compile but then fail with cryptic, vague error messages such as 'unspecified launch failure', beyond the usual "Do these block/grid structures make sense?" checks.

Can I intercept the cubin file somehow and do some static analysis on the memory structures, etc.? Forgive my noobness, but I can't find a definitive idiot's guide anywhere.

Have a good weekend everyone.

What I'm looking for

How to separate out the cubin intermediate file
What to do with it afterwards to work out what's going on, specifically the register and memory configuration, to see whether my code is violating any hardware requirements or whether I'm just missing an off-by-one error somewhere.
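As a rough sketch of the second point, the register and memory footprint of a compiled kernel can also be queried at run time with cudaFuncGetAttributes and compared against the limits reported by cudaGetDeviceProperties. The kernel decomp_example and the helper report_kernel_resources are hypothetical names invented for this illustration, and device 0 is assumed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the one under investigation.
__global__ void decomp_example(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * i;
}

// Print the kernel's resource usage next to the limits of device 0 (assumed).
// Returns the first CUDA error encountered, or cudaSuccess.
cudaError_t report_kernel_resources()
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, decomp_example);
    if (err != cudaSuccess) return err;

    cudaDeviceProp prop;
    err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) return err;

    printf("registers per thread  : %d (device has %d per block)\n",
           attr.numRegs, prop.regsPerBlock);
    printf("static shared memory  : %zu bytes (limit %zu per block)\n",
           attr.sharedSizeBytes, prop.sharedMemPerBlock);
    printf("local memory / thread : %zu bytes\n", attr.localSizeBytes);
    printf("constant memory       : %zu bytes\n", attr.constSizeBytes);
    printf("max threads per block : %d for this kernel (device limit %d)\n",
           attr.maxThreadsPerBlock, prop.maxThreadsPerBlock);
    return cudaSuccess;
}
```

The same numbers are available at compile time: nvcc --ptxas-options=-v prints per-kernel register and memory usage, nvcc -cubin writes the cubin itself, and cuobjdump -sass disassembles it.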

For anyone coming across this later (I seem to have a habit of creating SO questions that keep showing up in my own queries months later...): CUDA-Memcheck gives much more interesting responses than the 'check error' handlers, e.g.

========= Error: process didn't terminate successfully
========= Invalid __global__ write of size 4
=========     at 0x00000040 in decomp
=========     by thread (1,0,0) in block (0,0,0)
=========     Address 0x00101024 is out of bounds
=========
========= ERROR SUMMARY: 1 error

I don't even have to explain that error message...
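For readers who do want it spelled out: the report says that thread (1,0,0) of block (0,0,0) wrote 4 bytes to an address outside any allocation. The source of decomp is not shown in the post, so the following is only a guess at the kind of off-by-one that produces such a report, together with a guarded variant. The names decomp_sketch and run_decomp_sketch are made up for this example.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for `decomp`; the real kernel is not shown in the post.
__global__ void decomp_sketch(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Buggy variant (an assumption about what could trigger such a report):
    //     out[i + 1] = i;   // the last thread writes one element past the end
    // Guarded variant: the index actually written is checked against n.
    if (i + 1 < n) out[i + 1] = i;
}

// Runs the guarded kernel on an n-element buffer using one block of n threads
// (assumes 0 < n <= 1024). Returns the resulting CUDA status.
cudaError_t run_decomp_sketch(int n)
{
    int *d_out = nullptr;
    cudaError_t err = cudaMalloc(&d_out, n * sizeof(int));
    if (err != cudaSuccess) return err;

    decomp_sketch<<<1, n>>>(d_out, n);
    err = cudaGetLastError();                        // launch configuration errors
    if (err == cudaSuccess) err = cudaDeviceSynchronize();  // execution errors

    cudaFree(d_out);
    return err;
}
```

A one-element overrun like the commented-out line may not crash in a normal run, which is why running the binary under cuda-memcheck is useful. In newer toolkits the equivalent tool is compute-sanitizer.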


因为看清所以看轻 2024-11-08 22:47:15

In CUDA, 'unspecified launch failure' is the equivalent of a segfault.
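Because kernel launches are asynchronous, this kind of failure usually only becomes visible on a later runtime call. A minimal sketch of where to check, assuming a made-up kernel fill and helper launch_and_check:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: one element per thread, guarded against overrun.
__global__ void fill(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = i;
}

// Launch the kernel and report both launch-time and execution-time errors.
// Assumes n > 0.
cudaError_t launch_and_check(int n)
{
    int *d_data = nullptr;
    cudaError_t err = cudaMalloc(&d_data, n * sizeof(int));
    if (err != cudaSuccess) return err;

    int threads = 128;
    int blocks = (n + threads - 1) / threads;   // round up to cover all n elements
    fill<<<blocks, threads>>>(d_data, n);

    // A bad launch configuration is reported immediately after the launch.
    err = cudaGetLastError();
    // A fault inside the kernel only surfaces once the host waits for the GPU.
    if (err == cudaSuccess) err = cudaDeviceSynchronize();

    if (err != cudaSuccess)
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));

    cudaFree(d_data);
    return err;
}
```

This tells you that a kernel died, and roughly which one, but not which thread or address was responsible.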

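Recent toolkit releases ship with a utility called cuda-memcheck. It does valgrind-like analysis of memory transactions inside executing kernels and will report buffer overflows or any illegal pointer usage in the kernel. You can use that as the launch point for further analysis. If you're using a Fermi card, there is also in-kernel printf support, so it isn't hard to write your own assert functions to test and report error conditions inside the kernel.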
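One possible shape for such a home-made assert, as a sketch only: KERNEL_CHECK, checked_write and run_checked_write are names invented here, and device-side printf needs a Fermi-class (compute capability 2.0) or newer GPU.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical assert-style macro: print the failed condition and the
// offending block/thread, then leave the kernel for that thread.
#define KERNEL_CHECK(cond)                                              \
    do {                                                                \
        if (!(cond)) {                                                  \
            printf("check failed: %s (%s:%d) block %u thread %u\n",     \
                   #cond, __FILE__, __LINE__, blockIdx.x, threadIdx.x); \
            return;                                                     \
        }                                                               \
    } while (0)

// Hypothetical kernel: threads with an out-of-range index report and exit.
__global__ void checked_write(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    KERNEL_CHECK(i < n);
    out[i] = i;
}

// Deliberately launches 4 threads more than there are elements, so the
// surplus threads trip the check instead of writing out of bounds.
// Assumes 0 < n <= 1020.
cudaError_t run_checked_write(int n)
{
    int *d_out = nullptr;
    cudaError_t err = cudaMalloc(&d_out, n * sizeof(int));
    if (err != cudaSuccess) return err;

    checked_write<<<1, n + 4>>>(d_out, n);
    err = cudaGetLastError();
    // Device printf output is flushed when the host synchronizes.
    if (err == cudaSuccess) err = cudaDeviceSynchronize();

    cudaFree(d_out);
    return err;
}
```

The early return is only safe in kernels that do not call __syncthreads() afterwards; otherwise the remaining threads of the block can hang waiting at the barrier.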

CUDA also ships with a source-level debugger, but you need a dedicated GPU to use it. If you are on Linux and only have a single GPU, exit X11 and run it from a console TTY.
