What does pycuda.debug actually do?
As part of a larger project, I've come across a strangely consistent bug that I can't get my head around; it is an archetypal 'black box' bug. When running with cuda-gdb python -m pycuda.debug prog.py -args, it runs fine, but slowly. If I drop pycuda.debug, it breaks: consistently, at exactly the same point in multiple-kernel execution.
To explain: I have (currently three) kernels, used in different grid and block arrangements to solve 'slices' of a larger optimisation problem. Strictly speaking, these should either work or not, as the functions themselves are told nothing but 'here's some more data'. Other than the contents of the data, they don't know anything, such as the iteration number or whether their input data is partitioned. Up until this one point, they perform perfectly.
Basically, I can't see what's happening without pycuda.debug exposing the debugging symbols to GDB, but I also can't see the problem WITH pycuda.debug.
What does pycuda.debug actually do, so I know what to look for in my kernel code?
Almost nothing. It mostly sets compiler flags in the pycuda.driver module so that CUDA code gets compiled with the necessary debugging symbols and assembled in the way cuda-gdb requires. The rest is a tiny wrapper that encapsulates the pycuda libraries so that everything works. The whole thing is about 20 lines of Python; you can see the code in the source distribution if you want.
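To make the mechanism concrete, here is a minimal sketch of what such a wrapper amounts to. It is not PyCUDA's actual source: the names DEBUGGING, set_debugging, nvcc_options and run_under_debug are hypothetical stand-ins. The only parts taken as given are the nvcc flags -g (host debug info) and -G (device debug info).

```python
import runpy
import sys

# Hypothetical stand-in for a module-level "debugging" switch.
DEBUGGING = False


def set_debugging(flag=True):
    """Turn the (hypothetical) global debugging switch on or off."""
    global DEBUGGING
    DEBUGGING = flag


def nvcc_options(user_options=()):
    """Return compiler options, appending debug flags when the switch is on."""
    options = list(user_options)
    if DEBUGGING:
        # -g: debug info for host code; -G: debug info for device code.
        options.extend(["-g", "-G"])
    return options


def run_under_debug(script_path, script_args=()):
    """Flip the switch, then run the target script as if it were __main__."""
    set_debugging(True)
    sys.argv = [script_path, *script_args]
    runpy.run_path(script_path, run_name="__main__")
```

The point of the sketch is that nothing in the wrapper touches your kernels directly; the only thing that changes is how the device code is compiled.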
The key thing here is that code built for the debugger spills everything from registers and shared memory to local memory, so that the driver can read local program state. So if you have code that runs when built for the debugger and fails when built normally, it usually means there is a shared memory buffer overflow or a pointer error that is causing the GPU equivalent of a segfault.
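Since the kernels in question are launched with several different block arrangements, one cheap thing to check on the host side is whether any launch configuration can index past a fixed-size shared array. The sketch below is purely illustrative: it assumes a hypothetical kernel that accesses buf[threadIdx.x + offset] in a shared array of shared_len elements, which may not match your indexing at all.

```python
def shared_overflow_threads(block_dim_x, shared_len, offset=0):
    """Return the thread indices whose access buf[threadIdx.x + offset]
    falls outside a shared array of shared_len elements."""
    return [t for t in range(block_dim_x)
            if not 0 <= t + offset < shared_len]


# Hypothetical case: a shared array sized for 256 threads,
# launched with three different block sizes.
for block in (128, 256, 512):
    bad = shared_overflow_threads(block, shared_len=256)
    if bad:
        print(block, f"threads {bad[0]}..{bad[-1]} out of bounds")
    else:
        print(block, "ok")
```

If a check like this flags one of your launch configurations, that is a good place to start looking. Running the non-debug build under a memory checker such as cuda-memcheck, if you have it available, is another way to localise the bad access.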