PyCUDA+Threading = invalid handle in kernel invocation

Posted 2024-11-05 18:17:49


I'll try and make this clear:

I've got two classes: GPU(Object), for general access to GPU functionality, and multifunc(threading.Thread) for a particular function I'm trying to multi-device-ify. GPU contains most of the 'first time' processing needed for all subsequent use cases, so multifunc gets called from GPU, with the GPU instance passed as an __init__ argument (along with the usual queues and such).
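
To make the arrangement concrete, here's a minimal sketch of the structure being described; everything beyond the GPU and multifunc names (the queue, the dispatch method) is an illustrative assumption, not my actual code:

    import threading
    import Queue  # Python 2.7, to match the traceback below

    class GPU(object):
        def __init__(self):
            self.work_queue = Queue.Queue()
            # ... 'first time' processing: templating, allocations, etc. ...

        def dispatch(self):
            worker = multifunc(self)  # hand this GPU instance to the thread
            worker.start()

    class multifunc(threading.Thread):
        def __init__(self, gpu):
            threading.Thread.__init__(self)
            self.gpu = gpu  # shared access back to the GPU object

        def run(self):
            # kernel preparation and launches happen in here
            pass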

Unfortunately, multifunc craps out with:

File "/home/bolster/workspace/project/gpu.py", line 438, in run
    prepare(d_A,d_B,d_XTG,offset,grid=N_grid,block=N_block)
  File "/usr/local/lib/python2.7/dist-packages/pycuda-0.94.2-py2.7-linux-x86_64.egg/pycuda/driver.py", line 158, in function_call
    func.set_block_shape(*block)
LogicError: cuFuncSetBlockShape failed: invalid handle

First port of call was of course the block dimensions, but they are well within range (the behaviour is the same even if I force block=(1,1,1)), and likewise for grid.

Basically, within multifunc, all of the usual CUDA memalloc etc. functions work fine (implying it's not a context problem), so the problem must be with the SourceModule-ing of the kernel function itself.

I have a kernel template containing all my CUDA code that's file-scoped, and templating is done with jinja2 in the GPU initialisation. Regardless of whether that templated object is converted to a SourceModule object in GPU and passed to multifunc, or whether it's converted within multifunc, the same thing happens.
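
For concreteness, the flow is roughly the following (a sketch: the template body and the scale parameter are invented for illustration; only the prepare name comes from the traceback above):

    from jinja2 import Template
    from pycuda.compiler import SourceModule

    kernel_template = Template("""
    __global__ void prepare(float *A, float *B, float *XTG, int offset)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        A[i + offset] = {{ scale }} * B[i];
        XTG[i] = A[i + offset];
    }
    """)

    rendered = kernel_template.render(scale=2.0)  # done in GPU initialisation
    # NB: SourceModule compiles and loads into whatever context is
    # current at the moment it runs
    mod = SourceModule(rendered)
    prepare = mod.get_function("prepare")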

Google has been largely useless for this particular issue, but following the stack, I'm assuming the Invalid Handle being referred to is the kernel function handle rather than anything strange going on with the block dimensions.

I'm aware this is a very corner-case situation, but I'm sure someone can see a problem that I've missed.


Comments (2)

云柯 2024-11-12 18:17:49

The reason is context affinity. Every CUDA function instance is tied to a context, and they are not portable (the same applies to memory allocations and texture references). So each context must load the function instance separately, and then use the function handle returned by that load operation.

If you are not using metaprogramming at all, you might find it simpler to compile your CUDA code to a cubin file, and then load the functions you need from the cubin to each context with driver.module_from_file. Cutting and pasting directly from some production code of mine:

# Imports assumed at module level (this fragment sits inside a
# class method, hence the self and autoinit references)
import numpy as np
from warnings import warn

import pycuda.driver as driver
import pycuda.tools as tools

# Context establishment
try:
    if (autoinit):
        import pycuda.autoinit
        self.context = None
        self.device = pycuda.autoinit.device
        self.computecc = self.device.compute_capability()
    else:
        driver.init()
        self.context = tools.make_default_context()
        self.device = self.context.get_device()
        self.computecc = self.device.compute_capability()

    # GPU code initialization
    # load pre compiled CUDA code from cubin file
    # Select the cubin based on the supplied dtype
    # cubin names contain C++ mangling because of
    # templating. Ugly but no easy way around it
    if self.computecc == (1,3):
        self.fimcubin = "fim_sm13.cubin"
    elif self.computecc[0] == 2:
        self.fimcubin = "fim_sm20.cubin"
    else:
        raise NotImplementedError("GPU architecture not supported")

    fimmod = driver.module_from_file(self.fimcubin)

    IterateName32 = "_Z10fimIterateIfLj8EEvPKT_PKiPS0_PiS0_S0_S0_jjji"
    IterateName64 = "_Z10fimIterateIdLj8EEvPKT_PKiPS0_PiS0_S0_S0_jjji"

    if (self.dtype == np.float32):
        IterateName = IterateName32
    elif (self.dtype == np.float64):
        IterateName = IterateName64
    else:
        raise TypeError

    self.fimIterate = fimmod.get_function(IterateName)

except ImportError:
    warn("Could not initialise CUDA context")
冰之心 2024-11-12 18:17:49


Typical; as soon as I write the question I work it out.

The issue was the SourceModule call operating outside of an active context. To fix it, I moved the SourceModule invocation into the run function in the thread, below the CUDA context setup.

Leaving this up for a while because I'm sure someone else has a better explanation!
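
In code terms, the working shape is roughly this (a sketch with illustrative names; rendered_source stands in for the templated kernel string held by the GPU instance):

    import threading
    import pycuda.driver as driver
    from pycuda.compiler import SourceModule

    class multifunc(threading.Thread):
        def __init__(self, gpu, device_num):
            threading.Thread.__init__(self)
            self.gpu = gpu
            self.device_num = device_num

        def run(self):
            # Context setup first, inside the thread that will use it
            driver.init()
            ctx = driver.Device(self.device_num).make_context()
            try:
                # Only now compile: the module, and the function handles
                # obtained from it, are tied to *this* context
                mod = SourceModule(self.gpu.rendered_source)
                prepare = mod.get_function("prepare")
                # ... memallocs and kernel launches ...
            finally:
                ctx.pop()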
