Python Multiprocessing with PyCUDA
I've got a problem that I want to split across multiple CUDA devices, but I suspect my current system architecture is holding me back;
What I've set up is a GPU class, with functions that perform operations on the GPU (strange that). These operations are of the style
for iteration in range(maxval):
    result[iteration] = gpuinstance.gpufunction(arguments, iteration)
I'd imagined that there would be N gpuinstances for N devices, but I don't know enough about multiprocessing to see the simplest way of applying this so that each device is assigned work asynchronously. Strangely, few of the examples I came across gave concrete demonstrations of collating results after processing.
Can anyone give me any pointers in this area?
UPDATE
Thank you Kaloyan for your guidance in terms of the multiprocessing area; if CUDA wasn't specifically the sticking point I'd be marking you as answered. Sorry.
Prior to playing with this implementation, the gpuinstance class initiated the CUDA device with import pycuda.autoinit, but that didn't appear to work, throwing invalid context errors as soon as each (correctly scoped) thread met a CUDA command. I then tried manual initialisation in the __init__ constructor of the class with...
pycuda.driver.init()
self.mydev = pycuda.driver.Device(devid)  # devid is passed at instantiation of the class
self.ctx = self.mydev.make_context()
self.ctx.push()
My assumption here is that the context is preserved between when the list of gpuinstances is created and when the threads use them, so each device is sitting pretty in its own context.
(I also implemented a destructor to take care of pop/detach cleanup; a sketch of what that might look like follows.)
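For illustration only, a destructor of the sort described might look like this (a sketch, assuming the self.ctx set up in the constructor above):

    def __del__(self):
        # mirror the constructor: pop this device's context off the
        # current thread's stack, then detach so it can be destroyed
        self.ctx.pop()
        self.ctx.detach()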
The problem is that invalid context exceptions still appear as soon as the thread tries to touch CUDA.
Any ideas, folks? And thanks for getting this far. Automatic upvotes for people working 'banana' into their answer! :P
2 Answers
You need to get all your bananas lined up on the CUDA side of things first, then think about the best way to get this done in Python [shameless rep whoring, I know].
The CUDA multi-GPU model is pretty straightforward pre 4.0 - each GPU has its own context, and each context must be established by a different host thread. So the idea in pseudocode is:
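Something along the following lines (a paraphrase of the scheme just described, not the answer's original listing):

    launch N host threads, one per CUDA device
    in each thread:
        create (push) a context on that thread's device
        do the kernel launches, memcpys, etc. inside that context
        pop and destroy the context before the thread exits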
In Python, this might look something like this:
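A minimal sketch of that pattern, assuming one threading.Thread per device that creates, uses, and pops its own context (the GpuThread name is illustrative):

    import threading
    import pycuda.driver as cuda

    class GpuThread(threading.Thread):
        def __init__(self, gpuid):
            threading.Thread.__init__(self)
            self.gpuid = gpuid

        def run(self):
            # the context must be made by the thread that will use it
            ctx = cuda.Device(self.gpuid).make_context()
            try:
                # kernel launches, memcpys, gpuarray work, etc. go here
                print("thread %d on %s" % (self.gpuid, ctx.get_device().name()))
            finally:
                ctx.pop()  # release the context before the thread exits
                del ctx

    cuda.init()
    threads = [GpuThread(i) for i in range(cuda.Device.count())]
    for t in threads:
        t.start()
    for t in threads:
        t.join()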
This assumes it is safe to just establish a context without any checking of the device beforehand. Ideally you would check the compute mode to make sure it is safe to try, then use an exception handler in case a device is busy. But hopefully this gives the basic idea.
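A hedged sketch of that check, using the device_attribute and compute_mode enums that pycuda.driver exposes (device 0 is chosen arbitrarily here):

    import pycuda.driver as cuda

    cuda.init()
    dev = cuda.Device(0)  # illustrative: probe device 0
    mode = dev.get_attribute(cuda.device_attribute.COMPUTE_MODE)
    if mode == cuda.compute_mode.PROHIBITED:
        raise RuntimeError("device is compute-prohibited; don't try a context here")
    try:
        ctx = dev.make_context()
    except cuda.LogicError:
        ctx = None  # e.g. an exclusive-mode device that is already busy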
What you need is a multi-threaded implementation of the map built-in function. Here is one implementation. With a little modification to suit your particular needs, you get the pattern sketched below; it is more or less the same as what you have above, with the big difference being that you don't spend time waiting for each single completion of the gpufunction.
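A hedged sketch of such a threaded map driving the gpuinstances (the tmap name and the argument shapes are illustrative, not taken from the linked recipe):

    import threading

    def tmap(fn, *seqs):
        # apply fn across the zipped sequences, one thread per call,
        # collecting the results in their original order
        results = [None] * len(seqs[0])
        def worker(i, call_args):
            results[i] = fn(*call_args)
        threads = [threading.Thread(target=worker, args=(i, a))
                   for i, a in enumerate(zip(*seqs))]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return results

    # e.g. one gpuinstance per device, each handed its own slice of work:
    # results = tmap(lambda gpu, chunk: gpu.gpufunction(arguments, chunk),
    #                gpuinstances, chunks)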