Python multiprocessing with PyCUDA

Posted 2024-11-05 12:23:21

I've got a problem that I want to split across multiple CUDA devices, but I suspect my current system architecture is holding me back.

What I've set up is a GPU class, with functions that perform operations on the GPU (strange that). These operations are of the style

for iteration in range(maxval):
    result[iteration]=gpuinstance.gpufunction(arguments,iteration)

I'd imagined that there would be N gpuinstances for N devices, but I don't know enough about multiprocessing to see the simplest way of applying this so that each device is assigned work asynchronously. Strangely, few of the examples I came across gave concrete demonstrations of collating results after processing.

Can anyone give me any pointers in this area?

UPDATE
Thank you Kaloyan for your guidance in terms of the multiprocessing area; if CUDA wasn't specifically the sticking point I'd be marking you as answered. Sorry.

Previously, before playing with this implementation, the gpuinstance class initialised the CUDA device with import pycuda.autoinit, but that didn't appear to work: it threw invalid context errors as soon as each (correctly scoped) thread met a CUDA command. I then tried manual initialisation in the __init__ constructor of the class with...

pycuda.driver.init()
self.mydev=pycuda.driver.Device(devid) #this is passed at instantiation of class
self.ctx=self.mydev.make_context()
self.ctx.push()    

My assumption here is that the context is preserved between the time the list of gpuinstances is created and the time the threads use them, so each device is sitting pretty in its own context.

(I also implemented a destructor to take care of pop/detach cleanup)

Problem is, invalid context exceptions are still appearing as soon as the thread tries to touch CUDA.

Any ideas, folks? And thanks for getting this far. Automatic upvotes for people working 'banana' into their answer! :P

Comments (2)

回梦 2024-11-12 12:23:21

You need to get all your bananas lined up on the CUDA side of things first, then think about the best way to get this done in Python [shameless rep whoring, I know].

The CUDA multi-GPU model is pretty straightforward before CUDA 4.0: each GPU has its own context, and each context must be established by a different host thread. So the idea in pseudocode is:

  1. Application starts, process uses the API to determine the number of usable GPUs (beware things like compute mode in Linux)
  2. Application launches a new host thread per GPU, passing a GPU id. Each thread implicitly/explicitly calls equivalent of cuCtxCreate() passing the GPU id it has been assigned
  3. Profit!

In Python, this might look something like this:

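import threading
from pycuda import driver

class gpuThread(threading.Thread):
    def __init__(self, gpuid):
        threading.Thread.__init__(self)
        self.gpuid = gpuid

    def run(self):
        # Create the context here rather than in __init__: __init__ runs in
        # the parent thread, and the context must belong to this thread.
        self.ctx = driver.Device(self.gpuid).make_context()
        self.device = self.ctx.get_device()
        try:
            print("%s has device %s, api version %s"
                  % (self.name, self.device.name(), self.ctx.get_api_version()))
            # Profit!
        finally:
            # Release the context from the thread that created it.
            self.ctx.detach()

driver.init()
ngpus = driver.Device.count()
threads = [gpuThread(i) for i in range(ngpus)]
for t in threads:
    t.start()
# Join only after every thread has started, so the GPUs run concurrently.
for t in threads:
    t.join()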

This assumes it is safe to just establish a context without any checking of the device beforehand. Ideally you would check the compute mode to make sure it is safe to try, then use an exception handler in case a device is busy. But hopefully this gives the basic idea.
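The compute-mode check and exception handler mentioned above could be sketched roughly as follows. This is only an illustration: try_make_context is a hypothetical helper name, and the driver module is passed in as an argument purely so the logic can be tried with a stand-in object on a machine without a GPU. With real hardware you would pass pycuda.driver (after driver.init()) and call the helper from the thread that will use the GPU.

```python
def try_make_context(driver, gpuid):
    """Return a new context on device `gpuid`, or None if it cannot be used.

    `driver` is expected to be the pycuda.driver module (already initialised).
    try_make_context is a hypothetical helper, not part of PyCUDA.
    """
    dev = driver.Device(gpuid)
    mode = dev.get_attribute(driver.device_attribute.COMPUTE_MODE)
    if mode == driver.compute_mode.PROHIBITED:
        # No contexts may be created on this device at all.
        return None
    try:
        return dev.make_context()
    except driver.Error:
        # For example, an exclusive-mode device that is already busy.
        return None
```

A thread that gets None back can simply exit, leaving the remaining devices to do the work.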

往昔成烟 2024-11-12 12:23:21

What you need is a multi-threaded implementation of the map built-in function. Here is one implementation which, with a little modification to suit your particular needs, gives you:

import threading

def cuda_map(args_list, gpu_instances):
    # One result slot per task, so the output order matches the input order.
    result = [None] * len(args_list)

    def task_wrapper(gpu_instance, task_indices):
        for i in task_indices:
            result[i] = gpu_instance.gpufunction(args_list[i])

    # Thread i handles tasks i, i + N, i + 2N, ... for N gpu instances.
    threads = [threading.Thread(
                    target=task_wrapper,
                    args=(gpu_i, range(i, len(args_list), len(gpu_instances)))
              ) for i, gpu_i in enumerate(gpu_instances)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    return result

It is more or less the same as what you have above, with the big difference being that you don't spend time waiting for each single completion of the gpufunction.
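cuda_map hands each instance a fixed stripe of tasks. If the tasks vary in cost, or the devices in speed, one possible variation is to let each worker pull its next task from a shared queue instead. The sketch below assumes Python 3; cuda_map_queued and FakeGpu are made-up names, and FakeGpu only stands in for the real GPU class so the example runs without CUDA.

```python
import queue
import threading

def cuda_map_queued(args_list, gpu_instances):
    """Like cuda_map, but each worker takes the next pending task from a queue."""
    tasks = queue.Queue()
    for item in enumerate(args_list):
        tasks.put(item)              # (index, arguments) pairs
    result = [None] * len(args_list)

    def worker(gpu_instance):
        while True:
            try:
                i, args = tasks.get_nowait()
            except queue.Empty:
                return               # nothing left to do
            # Writing to slot i keeps results in input order.
            result[i] = gpu_instance.gpufunction(args)

    threads = [threading.Thread(target=worker, args=(g,))
               for g in gpu_instances]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result

class FakeGpu:
    """Stand-in for the real GPU class; it just squares its argument."""
    def __init__(self, devid):
        self.devid = devid

    def gpufunction(self, x):
        return x * x

print(cuda_map_queued([1, 2, 3, 4, 5], [FakeGpu(0), FakeGpu(1)]))
# prints [1, 4, 9, 16, 25]
```

Either way, each real gpu instance still has to create its CUDA context inside the thread that calls gpufunction.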
