Gunicorn + CUDA: Cannot re-initialize CUDA in forked subprocess


I am creating an inference service with torch, gunicorn and flask that should use CUDA. To reduce resource requirements, I use the preload option of gunicorn, so the model is shared between the worker processes. However, this leads to an issue with CUDA. The following code snippet shows a minimal reproducing example:

from flask import Flask, request
import torch

app = Flask('dummy')

model = torch.rand(500)
model = model.to('cuda:0')


@app.route('/', methods=['POST'])
def f():
    data = request.get_json()
    x = torch.rand((data['number'], 500))
    x = x.to('cuda:0')
    res = x * model
    return {
        "result": res.sum().item()
    }

Starting the server with CUDA_VISIBLE_DEVICES=1 gunicorn -w 3 -b $HOST_IP:8080 --preload run_server:app lets the service start successfully. However, as soon as the first request is made (curl -X POST -d '{"number": 1}'), the worker throws the following error:

[2022-06-28 09:42:00,378] ERROR in app: Exception on / [POST]
Traceback (most recent call last):
  File "/home/user/.local/lib/python3.6/site-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/home/user/.local/lib/python3.6/site-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/home/user/.local/lib/python3.6/site-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/home/user/.local/lib/python3.6/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/home/user/.local/lib/python3.6/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/home/user/.local/lib/python3.6/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/home/user/project/run_server.py", line 14, in f
    x = x.to('cuda:0')
  File "/home/user/.local/lib/python3.6/site-packages/torch/cuda/__init__.py", line 195, in _lazy_init
    "Cannot re-initialize CUDA in forked subprocess. " + msg)
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

I load the model in the parent process and it's accessible to each forked worker process. The problem occurs when creating a CUDA-backed tensor in the worker process. This re-initializes the CUDA context in the worker process, which fails because it was already initialized in the parent process. If we set x = data['number'] and remove x = x.to('cuda:0'), the inference succeeds.

Adding torch.multiprocessing.set_start_method('spawn') or multiprocessing.set_start_method('spawn') doesn't change anything, probably because Gunicorn uses fork to start its workers regardless of the start method when the --preload option is used.

A solution could be not using the --preload option, which leads to multiple copies of the model in memory/GPU. But this is what I am trying to avoid.

Is there any possibility to overcome this issue without loading the model separately in each worker process?

4 Answers

萌︼了一个春 2025-02-18 05:26:13


Reason for the Error

As correctly stated in the comments by @Newbie, the issue isn't the model itself but the CUDA context. When a new child process is forked, the parent's memory is shared copy-on-write with the child, but the CUDA context doesn't support this kind of sharing; it would have to be re-initialized in the child, which PyTorch refuses to do after a fork. Hence it reports the above-mentioned error.
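
For reference, the failure can be reproduced outside of Gunicorn with a few lines (a minimal sketch, assuming a CUDA-capable machine): initialize CUDA in the parent, then force the fork start method for the child.

import torch
import torch.multiprocessing as mp


def use_cuda():
    # Raises RuntimeError: Cannot re-initialize CUDA in forked subprocess. ...
    torch.zeros(1).cuda()


if __name__ == '__main__':
    torch.zeros(1).cuda()            # the CUDA context is created in the parent
    ctx = mp.get_context('fork')     # force fork, mirroring what Gunicorn does
    p = ctx.Process(target=use_cuda)
    p.start()
    p.join()
    print("child exit code:", p.exitcode)  # non-zero, because the child raised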

Spawn instead of Fork

To resolve this issue, we have to change the start method for the child processes from fork to spawn with multiprocessing.set_start_method. The following simple example works fine:

import torch
import torch.multiprocessing as mp


def f(y):
    y[0] = 1000


if __name__ == '__main__':
    x = torch.zeros(1).cuda()
    x.share_memory_()

    mp.set_start_method('spawn')
    p = mp.Process(target=f, args=(x,), daemon=True)
    p.start()
    p.join()
    print("x =", x.item())

When running this code, a second CUDA context is initialized (this can be observed via watch -n 1 nvidia-smi in a second window), and f is executed once the context has been fully initialized. After this, x = 1000.0 is printed on the console, confirming that the tensor x was successfully shared between the processes.

However, Gunicorn internally uses os.fork to start the worker processes, so multiprocessing.set_start_method has no influence on Gunicorn's behavior. Consequently, initializing the CUDA context in the root process must be avoided.
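
To see why set_start_method doesn't help here, a small sketch (not from the original answer, and assuming a CUDA-capable machine): even with spawn configured, a child created directly with os.fork(), which is what Gunicorn does, still hits the same error.

import os
import torch
import torch.multiprocessing as mp

if __name__ == '__main__':
    mp.set_start_method('spawn')   # affects multiprocessing only, not os.fork
    torch.zeros(1).cuda()          # CUDA context in the parent
    pid = os.fork()                # what Gunicorn's arbiter does for each worker
    if pid == 0:                   # child
        try:
            torch.zeros(1).cuda()
        except RuntimeError as e:
            print("child:", e)     # Cannot re-initialize CUDA in forked subprocess ...
        os._exit(0)
    os.waitpid(pid, 0)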

Solution for Gunicorn

To share the model among the worker processes, we must therefore load it in a single process and share it with the workers. Luckily, sending a CUDA tensor via a torch.multiprocessing.Queue to another process doesn't copy the underlying data on the GPU, so we can use such queues for this problem.

import time

import torch
import torch.multiprocessing as mp


def f(q):
    y = q.get()
    y[0] = 1000


def g(q):
    x = torch.zeros(1).cuda()
    x.share_memory_()
    q.put(x)  # one handle for the consumer process f
    q.put(x)  # and a second one for the main process below
    while True:
        time.sleep(1)  # this process must live as long as x is in use


if __name__ == '__main__':
    queue = mp.Queue()
    pf = mp.Process(target=f, args=(queue,), daemon=True)
    pf.start()
    pg = mp.Process(target=g, args=(queue,), daemon=True)
    pg.start()
    pf.join()
    x = queue.get()
    print("x =", x.item())  # Prints x = 1000.0

For the Gunicorn server, we can use the same strategy: a model server process loads the model and serves it to each new worker process after its fork. In the post_fork hook, the worker requests and receives the model from the model server. A Gunicorn configuration (config.py, as used in the test below) could look like this:

import logging

from client import request_model
from app import app

logging.basicConfig(level=logging.INFO)

bind = "localhost:8080"
workers = 1
zmq_url = "tcp://127.0.0.1:5555"


def post_fork(server, worker):
    app.config['MODEL'], app.config['COUNTER'] = request_model(zmq_url)

In the post_fork hook, we call request_model to get a model from the model server and store it in the configuration of the Flask application. In my example, the function request_model lives in the file client.py and is defined as follows:

import logging
import os

from torch.multiprocessing.reductions import ForkingPickler
import zmq


def request_model(zmq_url: str):
    logging.info("Connecting")
    context = zmq.Context()
    with context.socket(zmq.REQ) as socket:
        socket.connect(zmq_url)
        logging.info("Sending request")
        socket.send(ForkingPickler.dumps(os.getpid()))
        logging.info("Waiting for a response")
        model = ForkingPickler.loads(socket.recv())
    logging.info("Got response from object server")
    return model

We make use of ZeroMQ for inter-process communication here because it allows us to reference the server by name/address and to move the server code into its own application. multiprocessing.Queue and multiprocessing.Process apparently don't work well with Gunicorn. multiprocessing.Queue uses the ForkingPickler internally to serialize objects, and the module torch.multiprocessing extends it so that Torch data structures are serialized appropriately and reliably. So we use this class to serialize our model and send it to the worker processes.

The model is loaded and served in an application that is completely separate from Gunicorn and defined in server.py:

from argparse import ArgumentParser
import logging

import torch
from torch.multiprocessing.reductions import ForkingPickler
import zmq


def load_model():
    model = torch.nn.Linear(10000, 50000)
    model.cuda()
    model.share_memory()

    counter = torch.zeros(1).cuda()
    counter.share_memory_()
    return model, counter


def share_object(obj, url):
    context = zmq.Context()
    socket = context.socket(zmq.REP)
    socket.bind(url)
    while True:
        logging.info("Waiting for requests on %s", url)
        message = socket.recv()
        logging.info("Got a message from %d", ForkingPickler.loads(message))
        socket.send(ForkingPickler.dumps(obj))


if __name__ == '__main__':
    parser = ArgumentParser(description="Serve model")
    parser.add_argument("--listen-address", default="tcp://127.0.0.1:5555")
    args = parser.parse_args()

    logging.basicConfig(level=logging.INFO)
    logging.info("Loading model")
    model = load_model()
    share_object(model, args.listen_address)

For this test, we use a model of about 2GB in size to see an effect on the GPU memory allocation in nvidia-smi and a small tensor to verify that the data is actually shared among the processes.

Our sample Flask application runs the model on a random input, counts the number of requests, and returns both results:

from flask import Flask
import torch

app = Flask(__name__)


@app.route("/", methods=["POST"])
def infer():
    model: torch.nn.Linear = app.config['MODEL']
    counter: torch.Tensor = app.config['COUNTER']
    counter[0] += 1  # not thread-safe
    input_features = torch.rand(model.in_features).cuda()
    return {
        "result": model(input_features).sum().item(),
        "counter": counter.item()
    }

Test

The example can be run as follows:

$ python server.py &
INFO:root:Waiting for requests on tcp://127.0.0.1:5555 
$ gunicorn -c config.py app:app
[2023-02-01 16:45:34 +0800] [24113] [INFO] Starting gunicorn 20.1.0
[2023-02-01 16:45:34 +0800] [24113] [INFO] Listening at: http://127.0.0.1:8080 (24113)
[2023-02-01 16:45:34 +0800] [24113] [INFO] Using worker: sync
[2023-02-01 16:45:34 +0800] [24186] [INFO] Booting worker with pid: 24186
INFO:root:Connecting
INFO:root:Sending request
INFO:root:Waiting for a response
INFO:root:Got response from object server

Using nvidia-smi, we can observe that two processes are now using the GPU, one of which allocates about 2 GB more VRAM than the other. Querying the Flask application also works as expected:

$ curl -X POST localhost:8080
{"counter":1.0,"result":-23.956459045410156} 
$ curl -X POST localhost:8080
{"counter":2.0,"result":-8.161510467529297}
$ curl -X POST localhost:8080
{"counter":3.0,"result":-37.823692321777344}

Let's introduce some chaos and terminate our only Gunicorn worker:

$ kill 24186
[2023-02-01 18:02:09 +0800] [24186] [INFO] Worker exiting (pid: 24186)
[2023-02-01 18:02:09 +0800] [4196] [INFO] Booting worker with pid: 4196
INFO:root:Connecting
INFO:root:Sending request
INFO:root:Waiting for a response
INFO:root:Got response from object server

The worker restarts properly and is ready to answer our requests.

Benefit

Initially, the amount of required VRAM for our service was (SizeOf(Model) + SizeOf(CUDA context)) * Num(Workers). By sharing the weights of the model, we can reduce this by SizeOf(Model) * (Num(Workers) - 1) to SizeOf(Model) + SizeOf(CUDA context) * Num(Workers).
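
To make the formula concrete, here is a quick back-of-the-envelope calculation (the CUDA context size below is an assumption for illustration, not a measurement from this test):

GiB = 1024 ** 3
model = 2 * GiB            # roughly the size of the test model above
cuda_context = 0.5 * GiB   # assumed per-process CUDA context overhead; varies with GPU and driver
workers = 3

naive = (model + cuda_context) * workers    # every worker holds its own copy
shared = model + cuda_context * workers     # one shared copy, one context per worker
print(f"naive:  {naive / GiB:.1f} GiB")     # naive:  7.5 GiB
print(f"shared: {shared / GiB:.1f} GiB")    # shared: 3.5 GiB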

Caveats

The reliability of this approach depends on the single model server process. If that process terminates, not only will newly started workers hang, but the models in the existing workers will become unavailable, so all workers fail at once. The shared tensors/models are only available as long as the server process is running. Even if the model server and the Gunicorn workers are restarted, a short outage is unavoidable. In a production environment, you should therefore make sure this server process is kept alive.

Additionally, sharing data among different processes can have side effects. When sharing mutable data, proper locking must be used to avoid race conditions.
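
As an illustration (a minimal sketch, not part of the setup above), updates to a shared counter tensor can be guarded with a multiprocessing.Lock; here plain spawn-started processes stand in for the Gunicorn workers:

import torch
import torch.multiprocessing as mp


def bump(counter, lock, n):
    for _ in range(n):
        with lock:                    # serialize the read-modify-write on the shared tensor
            counter[0] += 1
            torch.cuda.synchronize()  # ensure the update has landed before releasing the lock


if __name__ == '__main__':
    mp.set_start_method('spawn')      # required for sharing CUDA tensors between processes
    counter = torch.zeros(1).cuda()
    counter.share_memory_()
    lock = mp.Lock()

    procs = [mp.Process(target=bump, args=(counter, lock, 1000)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(counter.item())             # 4000.0; without the lock, some updates could be lost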

假扮的天使 2025-02-18 05:26:13


I was getting this error during inference, where I was trying to serve a model using KServe==0.10.0. After spending a week trying to fix it in other ways, simply upgrading to 0.11.0 solved it for me.

相对绾红妆 2025-02-18 05:26:13


When I add import torch.multiprocessing as mp and mp.set_start_method('spawn'), and remove precision=16, this code works for me: trainer = pl.Trainer(accelerator="gpu", devices=2, max_epochs=2).

毅然前行 2025-02-18 05:26:13


You can use "gevent" instead of gunivorn. I solved this problem by using it.
