GPU runs out of memory when training an ML model

Posted 2025-01-21 23:15:23

I am trying to train an ML model using Dask. I am training on my local machine with a single GPU, which has 24 GiB of memory.

from dask_cuda import LocalCUDACluster
from dask.distributed import Client, LocalCluster

import dask.dataframe as dd
import pandas as pd
import numpy as np
import os
import xgboost as xgb

np.random.seed(42)

# NUM_FEATURES, FILENAME and TARGET are used below but were not defined in the
# original post; the values here are placeholders for illustration only.
NUM_FEATURES = 100
FILENAME = "data.csv"
TARGET = "target"


def get_columns(filename):
    return pd.read_csv(filename, nrows=10).iloc[:, :NUM_FEATURES].columns


def get_data(filename, target):
    import dask_cudf
    # Read the whole CSV into GPU memory as a dask_cudf DataFrame.
    X = dask_cudf.read_csv(filename)
    # X = dd.read_csv(filename, assume_missing=True)
    y = X[[target]]
    X = X.iloc[:, :NUM_FEATURES]
    return X, y


def main(client: Client) -> None:
    X, y = get_data(FILENAME, TARGET)
    model = xgb.dask.DaskXGBRegressor(
        tree_method="gpu_hist",
        objective="reg:squarederror",
        seed=42,
        max_depth=5,
        eta=0.01,
        n_estimators=10)

    model.client = client
    model.fit(X, y, eval_set=[(X, y)])
    print("Saving the model..")
    model.get_booster().save_model("xgboost.model")

    print("Doing model importance..")
    columns = get_columns(FILENAME)
    pd.Series(model.feature_importances_, index=columns).sort_values(ascending=False).to_pickle("~/yolo.pkl")


if __name__ == "__main__":
    os.environ["MALLOC_TRIM_THRESHOLD_"]="65536"
    with LocalCUDACluster(device_memory_limit="15 GiB", rmm_pool_size="20 GiB") as cluster:
    # with LocalCluster() as cluster:
        with Client(cluster) as client:
            print(client)
            main(client)

Error as follows.

MemoryError: std::bad_alloc: out_of_memory: RMM failure at:/workspace/.conda-bld/work/include/rmm/mr/device/pool_memory_resource.hpp:192: Maximum pool size exceeded

Basically, my GPU runs out of memory when I call model.fit. It works when I use a CSV with 64100 rows and fails when I use a CSV with 128198 rows (2x the rows). These aren't large files, so I assume I am doing something wrong.

I have tried fiddling around with

  • LocalCUDACluster: device_memory_limit and rmm_pool_size
  • dask_cudf.read_csv: chunksize

Nothing has worked (a sketch of these settings appears below).
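
As a point of reference, here is a minimal sketch of the kinds of settings described above; the sizes and the file name are placeholders, and chunksize is the dask_cudf.read_csv argument from the list (recent dask_cudf releases may call this blocksize):

from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask_cudf

# Lower spill threshold and smaller RMM pool than in the script above
# (illustrative values only, not recommendations).
cluster = LocalCUDACluster(device_memory_limit="10 GiB", rmm_pool_size="16 GiB")
client = Client(cluster)

# Read the CSV in smaller partitions so each chunk on the GPU stays small;
# "data.csv" is a placeholder path.
X = dask_cudf.read_csv("data.csv", chunksize="128 MiB")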

I have been stuck on this all day so any help would be much appreciated.

Comments (1)

软甜啾 2025-01-28 23:15:23

You cannot train an XGBoost model on the GPU if the model grows larger than the remaining GPU memory. You can scale out with dask_xgboost, but you need to make sure that the total GPU memory across your workers is sufficient.

Here is a great blog on this by Coiled: https://coiled.io/blog/dask-xgboost-python-example/
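
For illustration, a minimal sketch of scaling the same DaskXGBRegressor training out across two GPUs with dask-cuda, so XGBoost can use the combined memory of both devices; the GPU indices, file path and target column name are placeholders:

from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask_cudf
import xgboost as xgb

if __name__ == "__main__":
    # One Dask worker per GPU; CUDA_VISIBLE_DEVICES lists the devices to use
    # (placeholder indices for a machine with at least two GPUs).
    with LocalCUDACluster(CUDA_VISIBLE_DEVICES="0,1") as cluster:
        with Client(cluster) as client:
            X = dask_cudf.read_csv("data.csv")   # placeholder path
            y = X[["target"]]                    # placeholder target column
            X = X.drop(columns=["target"])

            model = xgb.dask.DaskXGBRegressor(
                tree_method="gpu_hist",
                objective="reg:squarederror",
                n_estimators=10,
                max_depth=5)
            model.client = client
            # Training partitions are spread across both workers, so the model
            # has the sum of the two GPUs' memory to work with.
            model.fit(X, y)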
