训练ML模型时,GPU的内存用完了
我正在尝试使用DASK训练ML模型。我正在用1 GPU在本地机器上进行培训。我的GPU有24个gibs的记忆。
from dask_cuda import LocalCUDACluster
from dask.distributed import Client, LocalCluster
import dask.dataframe as dd
import pandas as pd
import numpy as np
import os
import xgboost as xgb
np.random.seed(42)
def get_columns(filename):
return pd.read_csv(filename, nrows=10).iloc[:, :NUM_FEATURES].columns
def get_data(filename, target):
import dask_cudf
X = dask_cudf.read_csv(filename)
# X = dd.read_csv(filename, assume_missing=True)
y = X[[target]]
X = X.iloc[:, :NUM_FEATURES]
return X, y
def main(client: Client) -> None:
X, y = get_data(FILENAME, TARGET)
model = xgb.dask.DaskXGBRegressor(
tree_method="gpu_hist",
objective="reg:squarederror",
seed=42,
max_depth=5,
eta=0.01,
n_estimators=10)
model.client = client
model.fit(X, y, eval_set=[(X, y)])
print("Saving the model..")
model.get_booster().save_model("xgboost.model")
print("Doing model importance..")
columns = get_columns(FILENAME)
pd.Series(model.feature_importances_, index=columns).sort_values(ascending=False).to_pickle("~/yolo.pkl")
if __name__ == "__main__":
os.environ["MALLOC_TRIM_THRESHOLD_"]="65536"
with LocalCUDACluster(device_memory_limit="15 GiB", rmm_pool_size="20 GiB") as cluster:
# with LocalCluster() as cluster:
with Client(cluster) as client:
print(client)
main(client)
错误如下。
MemoryError: std::bad_alloc: out_of_memory: RMM failure at:/workspace/.conda-bld/work/include/rmm/mr/device/pool_memory_resource.hpp:192: Maximum pool size exceeded
基本上,当我调用型号。当我使用具有64100行的CSV时,当我使用带有128198行(2x行)的CSV时,它会工作。这些不是大文件,所以我认为我做错了什么。
我尝试使用
- LocalCudAcluster来解决:device_memory_limit和rmm_pool_size
- dask_cudf.read_csv:块size
毫无用处。
我整天都陷入困境,因此任何帮助都将不胜感激。
I am trying to train a ml model using dask. I am training on my local machine with 1 GPU. My GPU has 24 GiBs of memory.
from dask_cuda import LocalCUDACluster
from dask.distributed import Client, LocalCluster
import dask.dataframe as dd
import pandas as pd
import numpy as np
import os
import xgboost as xgb
np.random.seed(42)
def get_columns(filename):
return pd.read_csv(filename, nrows=10).iloc[:, :NUM_FEATURES].columns
def get_data(filename, target):
import dask_cudf
X = dask_cudf.read_csv(filename)
# X = dd.read_csv(filename, assume_missing=True)
y = X[[target]]
X = X.iloc[:, :NUM_FEATURES]
return X, y
def main(client: Client) -> None:
X, y = get_data(FILENAME, TARGET)
model = xgb.dask.DaskXGBRegressor(
tree_method="gpu_hist",
objective="reg:squarederror",
seed=42,
max_depth=5,
eta=0.01,
n_estimators=10)
model.client = client
model.fit(X, y, eval_set=[(X, y)])
print("Saving the model..")
model.get_booster().save_model("xgboost.model")
print("Doing model importance..")
columns = get_columns(FILENAME)
pd.Series(model.feature_importances_, index=columns).sort_values(ascending=False).to_pickle("~/yolo.pkl")
if __name__ == "__main__":
os.environ["MALLOC_TRIM_THRESHOLD_"]="65536"
with LocalCUDACluster(device_memory_limit="15 GiB", rmm_pool_size="20 GiB") as cluster:
# with LocalCluster() as cluster:
with Client(cluster) as client:
print(client)
main(client)
Error as follows.
MemoryError: std::bad_alloc: out_of_memory: RMM failure at:/workspace/.conda-bld/work/include/rmm/mr/device/pool_memory_resource.hpp:192: Maximum pool size exceeded
Basically my GPU runs out of memory when I call model.fit. It works when I use a csv with 64100 rows and fails when I use a csv with 128198 rows (2x rows). These aren't large files so I assume I am doing something wrong.
I have tried fiddling around with
- LocalCUDACluster: device_memory_limit and rmm_pool_size
- dask_cudf.read_csv: chunksize
Nothing has worked.
I have been stuck on this all day so any help would be much appreciated.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
您不能在模型生长大于其余的GPU内存大小的情况下训练XGBoost模型。您可以使用dask_xgboost扩展,但是您需要确保总GPU存储器足够。
这是盘绕的一个很棒的博客: https://coiled.io/blog /dask-xgboost-python-example/
You cannot train an xgboost model where the model grows larger than the remaining GPU memory size. You can scale out with dask_xgboost, but you need to ensure that the total GPU memory is sufficient.
Here is a great blog on this by Coiled: https://coiled.io/blog/dask-xgboost-python-example/