XGBoostError: rabit/internal/utils.h:90: Allreduce failed - error when trying XGBoost on a Dask Fargate cluster in AWS

Posted 2025-01-26 19:25:47

Overview: I'm trying to run an XGBoost model with Dask on a bunch of Parquet files sitting in S3, by setting up a Fargate cluster and connecting it to a Dask cluster.

The dataframe totals about 140 GB of data. I scaled up a Fargate cluster with these properties:

Workers: 39
Total threads: 156
Total memory: 371.93 GiB

So there should be enough memory to hold the data. Each worker has 9+ GB and 4 threads. I do some very basic preprocessing and then create a DaskDMatrix, which does cause the task bytes per worker to get a little high, but never above the threshold where it would fail.

Next I run xgb.dask.train, which uses the xgboost package, not the dask_ml.xgboost package. Very quickly, the workers die and I get the error XGBoostError: rabit/internal/utils.h:90: Allreduce failed. When I tried this with a single file containing only 17 MB of data, I still got the error, but only a couple of workers died. Does anyone know why this happens, given that the cluster has more than double the memory of the dataframe?
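For reference, here is the back-of-envelope arithmetic behind the memory claim, as a rough sketch. It assumes the 140 GB and 17 MB figures are decimal units and that the data is spread evenly across workers; neither is guaranteed, and the variable names are only for illustration. It does not model temporary copies made while the DaskDMatrix is built.

```python
# Rough capacity check using the cluster figures quoted above.
GIB = 2**30
MIB = 2**20

workers = 39
total_threads = 156
total_memory_gib = 371.93

# Assumption: "140 GB" and "17 MB" are decimal units (10**9 and 10**6 bytes).
data_bytes = 140 * 10**9
small_file_bytes = 17 * 10**6

threads_per_worker = total_threads // workers
memory_per_worker_gib = total_memory_gib / workers

# Assumption: partitions are distributed evenly across all workers.
data_per_worker_gib = data_bytes / GIB / workers
small_per_worker_mib = small_file_bytes / MIB / workers

# Ratio of total cluster memory to raw dataframe size.
headroom = total_memory_gib / (data_bytes / GIB)

print(f"threads per worker:        {threads_per_worker}")
print(f"memory per worker:         {memory_per_worker_gib:.2f} GiB")
print(f"data per worker (140 GB):  {data_per_worker_gib:.2f} GiB")
print(f"data per worker (17 MB):   {small_per_worker_mib:.2f} MiB")
print(f"cluster memory / data:     {headroom:.2f}x")
```

Under these assumptions each worker holds a few GiB out of roughly 9.5 GiB in the full run, and well under 1 MiB in the 17 MB run, which is why raw capacity alone does not obviously explain the failure.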

X_train = X_train.to_dask_array()
X_test = X_test.to_dask_array()
y_train = y_train
y_test = y_test

dtrain = xgb.dask.DaskDMatrix(client, X_train, y_train)

output = xgb.dask.train(
    client,
    {"verbosity": 1, "tree_method": "hist", "objective": "reg:squarederror"},
    dtrain,
    num_boost_round=100,
    evals=[(dtrain, "train")],
)
