XGBoostError: rabit/internal/utils.h:90: Allreduce failed - error when running XGBoost with Dask in AWS
Overview: I'm trying to run an XGBoost model on a set of Parquet files stored in S3, using Dask on a Fargate cluster that I connect to as a Dask cluster.
The total dataframe size is about 140 GB. I scaled up a Fargate cluster with these properties:
- Workers: 39
- Total threads: 156
- Total memory: 371.93 GiB
So there should be enough memory to hold the data. Each worker has 9+ GB with 4 threads. I do some very basic preprocessing and then create a DaskDMatrix, which does cause the task bytes per worker to get a little high, but never above the threshold where a worker would fail.
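To make the headroom claim concrete, here is a small back-of-the-envelope check using only the figures quoted above. It assumes the "140 GB" figure is in decimal gigabytes and that the data would be spread evenly across workers, neither of which is confirmed in the post.

```python
# Figures taken from the cluster description above.
workers = 39
total_threads = 156
total_memory_gib = 371.93
data_gb = 140  # approximate dataframe size; assumed to be decimal GB

GIB = 2**30  # bytes in one GiB
GB = 10**9   # bytes in one GB

threads_per_worker = total_threads // workers
mem_per_worker_gib = total_memory_gib / workers

# Share of the data each worker would hold if it were spread evenly (an assumption).
data_per_worker_gb = data_gb / workers

# Ratio of cluster memory to data size, both converted to bytes.
headroom = (total_memory_gib * GIB) / (data_gb * GB)

print(f"threads per worker: {threads_per_worker}")
print(f"memory per worker:  {mem_per_worker_gib:.2f} GiB")
print(f"data per worker:    {data_per_worker_gb:.2f} GB")
print(f"memory / data:      {headroom:.2f}x")
```

This only compares raw sizes. It says nothing about the extra copies made while building the DaskDMatrix or during training.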
Next I run xgb.dask.train, which uses the xgboost package rather than the dask_ml.xgboost package. Very quickly, the workers die and I get the error XGBoostError: rabit/internal/utils.h:90: Allreduce failed. When I tried this with a single file containing only 17 MB of data, I still got the error, but only a couple of workers died. Does anyone know why this happens, given that the cluster has more than double the memory of the dataframe?
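import xgboost as xgb

# Convert the feature frames to Dask arrays; the targets are passed through unchanged.
X_train = X_train.to_dask_array()
X_test = X_test.to_dask_array()
y_train = y_train
y_test = y_test

# `client` is assumed to be the dask.distributed Client connected to the Fargate cluster.
dtrain = xgb.dask.DaskDMatrix(client, X_train, y_train)
output = xgb.dask.train(
    client,
    {"verbosity": 1, "tree_method": "hist", "objective": "reg:squarederror"},
    dtrain,
    num_boost_round=100,
    evals=[(dtrain, "train")],
)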
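One thing that may be worth checking in the 17 MB case is how the data is spread across the 39 workers, since a single small file could leave most workers holding no partitions at all. Whether that is related to the Allreduce failure is not established here, so this is only a diagnostic sketch. The input mapping mirrors the shape returned by `Client.who_has()` in dask.distributed (task key to list of worker addresses). The helper names and the worker addresses in the example are hypothetical.

```python
from collections import Counter


def partitions_per_worker(who_has, all_workers):
    """Count how many keys each worker holds.

    who_has: mapping of task key -> list of worker addresses holding it
             (the shape returned by dask.distributed's Client.who_has()).
    all_workers: every worker address in the cluster, so that workers
                 holding nothing still show up with a count of 0.
    """
    counts = Counter({worker: 0 for worker in all_workers})
    for holders in who_has.values():
        for worker in holders:
            counts[worker] += 1
    return dict(counts)


def workers_without_data(who_has, all_workers):
    """Return the sorted addresses of workers that hold no keys."""
    counts = partitions_per_worker(who_has, all_workers)
    return sorted(worker for worker, n in counts.items() if n == 0)


# Hypothetical example: three workers, two partitions on the same worker.
example_workers = ["tcp://10.0.0.1:9000", "tcp://10.0.0.2:9000", "tcp://10.0.0.3:9000"]
example_who_has = {
    "part-0": ["tcp://10.0.0.1:9000"],
    "part-1": ["tcp://10.0.0.1:9000"],
}

example_counts = partitions_per_worker(example_who_has, example_workers)
example_empty = workers_without_data(example_who_has, example_workers)
print(example_counts)
print(example_empty)
```

On a live cluster, the same helpers could be fed `client.who_has()` and the keys of `client.scheduler_info()["workers"]` after persisting the training data.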