XGBoostError: rabit/internal/utils.h:90: Allreduce failed - error when trying to run XGBoost with Dask on AWS

Posted on 2025-01-26 19:25:47

Overview: I'm trying to run an XGBoost model with Dask on a bunch of Parquet files sitting in S3, by setting up a Fargate cluster and connecting a Dask client to it.

The dataframe totals about 140 GB of data. I scaled up a Fargate cluster with these properties:

  • Workers: 39
  • Total threads: 156
  • Total memory: 371.93 GiB

So there should be enough memory to hold the data. Each worker has 9+ GiB and 4 threads. I do some very basic preprocessing and then create a DaskDMatrix, which does push the bytes stored per worker a little high, but never above the threshold where a worker would fail.
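For reference, here is a minimal sketch of the arithmetic behind those numbers. It assumes the 140 GB figure is in decimal gigabytes and that the data is spread evenly across workers, neither of which I have confirmed, and it ignores any extra memory used while the DaskDMatrix is built.

```python
# Figures quoted in this post
N_WORKERS = 39
TOTAL_THREADS = 156
TOTAL_MEMORY_GIB = 371.93
DATA_GB = 140  # assumption: decimal gigabytes (the post does not say GB vs GiB)

threads_per_worker = TOTAL_THREADS // N_WORKERS
memory_per_worker_gib = TOTAL_MEMORY_GIB / N_WORKERS

# Convert the dataset size to GiB so it is comparable with the cluster memory
data_gib = DATA_GB * 1e9 / 2**30

# Assumption: partitions are spread evenly over all workers
data_per_worker_gib = data_gib / N_WORKERS

# Ratio of total cluster memory to raw dataset size
headroom = TOTAL_MEMORY_GIB / data_gib

print(f"threads per worker:  {threads_per_worker}")
print(f"memory per worker:   {memory_per_worker_gib:.2f} GiB")
print(f"data per worker:     {data_per_worker_gib:.2f} GiB")
print(f"memory / data ratio: {headroom:.2f}x")
```

Under these assumptions the raw data alone takes well under half of each worker's memory, which is why I did not expect memory to be the problem.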

Next I run xgb.dask.train, which comes from the xgboost package, not the dask_ml.xgboost package. Very quickly the workers die and I get the error XGBoostError: rabit/internal/utils.h:90: Allreduce failed. When I tried this with a single file containing only 17 MB of data, I still got the error, but only a couple of workers died. Does anyone know why this happens, given that the cluster has more than double the memory of the dataframe?

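import xgboost as xgb
import xgboost.dask  # make sure the xgb.dask submodule is loaded

# X_train, X_test, y_train and y_test are the Dask collections left after preprocessing;
# client is the dask.distributed Client connected to the Fargate cluster.
X_train = X_train.to_dask_array()
X_test = X_test.to_dask_array()
# y_train and y_test are passed through unchanged

dtrain = xgb.dask.DaskDMatrix(client, X_train, y_train)

output = xgb.dask.train(
    client,
    {"verbosity": 1, "tree_method": "hist", "objective": "reg:squarederror"},
    dtrain,
    num_boost_round=100,
    evals=[(dtrain, "train")],
)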
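
Since the 17 MB case still fails on a couple of workers, one thing that could be checked is whether every worker actually holds part of the data before training starts. The sketch below is only a diagnostic idea, not a confirmed cause: `workers_without_data` is a hypothetical helper of my own, and the commented usage assumes `Client.has_what()` and `Client.scheduler_info()` from dask.distributed, with `X_train` persisted first. Note that `has_what()` reports every key a worker holds, not only the `X_train` partitions, so this is a rough check.

```python
from typing import Collection, List, Mapping


def workers_without_data(
    has_what: Mapping[str, Collection],
    all_workers: Collection[str],
) -> List[str]:
    """Return the addresses of workers that hold no keys at all.

    has_what maps a worker address to the keys stored on that worker,
    in the shape returned by dask.distributed's Client.has_what().
    """
    return sorted(w for w in all_workers if not has_what.get(w))


# Hypothetical usage against a live cluster (not run here):
#   from dask.distributed import wait
#   X_train = X_train.persist()
#   wait(X_train)
#   held = client.has_what()
#   workers = list(client.scheduler_info()["workers"])
#   print(workers_without_data(held, workers))

# Small offline example with made-up worker addresses
example_has_what = {
    "tcp://10.0.0.1:4000": ["part-0"],
    "tcp://10.0.0.2:4000": [],
}
example_workers = [
    "tcp://10.0.0.1:4000",
    "tcp://10.0.0.2:4000",
    "tcp://10.0.0.3:4000",
]
idle = workers_without_data(example_has_what, example_workers)
print(idle)
```

If a single small file ends up as one partition, most of the 39 workers would show up in this list, which would at least tell me whether the failing workers are the ones with or without data.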
