GCP Vertex AI Training: Auto-packaged custom training job produces a huge Docker image
I am trying to run a Custom Training Job in Google Cloud Platform's Vertex AI Training service.
The job is based on a tutorial from Google that fine-tunes a pre-trained BERT model (from HuggingFace).
When I use the gcloud CLI tool to auto-package my training code into a Docker image and deploy it to the Vertex AI Training service like so:
$BASE_GPU_IMAGE = "us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.1-7:latest"
$BUCKET_NAME = "my-bucket"

gcloud ai custom-jobs create `
    --region=us-central1 `
    --display-name=fine_tune_bert `
    --args="--job_dir=$BUCKET_NAME,--num-epochs=2,--model-name=finetuned-bert-classifier" `
    --worker-pool-spec="machine-type=n1-standard-4,replica-count=1,accelerator-type=NVIDIA_TESLA_V100,executor-image-uri=$BASE_GPU_IMAGE,local-package-path=.,python-module=trainer.task"
... I end up with a Docker image that is roughly 18GB (!) and takes a very long time to upload to the GCP registry.
Granted, the base image is around 6.5GB, but where do the additional >10GB come from, and is there a way for me to avoid this "image bloat"?
Please note that my job loads the training data using the datasets Python package at run time and AFAIK does not include it in the auto-packaged Docker image.
1 Answer
The image size shown in the UI is the virtual size of the image. It is the compressed total image size that will be downloaded over the network. Once the image is pulled, it will be extracted and the resulting size will be bigger. In this case, the PyTorch image's virtual size is 6.8 GB while the actual size is 17.9 GB.
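You can compare the two figures yourself by looking at the on-disk size reported by the local Docker client versus the compressed layer sizes stored in the registry. A rough sketch (docker manifest inspect may require a reasonably recent Docker client, and the exact numbers depend on the image tag):

# Uncompressed (extracted, on-disk) size -- this is the larger ~18 GB figure:
docker pull us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.1-7:latest
docker image ls us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.1-7:latest

# Compressed layer sizes as stored in the registry -- summing the "size"
# fields gives the smaller ~6-7 GB virtual size shown in the UI:
docker manifest inspect us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.1-7:latest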
Also, when a docker push command is executed, the progress bars show the uncompressed size. The actual amount of data that is pushed is compressed before sending, so the uploaded size is not reflected by the progress bar.

To cut down the size of the Docker image, custom containers can be used. With a custom container, only the necessary components are configured, which results in a smaller Docker image. More information on custom containers can be found in the Vertex AI documentation.
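As a rough sketch of the custom-container route (the base image tag, the Artifact Registry path us-central1-docker.pkg.dev/my-project/my-repo/bert-trainer, and the dependency list below are illustrative assumptions, not part of the original setup), you would build an image containing only the training code and its dependencies, push it to Artifact Registry, and point the worker pool at it with container-image-uri instead of executor-image-uri/local-package-path:

# Dockerfile -- minimal custom training container (illustrative sketch)
FROM pytorch/pytorch:1.7.1-cuda11.0-cudnn8-runtime
WORKDIR /app
# Install only the packages the trainer actually needs (assumed here:
# transformers, datasets, google-cloud-storage listed in requirements.txt).
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy just the training package and make it the container entrypoint.
COPY trainer/ trainer/
ENTRYPOINT ["python", "-m", "trainer.task"]

Build and push the image, then submit the job using the custom container:

docker build -t us-central1-docker.pkg.dev/my-project/my-repo/bert-trainer:latest .
docker push us-central1-docker.pkg.dev/my-project/my-repo/bert-trainer:latest

gcloud ai custom-jobs create `
    --region=us-central1 `
    --display-name=fine_tune_bert `
    --args="--job_dir=$BUCKET_NAME,--num-epochs=2,--model-name=finetuned-bert-classifier" `
    --worker-pool-spec="machine-type=n1-standard-4,replica-count=1,accelerator-type=NVIDIA_TESLA_V100,container-image-uri=us-central1-docker.pkg.dev/my-project/my-repo/bert-trainer:latest"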