GCP Vertex AI Training: Auto-packaged custom training job produces a huge Docker image
I am trying to run a Custom Training Job in Google Cloud Platform's Vertex AI Training service.
The job is based on a tutorial from Google that fine-tunes a pre-trained BERT model (from HuggingFace).
When I use the gcloud CLI tool to auto-package my training code into a Docker image and deploy it to the Vertex AI Training service like so:
$BASE_GPU_IMAGE = "us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.1-7:latest"
$BUCKET_NAME = "my-bucket"

gcloud ai custom-jobs create `
    --region=us-central1 `
    --display-name=fine_tune_bert `
    --args="--job_dir=$BUCKET_NAME,--num-epochs=2,--model-name=finetuned-bert-classifier" `
    --worker-pool-spec="machine-type=n1-standard-4,replica-count=1,accelerator-type=NVIDIA_TESLA_V100,executor-image-uri=$BASE_GPU_IMAGE,local-package-path=.,python-module=trainer.task"
... I end up with a Docker image that is roughly 18GB (!) and takes a very long time to upload to the GCP registry.
Granted, the base image is around 6.5GB, but where do the additional >10GB come from, and is there a way for me to avoid this "image bloat"?
Please note that my job loads the training data using the datasets Python package at run time and AFAIK does not include it in the auto-packaged Docker image.
1 Answer
The image size shown in the UI is the virtual size of the image. It is the compressed total image size that will be downloaded over the network. Once the image is pulled, it will be extracted and the resulting size will be bigger. In this case, the PyTorch image's virtual size is 6.8 GB while the actual size is 17.9 GB.
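You can compare the two figures yourself by looking at the on-disk size reported by the local Docker client versus the compressed layer sizes stored in the registry. A rough sketch (docker manifest inspect may require a reasonably recent Docker client, and the exact numbers depend on the image tag):

# Uncompressed (extracted, on-disk) size -- this is the larger ~18 GB figure:
docker pull us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.1-7:latest
docker image ls us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.1-7:latest

# Compressed layer sizes as stored in the registry -- summing the "size"
# fields gives the smaller ~6-7 GB virtual size shown in the UI:
docker manifest inspect us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.1-7:latest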
Also, when a docker push command is executed, the progress bars show the uncompressed size. The actual amount of data that is pushed is compressed before sending, so the uploaded size is not reflected by the progress bar.

To cut down the size of the Docker image, custom containers can be used. With a custom container, only the necessary components are configured, which results in a smaller Docker image. More information on custom containers can be found in the Vertex AI documentation.
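As a rough sketch of the custom-container route (the base image tag, the Artifact Registry path us-central1-docker.pkg.dev/my-project/my-repo/bert-trainer, and the dependency list below are illustrative assumptions, not part of the original setup), you would build an image containing only the training code and its dependencies, push it to Artifact Registry, and point the worker pool at it with container-image-uri instead of executor-image-uri/local-package-path:

# Dockerfile -- minimal custom training container (illustrative sketch)
FROM pytorch/pytorch:1.7.1-cuda11.0-cudnn8-runtime
WORKDIR /app
# Install only the packages the trainer actually needs (assumed here:
# transformers, datasets, google-cloud-storage listed in requirements.txt).
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy just the training package and make it the container entrypoint.
COPY trainer/ trainer/
ENTRYPOINT ["python", "-m", "trainer.task"]

Build and push the image, then submit the job using the custom container:

docker build -t us-central1-docker.pkg.dev/my-project/my-repo/bert-trainer:latest .
docker push us-central1-docker.pkg.dev/my-project/my-repo/bert-trainer:latest

gcloud ai custom-jobs create `
    --region=us-central1 `
    --display-name=fine_tune_bert `
    --args="--job_dir=$BUCKET_NAME,--num-epochs=2,--model-name=finetuned-bert-classifier" `
    --worker-pool-spec="machine-type=n1-standard-4,replica-count=1,accelerator-type=NVIDIA_TESLA_V100,container-image-uri=us-central1-docker.pkg.dev/my-project/my-repo/bert-trainer:latest"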