MachineConfig for training a model on a TPU on GCP with TensorFlow Cloud

Posted on 2025-02-11 22:06:01

I am trying to train a rather large model (Longformer-large with a CNN classification head on top) on Google Cloud Platform, using TensorFlow Cloud from Colab. I tried a batch size of 4 on 4 P100 GPUs but still got an OOM error, so I would like to try a TPU instead. I have now increased the batch size to 8.

However, I get an error saying that chief_config cannot be a TPU config.

This is my code:

tfc.run(
    distribution_strategy="auto",
    requirements_txt="requirements.txt",
    docker_config=tfc.DockerConfig(
        image_build_bucket=GCS_BUCKET
    ),
    worker_count=1,
    worker_config=tfc.COMMON_MACHINE_CONFIGS["TPU"],
    chief_config=tfc.COMMON_MACHINE_CONFIGS["TPU"],
    job_labels={"job": JOB_NAME},
)

This is the error:

Validating environment and input parameters.
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-26-e1be60d71623> in <module>()
     19     worker_config= tfc.COMMON_MACHINE_CONFIGS["TPU"],
     20     chief_config=tfc.COMMON_MACHINE_CONFIGS["TPU"],
---> 21     job_labels={"job": JOB_NAME},
     22 )

2 frames
/usr/local/lib/python3.7/dist-packages/tensorflow_cloud/core/run.py in run(entry_point, requirements_txt, docker_config, distribution_strategy, chief_config, worker_config, worker_count, entry_point_args, stream_logs, job_labels, service_account, **kwargs)
    256         job_labels=job_labels or {},
    257         service_account=service_account,
--> 258         docker_parent_image=docker_config.parent_image,
    259     )
    260     print("Validation was successful.")

/usr/local/lib/python3.7/dist-packages/tensorflow_cloud/core/validate.py in validate(entry_point, requirements_txt, distribution_strategy, chief_config, worker_config, worker_count, entry_point_args, stream_logs, docker_image_build_bucket, called_from_notebook, job_labels, service_account, docker_parent_image)
     78     _validate_distribution_strategy(distribution_strategy)
     79     _validate_cluster_config(
---> 80         chief_config, worker_count, worker_config, docker_parent_image
     81     )
     82     _validate_job_labels(job_labels or {})

/usr/local/lib/python3.7/dist-packages/tensorflow_cloud/core/validate.py in _validate_cluster_config(chief_config, worker_count, worker_config, docker_parent_image)
    160             "Invalid `chief_config` input. "
    161             "`chief_config` cannot be a TPU config. "
--> 162             "Received {}.".format(chief_config)
    163         )
    164 

ValueError: Invalid `chief_config` input. `chief_config` cannot be a TPU config. Received <tensorflow_cloud.core.machine_config.MachineConfig object at 0x7f5860afe210>.

Can someone tell me how I can run my code on GCP TPUs? I don't care much about training time; I just want a configuration that runs without OOM errors, so a GPU setup would be totally fine with me as well, if it works.

Thank you!
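
For reference, the ValueError above only rejects a TPU as the chief machine. It does not reject a TPU as the worker. One arrangement that would pass that particular check is a CPU chief with the TPU attached as the single worker. The sketch below builds the keyword arguments for that arrangement. The helper name `tpu_run_kwargs` is hypothetical and not part of tensorflow_cloud, and the sketch assumes that `tfc.COMMON_MACHINE_CONFIGS` has a "CPU" entry alongside "TPU". It uses a stand-in mapping so that it runs without the library installed.

```python
def tpu_run_kwargs(machine_configs, job_name):
    """Return tfc.run keyword arguments that use a CPU chief and one TPU worker.

    `machine_configs` is expected to be tfc.COMMON_MACHINE_CONFIGS, or any
    mapping with "CPU" and "TPU" keys. This helper is hypothetical and is
    not part of tensorflow_cloud.
    """
    return {
        "distribution_strategy": "auto",
        "requirements_txt": "requirements.txt",
        # The chief must not be a TPU config, so use a CPU machine here.
        "chief_config": machine_configs["CPU"],
        # The TPU is attached as the single worker instead.
        "worker_count": 1,
        "worker_config": machine_configs["TPU"],
        "job_labels": {"job": job_name},
    }


if __name__ == "__main__":
    # Stand-in mapping so the sketch runs without tensorflow_cloud installed.
    demo = tpu_run_kwargs({"CPU": "cpu-config", "TPU": "tpu-config"}, "demo-job")
    print(demo["chief_config"], demo["worker_config"])
```

In the notebook this would presumably be called as `tfc.run(docker_config=tfc.DockerConfig(image_build_bucket=GCS_BUCKET), **tpu_run_kwargs(tfc.COMMON_MACHINE_CONFIGS, JOB_NAME))`. Whether the model then fits in TPU memory is a separate question that this sketch does not address.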
