Machine config for training a model on TPUs on GCP with TensorFlow Cloud
I am trying to train a rather large model (Longformer-large with a CNN classification head on top) on Google Cloud Platform, using TensorFlow Cloud from Colab. I first ran it with a batch size of 4 on 4 P100 GPUs, but I still got an OOM error, so I would like to try a TPU instead. I have now increased the batch size to 8.
However, I get an error saying that chief_config cannot be a TPU config.
This is my code:
tfc.run(
    distribution_strategy="auto",
    requirements_txt="requirements.txt",
    docker_config=tfc.DockerConfig(
        image_build_bucket=GCS_BUCKET
    ),
    worker_count=1,
    worker_config=tfc.COMMON_MACHINE_CONFIGS["TPU"],
    chief_config=tfc.COMMON_MACHINE_CONFIGS["TPU"],
    job_labels={"job": JOB_NAME},
)
This is the error:
Validating environment and input parameters.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-26-e1be60d71623> in <module>()
19 worker_config= tfc.COMMON_MACHINE_CONFIGS["TPU"],
20 chief_config=tfc.COMMON_MACHINE_CONFIGS["TPU"],
---> 21 job_labels={"job": JOB_NAME},
22 )
/usr/local/lib/python3.7/dist-packages/tensorflow_cloud/core/run.py in run(entry_point, requirements_txt, docker_config, distribution_strategy, chief_config, worker_config, worker_count, entry_point_args, stream_logs, job_labels, service_account, **kwargs)
256 job_labels=job_labels or {},
257 service_account=service_account,
--> 258 docker_parent_image=docker_config.parent_image,
259 )
260 print("Validation was successful.")
/usr/local/lib/python3.7/dist-packages/tensorflow_cloud/core/validate.py in validate(entry_point, requirements_txt, distribution_strategy, chief_config, worker_config, worker_count, entry_point_args, stream_logs, docker_image_build_bucket, called_from_notebook, job_labels, service_account, docker_parent_image)
78 _validate_distribution_strategy(distribution_strategy)
79 _validate_cluster_config(
---> 80 chief_config, worker_count, worker_config, docker_parent_image
81 )
82 _validate_job_labels(job_labels or {})
/usr/local/lib/python3.7/dist-packages/tensorflow_cloud/core/validate.py in _validate_cluster_config(chief_config, worker_count, worker_config, docker_parent_image)
160 "Invalid `chief_config` input. "
161 "`chief_config` cannot be a TPU config. "
--> 162 "Received {}.".format(chief_config)
163 )
164
ValueError: Invalid `chief_config` input. `chief_config` cannot be a TPU config. Received <tensorflow_cloud.core.machine_config.MachineConfig object at 0x7f5860afe210>.
Can someone tell me how I can run my code on TPUs on GCP? I don't care much about training time; I just want a configuration that runs without OOM errors (so a GPU setup is totally fine with me as well, if it works).
Thank you!
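For reference, the traceback shows the validator rejecting a TPU chief, so one configuration worth trying keeps the chief on a CPU machine and attaches the TPU through the single worker. The sketch below is a minimal illustration under that assumption, not a confirmed fix: build_run_kwargs and launch are hypothetical helper names, GCS_BUCKET and JOB_NAME are the placeholders from the question, and the "CPU" key is assumed to be present in tfc.COMMON_MACHINE_CONFIGS alongside "TPU". Whether the model then fits in TPU memory is a separate question this does not answer.

```python
def build_run_kwargs(machine_configs, job_name):
    """Build tfc.run keyword arguments with a CPU chief and a single TPU worker.

    machine_configs is expected to be tfc.COMMON_MACHINE_CONFIGS; any mapping
    with "CPU" and "TPU" keys works, which keeps this helper easy to check.
    """
    return {
        "distribution_strategy": "auto",
        "requirements_txt": "requirements.txt",
        # The validator in the traceback rejects a TPU chief, so the chief stays on CPU.
        "chief_config": machine_configs["CPU"],
        # The TPU is attached through the single worker instead.
        "worker_count": 1,
        "worker_config": machine_configs["TPU"],
        "job_labels": {"job": job_name},
    }


def launch(gcs_bucket, job_name):
    """Submit the job; needs tensorflow_cloud installed and GCP credentials set up."""
    # Imported here so build_run_kwargs has no external dependencies.
    import tensorflow_cloud as tfc

    tfc.run(
        docker_config=tfc.DockerConfig(image_build_bucket=gcs_bucket),
        **build_run_kwargs(tfc.COMMON_MACHINE_CONFIGS, job_name),
    )
```

In a Colab cell this would be called as launch(GCS_BUCKET, JOB_NAME). Splitting the keyword arguments from the submission step makes it possible to inspect the chief and worker settings before any job is sent to GCP.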