How to train a model on multiple GPUs in TensorFlow

Published 2025-01-09 13:20:18


I've read the official Keras documentation, and it says:

To do single-host, multi-device synchronous training with a Keras model, you would use the tf.distribute.MirroredStrategy API. Here's how it works:

> Instantiate a MirroredStrategy, optionally configuring which specific devices you want to use (by default the strategy will use all GPUs available).
>
> Use the strategy object to open a scope, and within this scope, create all the Keras objects you need that contain variables. Typically, that means creating & compiling the model inside the distribution scope.
>
> Train the model via fit() as usual.

Here is what I did. Basically, I have 8 GPUs, but only 3 are available for this task (5, 6, and 7). I create a strategy over these 3 GPUs and compile the model inside the scope. However, each epoch of training takes as much time as it does on a single GPU, and nvidia-smi in the terminal also shows that only GPU 7 is in use. Maybe the warning message points to the problem? But I am not an expert... If that is the issue, could someone translate it into plain English or suggest a solution? Thanks a lot!!

```python
strategy = tf.distribute.MirroredStrategy(["GPU:5", "GPU:6", "GPU:7"])
print('Number of devices: {}'.format(strategy.num_replicas_in_sync))
```

which prints:

```
WARNING:tensorflow:Some requested devices in `tf.distribute.Strategy` are not visible to TensorFlow: /job:localhost/replica:0/task:0/device:GPU:6,/job:localhost/replica:0/task:0/device:GPU:5,/job:localhost/replica:0/task:0/device:GPU:7
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:5', '/job:localhost/replica:0/task:0/device:GPU:6', '/job:localhost/replica:0/task:0/device:GPU:7')
Number of devices: 3
```

```python
with strategy.scope():
    model_test = model_unet
    model_test.compile(loss=loss,
                       optimizer=adam_opt,
                       metrics=['accuracy', segmentation_models.metrics.IOUScore()])

model_test.fit(x_train, y_train,
               validation_data=(x_val, y_val),
               batch_size=16,
               epochs=6, verbose=1, callbacks=callbacks)
```

An example of the first epoch:

```
Train on 14400 samples, validate on 3600 samples
Epoch 1/6
 6384/14400 [============>.................] - ETA: 5:35 - loss: 0.0045 - accuracy: 0.9833 - iou_score: 0.8918
```

Only GPU 7 is in use
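For context on the warning: it says the three requested devices are "not visible" to TensorFlow. One common way this happens (an assumption about this setup, not stated in the post) is when CUDA_VISIBLE_DEVICES is set to, e.g., 5,6,7 before TensorFlow starts: the CUDA runtime renumbers the visible GPUs from zero, so TensorFlow sees them as GPU:0, GPU:1, and GPU:2, and a request for GPU:5 through GPU:7 matches no visible device. A plain-Python sketch of that renumbering:

```python
def logical_devices(cuda_visible_devices):
    """Map the physical GPU ids listed in a CUDA_VISIBLE_DEVICES string to
    the logical names TensorFlow would see (renumbered from zero)."""
    physical = [int(d) for d in cuda_visible_devices.split(",") if d.strip()]
    return {f"GPU:{i}": f"physical GPU {p}" for i, p in enumerate(physical)}

# With CUDA_VISIBLE_DEVICES=5,6,7 the visible devices are renumbered:
mapping = logical_devices("5,6,7")
print(mapping)
# {'GPU:0': 'physical GPU 5', 'GPU:1': 'physical GPU 6', 'GPU:2': 'physical GPU 7'}
```

Under that assumption, a MirroredStrategy over ["GPU:0", "GPU:1", "GPU:2"] would target the intended cards, while "GPU:5" through "GPU:7" would refer to devices TensorFlow cannot see.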
