How to train a model on multiple GPUs in TensorFlow

Published 2025-01-09 13:20:18


I've read the official Keras documentation, and it says:

To do single-host, multi-device synchronous training with a Keras model, you would use the tf.distribute.MirroredStrategy API. Here's how it works:

> Instantiate a MirroredStrategy, optionally configuring which specific devices you want to use (by default the strategy will use all GPUs available).
>
> Use the strategy object to open a scope, and within this scope, create all the Keras objects you need that contain variables. Typically, that means creating & compiling the model inside the distribution scope.
>
> Train the model via fit() as usual.

Here is what I did. Basically, I have 8 GPUs, but only 3 are available for this task (5, 6, and 7). I create a strategy over these 3 GPUs and compile the model inside the scope. However, each epoch of training takes as much time as it does on a single GPU, and nvidia-smi in the terminal also shows that only GPU 7 is in use. Maybe the warning message points to the problem? But I am not an expert... If that is the issue, could someone translate it into plain English or suggest a solution? Thanks a lot!!

```python
strategy = tf.distribute.MirroredStrategy(["GPU:5", "GPU:6", "GPU:7"])
print('Number of devices: {}'.format(strategy.num_replicas_in_sync))
```

which prints:

```
WARNING:tensorflow:Some requested devices in `tf.distribute.Strategy` are not visible to TensorFlow: /job:localhost/replica:0/task:0/device:GPU:6,/job:localhost/replica:0/task:0/device:GPU:5,/job:localhost/replica:0/task:0/device:GPU:7
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:5', '/job:localhost/replica:0/task:0/device:GPU:6', '/job:localhost/replica:0/task:0/device:GPU:7')
Number of devices: 3
```

```python
with strategy.scope():
    model_test = model_unet
    model_test.compile(loss=loss,
                       optimizer=adam_opt,
                       metrics=['accuracy', segmentation_models.metrics.IOUScore()])

model_test.fit(x_train, y_train,
               validation_data=(x_val, y_val),
               batch_size=16,
               epochs=6, verbose=1, callbacks=callbacks)
```

An example of the first epoch:

```
Train on 14400 samples, validate on 3600 samples
Epoch 1/6
 6384/14400 [============>.................] - ETA: 5:35 - loss: 0.0045 - accuracy: 0.9833 - iou_score: 0.8918
```

Only GPU 7 is in use
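For context on the warning: it says the three requested devices are "not visible" to TensorFlow. One common way this happens (an assumption about this setup, not stated in the post) is when CUDA_VISIBLE_DEVICES is set to, e.g., 5,6,7 before TensorFlow starts: the CUDA runtime renumbers the visible GPUs from zero, so TensorFlow sees them as GPU:0, GPU:1, and GPU:2, and a request for GPU:5 through GPU:7 matches no visible device. A plain-Python sketch of that renumbering:

```python
def logical_devices(cuda_visible_devices):
    """Map the physical GPU ids listed in a CUDA_VISIBLE_DEVICES string to
    the logical names TensorFlow would see (renumbered from zero)."""
    physical = [int(d) for d in cuda_visible_devices.split(",") if d.strip()]
    return {f"GPU:{i}": f"physical GPU {p}" for i, p in enumerate(physical)}

# With CUDA_VISIBLE_DEVICES=5,6,7 the visible devices are renumbered:
mapping = logical_devices("5,6,7")
print(mapping)
# {'GPU:0': 'physical GPU 5', 'GPU:1': 'physical GPU 6', 'GPU:2': 'physical GPU 7'}
```

Under that assumption, a MirroredStrategy over ["GPU:0", "GPU:1", "GPU:2"] would target the intended cards, while "GPU:5" through "GPU:7" would refer to devices TensorFlow cannot see.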
