Writing a custom TensorFlow training loop for memory-intensive training

I want to use TensorFlow's custom training loop for my model but, due to memory constraints, I can only pass a small number of samples (mini-batches) through in one go. How do I train on these mini-batches and sensibly aggregate the gradients for the whole batch on one machine (GPU/CPU)? See the example below, with code from here; note this example doesn't hit memory issues at this batch size, but it gives the idea of what I'm trying to do:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np

#simple MNIST model
inputs = keras.Input(shape=(784,), name="digits")
x1 = layers.Dense(64, activation="relu")(inputs)
x2 = layers.Dense(64, activation="relu")(x1)
outputs = layers.Dense(10, name="predictions")(x2)
model = keras.Model(inputs=inputs, outputs=outputs)

# Instantiate an optimizer.
optimizer = keras.optimizers.SGD(learning_rate=1e-3)
# Instantiate a loss function.
loss_fn = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
# Instantiate a metric to track training accuracy (used in train_step below).
train_acc_metric = keras.metrics.SparseCategoricalAccuracy()

# Prepare the training dataset.
batch_size = 64
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = np.reshape(x_train, (-1, 784))
x_test = np.reshape(x_test, (-1, 784))

# Reserve 10,000 samples for validation.
x_val = x_train[-10000:]
y_val = y_train[-10000:]
x_train = x_train[:-10000]
y_train = y_train[:-10000]

# Prepare the training dataset.
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
train_dataset = train_dataset.shuffle(buffer_size=1024).batch(batch_size)

# Prepare the validation dataset.
val_dataset = tf.data.Dataset.from_tensor_slices((x_val, y_val))
val_dataset = val_dataset.batch(batch_size)

If a training step over the full 64-sample batch fit in memory in one go, we could simply use:

@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        logits = model(x, training=True)
        loss_value = loss_fn(y, logits)
    grads = tape.gradient(loss_value, model.trainable_weights)
    optimizer.apply_gradients(zip(grads, model.trainable_weights))
    train_acc_metric.update_state(y, logits)
    return loss_value

import time

epochs = 10
for epoch in range(epochs):
    print("\nStart of epoch %d" % (epoch,))
    start_time = time.time()

    # Iterate over the batches of the dataset.
    for step, (x_batch_train, y_batch_train) in enumerate(train_dataset):
        loss_value = train_step(x_batch_train, y_batch_train)

        # Log every 200 batches.
        if step % 200 == 0:
            print(
                "Training loss (for one batch) at step %d: %.4f"
                % (step, float(loss_value))
            )
            print("Seen so far: %d samples" % ((step + 1) * batch_size))

However, how do I update train_step so that it runs, say, four mini-batches of size 16 to make up the full batch size of 64 (to cope with my more memory-intensive data) and then aggregates the gradients before updating the model?
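
Something like the following is roughly what I imagine the updated step could look like: split the incoming 64-sample batch into four slices, compute the gradients slice by slice, average them, and apply a single update. This is just a sketch under my own assumptions (accum_steps, sub_batch, x_mini and y_mini are names I made up), and I'm not sure it's correct, or that it actually reduces peak memory inside a single tf.function, since nothing obviously forces the slices to run strictly one after another:

accum_steps = 4  # e.g. four mini-batches of 16 make up the full batch of 64

@tf.function
def accumulating_train_step(x, y):
    # Size of each slice of the incoming batch.
    sub_batch = tf.shape(x)[0] // accum_steps
    # One zero-initialised accumulator per trainable weight.
    accum_grads = [tf.zeros_like(w) for w in model.trainable_weights]
    total_loss = 0.0
    for i in range(accum_steps):
        x_mini = x[i * sub_batch : (i + 1) * sub_batch]
        y_mini = y[i * sub_batch : (i + 1) * sub_batch]
        with tf.GradientTape() as tape:
            logits = model(x_mini, training=True)
            loss_value = loss_fn(y_mini, logits)
        grads = tape.gradient(loss_value, model.trainable_weights)
        accum_grads = [a + g for a, g in zip(accum_grads, grads)]
        total_loss += loss_value
        train_acc_metric.update_state(y_mini, logits)
    # Average the accumulated gradients and apply them in one optimizer update.
    mean_grads = [a / accum_steps for a in accum_grads]
    optimizer.apply_gradients(zip(mean_grads, model.trainable_weights))
    return total_loss / accum_steps

Is this along the right lines, or is there a more idiomatic way to do the accumulation?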

I tried writing a loop within the with tf.GradientTape() as tape: block and simply stacking the loss results, but I don't think this is the correct approach.
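
For reference, this is roughly what that attempt looked like (reconstructed with my own names, e.g. train_step_single_tape); my suspicion is that the single tape has to keep the activations of every slice alive for the backward pass, so it wouldn't actually reduce memory compared with one big forward pass:

accum_steps = 4  # same as above

@tf.function
def train_step_single_tape(x, y):
    sub_batch = tf.shape(x)[0] // accum_steps
    with tf.GradientTape() as tape:
        losses = []
        for i in range(accum_steps):
            x_mini = x[i * sub_batch : (i + 1) * sub_batch]
            y_mini = y[i * sub_batch : (i + 1) * sub_batch]
            logits = model(x_mini, training=True)
            losses.append(loss_fn(y_mini, logits))
        # Stack the per-slice losses and take their mean.
        loss_value = tf.reduce_mean(tf.stack(losses))
    grads = tape.gradient(loss_value, model.trainable_weights)
    optimizer.apply_gradients(zip(grads, model.trainable_weights))
    return loss_value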

I also thought about using tf.distribute.Strategy, but my understanding is that this is only for training across multiple machines or GPUs, so I don't see how I could use it here.

To summarise, what I want to do is agnostic to the dataset and model architecture. I guess I am looking for a Gradient AllReduce approach which, instead of splitting the mini-batches across different machines, just runs them iteratively (a rough sketch of what I mean follows the numbered list below). So it would need to:

  1. Compute the gradient using a minibatch.
  2. Compute the mean of the gradients from all mini-batches, using an AllReduce collective-style approach.
  3. Update the model with the averaged gradient.
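
A minimal sketch of those three steps, run sequentially on one machine rather than across devices, might look like this (all under my own assumptions; compute_grads, accum_grads, mini_batch_size and mini_dataset are names I made up, the dataset is batched at the mini-batch size of 16, and the optimizer is applied every accum_steps = 4 steps):

accum_steps = 4
mini_batch_size = 16  # 4 * 16 = the effective batch size of 64

mini_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
mini_dataset = mini_dataset.shuffle(buffer_size=1024).batch(mini_batch_size)

@tf.function
def compute_grads(x, y):
    # Step 1: gradients for a single mini-batch.
    with tf.GradientTape() as tape:
        logits = model(x, training=True)
        loss_value = loss_fn(y, logits)
    train_acc_metric.update_state(y, logits)
    return loss_value, tape.gradient(loss_value, model.trainable_weights)

# Non-trainable accumulator variables, one per trainable weight.
accum_grads = [
    tf.Variable(tf.zeros_like(w), trainable=False) for w in model.trainable_weights
]

for step, (x_mini, y_mini) in enumerate(mini_dataset):
    loss_value, grads = compute_grads(x_mini, y_mini)
    for a, g in zip(accum_grads, grads):
        a.assign_add(g)
    if (step + 1) % accum_steps == 0:
        # Step 2: mean of the gradients over the last accum_steps mini-batches.
        # Step 3: a single optimizer update with the averaged gradients.
        optimizer.apply_gradients(
            [(a / accum_steps, w) for a, w in zip(accum_grads, model.trainable_weights)]
        )
        for a in accum_grads:
            a.assign(tf.zeros_like(a))

Here each compute_grads call is a separate forward/backward pass over 16 samples, so I'd expect peak memory to be bounded by the mini-batch rather than the full batch of 64, but I'd like to know whether this is a sensible way to structure it.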

I assume this approach of applying the mean of the gradients would be far less memory-intensive than applying all of the gradients individually, as discussed here.
