Low (unstable) GPU utilization on DataCrunch

Posted 2025-02-07 10:23:26


I'm training my TensorFlow model on DataCrunch with a distributed MirroredStrategy (2× A100). I'm training on TFRecord datasets, where each TFRecord file contains 600 audio files, each 10 s long. The sample rate of each audio file is 48000 Hz. The problem is that during training the GPUs are not utilized properly, and GPU utilization is unstable. I monitored GPU usage with nvidia-smi. The graph of GPU utilization over 160 s while training is given below:
[Figure: nvidia-smi GPU utilization sampled over 160 s of training, fluctuating rather than staying steady]
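
For context, the distributed setup described above presumably looks like the following minimal sketch (build_model, epochs, and the compile arguments are assumptions; they are not shown in the question):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print('Replicas in sync:', strategy.num_replicas_in_sync)  # expected: 2

with strategy.scope():
    model = build_model()                         # hypothetical model factory
    model.compile(optimizer='adam', loss='mse')   # assumed settings

# ds is the tf.data pipeline built below; each global batch is split across the 2 GPUs
model.fit(ds, epochs=epochs,
          steps_per_epoch=args.steps_per_epoch if args.steps_per_epoch != -1 else None)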

The code for preparing the dataset is given below:

ds = tf.data.TFRecordDataset(files_ds, compression_type='ZLIB', num_parallel_reads=tf.data.AUTOTUNE)

# Prepare batches
ds = ds.batch(batch_size, drop_remainder=True)

# Parse a batch into a dataset of [noisy, clean] pairs
ds = ds.map(lambda x: _parse_batch(x, sample_rate, duration, split))

if args.steps_per_epoch != -1: ds = ds.repeat()

return ds.prefetch(buffer_size=tf.data.AUTOTUNE)
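
One detail worth noting: as written, the map call has no num_parallel_calls argument, so tf.data parses and augments batches on a single thread. For reference, here is a minimal sketch of the same pipeline with the map parallelized (same names as above; a common tuning knob for input-bound pipelines, not a confirmed diagnosis):

ds = tf.data.TFRecordDataset(files_ds, compression_type='ZLIB',
                             num_parallel_reads=tf.data.AUTOTUNE)
ds = ds.batch(batch_size, drop_remainder=True)

# Run _parse_batch (including augment) on multiple CPU threads
ds = ds.map(lambda x: _parse_batch(x, sample_rate, duration, split),
            num_parallel_calls=tf.data.AUTOTUNE)

if args.steps_per_epoch != -1:
    ds = ds.repeat()

return ds.prefetch(buffer_size=tf.data.AUTOTUNE)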

And the _parse_batch method is:

def _parse_batch(record_batch, sample_rate, duration, split):
    n_samples = sample_rate * duration

    # Create a description of the features
    feature_description = {
        'noisy': tf.io.FixedLenFeature([n_samples], tf.float32),
        'clean': tf.io.FixedLenFeature([n_samples], tf.float32),
    }
    # Parse the input `tf.Example` proto using the dictionary above
    example = tf.io.parse_example(record_batch, feature_description)
    noisy, clean = tf.expand_dims(example['noisy'], axis=-1), tf.expand_dims(example['clean'], axis=-1)
    noisy, clean = augment(noisy, clean)
    return noisy, clean
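
For reference, a record compatible with this feature_description would be serialized roughly as follows (a sketch; the writer code is not part of the question). With sample_rate = 48000 and duration = 10, each feature stores n_samples = 480000 floats:

import tensorflow as tf

def serialize_pair(noisy, clean):
    # noisy, clean: 1-D float sequences of length 480000 (48000 Hz * 10 s)
    feature = {
        'noisy': tf.train.Feature(float_list=tf.train.FloatList(value=noisy)),
        'clean': tf.train.Feature(float_list=tf.train.FloatList(value=clean)),
    }
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    return example.SerializeToString()

# Written with ZLIB compression to match the reader above:
# with tf.io.TFRecordWriter('out.tfrecords', options='ZLIB') as writer:
#     writer.write(serialize_pair(noisy, clean))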

Here is the code for augmentation:

import random  # needed by the probability checks below

def augment(noisy, clean):
    self_proba = 0.5
    self_initial = 0.3
    self_rt60 = (0.3, 1.3)
    self_first_delay = (0.01, 0.03)
    self_repeat = 3
    self_jitter = 0.1
    self_keep_clean = 0.1
    self_sample_rate = 48000

    if random.random() >= self_proba:
        return noisy, clean
    noise = noisy - clean
    initial = random.random() * self_initial
    first_delay = random.uniform(*self_first_delay)
    rt60 = random.uniform(*self_rt60)
    reverb_noise = _reverb(noise, initial, first_delay, rt60, self_repeat, self_jitter, self_sample_rate)
    noise += reverb_noise
    reverb_clean = _reverb(clean, initial, first_delay, rt60, self_repeat, self_jitter, self_sample_rate)
    clean += self_keep_clean * reverb_clean
    noise += (1 - self_keep_clean) * reverb_clean
    noisy = noise + clean
    return noisy, clean
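
One behavior worth knowing when reading augment: because _parse_batch runs inside Dataset.map, TensorFlow traces it into a graph once, and Python-level calls like random.random() execute only at trace time, so the same draw is baked in for every batch. A minimal sketch of per-batch gating with TensorFlow ops instead (the reverb body is factored into a hypothetical _do_augment helper):

import tensorflow as tf

def augment_tf(noisy, clean, proba=0.5):
    # tf.random ops draw a fresh value for every batch, even inside a traced map
    apply_aug = tf.random.uniform([]) < proba
    return tf.cond(apply_aug,
                   lambda: _do_augment(noisy, clean),  # hypothetical: reverb logic from above
                   lambda: (noisy, clean))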

Can anyone point out the reason the GPUs are not being utilized properly? How can I increase GPU utilization?
