Low (unstable) GPU utilization on DataCrunch

Posted 2025-02-07 10:23:26


I'm training my TensorFlow model on DataCrunch with a distributed MirroredStrategy (2× A100). I'm training on TFRecord datasets, where each TFRecord file contains 600 audio files, each 10 s long. The sample rate of each audio file is 48000 Hz. The problem is that during training the GPUs are not utilized properly, and GPU utilization is unstable. I monitored GPU usage with nvidia-smi. The graph of GPU utilization over 160 s while training is given below:
[Figure: nvidia-smi GPU utilization sampled over 160 s of training, fluctuating rather than staying steady]
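
For context, the distributed setup described above presumably looks like the following minimal sketch (build_model, epochs, and the compile arguments are assumptions; they are not shown in the question):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print('Replicas in sync:', strategy.num_replicas_in_sync)  # expected: 2

with strategy.scope():
    model = build_model()                         # hypothetical model factory
    model.compile(optimizer='adam', loss='mse')   # assumed settings

# ds is the tf.data pipeline built below; each global batch is split across the 2 GPUs
model.fit(ds, epochs=epochs,
          steps_per_epoch=args.steps_per_epoch if args.steps_per_epoch != -1 else None)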

The code for preparing the dataset is given below:

ds = tf.data.TFRecordDataset(files_ds, compression_type='ZLIB', num_parallel_reads=tf.data.AUTOTUNE)

# Prepare batches
ds = ds.batch(batch_size, drop_remainder=True)

# Parse a batch into a dataset of [noisy, clean] pairs
ds = ds.map(lambda x: _parse_batch(x, sample_rate, duration, split))

if args.steps_per_epoch != -1: ds = ds.repeat()

return ds.prefetch(buffer_size=tf.data.AUTOTUNE)
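
One detail worth noting: as written, the map call has no num_parallel_calls argument, so tf.data parses and augments batches on a single thread. For reference, here is a minimal sketch of the same pipeline with the map parallelized (same names as above; a common tuning knob for input-bound pipelines, not a confirmed diagnosis):

ds = tf.data.TFRecordDataset(files_ds, compression_type='ZLIB',
                             num_parallel_reads=tf.data.AUTOTUNE)
ds = ds.batch(batch_size, drop_remainder=True)

# Run _parse_batch (including augment) on multiple CPU threads
ds = ds.map(lambda x: _parse_batch(x, sample_rate, duration, split),
            num_parallel_calls=tf.data.AUTOTUNE)

if args.steps_per_epoch != -1:
    ds = ds.repeat()

return ds.prefetch(buffer_size=tf.data.AUTOTUNE)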

And the _parse_batch method is:

def _parse_batch(record_batch, sample_rate, duration, split):
    n_samples = sample_rate * duration

    # Create a description of the features
    feature_description = {
        'noisy': tf.io.FixedLenFeature([n_samples], tf.float32),
        'clean': tf.io.FixedLenFeature([n_samples], tf.float32),
    }
    # Parse the input `tf.Example` proto using the dictionary above
    example = tf.io.parse_example(record_batch, feature_description)
    noisy, clean = tf.expand_dims(example['noisy'], axis=-1), tf.expand_dims(example['clean'], axis=-1)
    noisy, clean = augment(noisy, clean)
    return noisy, clean
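
For reference, a record compatible with this feature_description would be serialized roughly as follows (a sketch; the writer code is not part of the question). With sample_rate = 48000 and duration = 10, each feature stores n_samples = 480000 floats:

import tensorflow as tf

def serialize_pair(noisy, clean):
    # noisy, clean: 1-D float sequences of length 480000 (48000 Hz * 10 s)
    feature = {
        'noisy': tf.train.Feature(float_list=tf.train.FloatList(value=noisy)),
        'clean': tf.train.Feature(float_list=tf.train.FloatList(value=clean)),
    }
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    return example.SerializeToString()

# Written with ZLIB compression to match the reader above:
# with tf.io.TFRecordWriter('out.tfrecords', options='ZLIB') as writer:
#     writer.write(serialize_pair(noisy, clean))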

Here is the code for augmentation:

import random  # needed by the probability checks below

def augment(noisy, clean):
    self_proba = 0.5
    self_initial = 0.3
    self_rt60 = (0.3, 1.3)
    self_first_delay = (0.01, 0.03)
    self_repeat = 3
    self_jitter = 0.1
    self_keep_clean = 0.1
    self_sample_rate = 48000

    if random.random() >= self_proba:
        return noisy, clean
    noise = noisy - clean
    initial = random.random() * self_initial
    first_delay = random.uniform(*self_first_delay)
    rt60 = random.uniform(*self_rt60)
    reverb_noise = _reverb(noise, initial, first_delay, rt60, self_repeat, self_jitter, self_sample_rate)
    noise += reverb_noise
    reverb_clean = _reverb(clean, initial, first_delay, rt60, self_repeat, self_jitter, self_sample_rate)
    clean += self_keep_clean * reverb_clean
    noise += (1 - self_keep_clean) * reverb_clean
    noisy = noise + clean
    return noisy, clean
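
One behavior worth knowing when reading augment: because _parse_batch runs inside Dataset.map, TensorFlow traces it into a graph once, and Python-level calls like random.random() execute only at trace time, so the same draw is baked in for every batch. A minimal sketch of per-batch gating with TensorFlow ops instead (the reverb body is factored into a hypothetical _do_augment helper):

import tensorflow as tf

def augment_tf(noisy, clean, proba=0.5):
    # tf.random ops draw a fresh value for every batch, even inside a traced map
    apply_aug = tf.random.uniform([]) < proba
    return tf.cond(apply_aug,
                   lambda: _do_augment(noisy, clean),  # hypothetical: reverb logic from above
                   lambda: (noisy, clean))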

Can anyone point out the reason the GPUs are not being utilized properly? How can I increase GPU utilization?
