TensorFlow 2 Keras model trains very slowly on CPU, and most CPU cores (> 95%) are idle

Posted on 2025-01-19 05:28:15

I am trying to train a neural network model (TensorFlow 2.8) on CPU (an EC2 m4.10 instance with about 160 GB of memory and 40 CPU cores) from a Jupyter notebook. The training data is loaded from 300+ gzip files (each larger than 200 MB) and processed as a dataset. However, training is very slow: each epoch takes 75 minutes.
The code:

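 import tensorflow as tf
 from tensorflow import keras
 from tensorflow.keras.optimizers import Adam

 # model, learning_rate and train_data are defined elsewhere in the notebook.
 tf.config.run_functions_eagerly(True)  # run tf.function code eagerly
 model.compile(optimizer=Adam(learning_rate),
               loss=keras.losses.BinaryCrossentropy(),
               run_eagerly=True,
               metrics=[keras.metrics.BinaryAccuracy()])

 model.fit(train_data, epochs=10, steps_per_epoch=1000, validation_steps=100,
           workers=16, use_multiprocessing=True)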

When the model is being trained, only 1 or 2 CPU cores are busy and all other 38 cores are idle.
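A quick way to rule out a restriction on the notebook process itself is to compare the number of cores on the machine with the number the process is allowed to use. The following is only a minimal sketch using the Python standard library; the helper name `cpu_visibility` is made up for illustration, and `os.sched_getaffinity` is assumed to be available (it is on Linux).

```python
import os


def cpu_visibility():
    """Return (logical cores on the machine, cores this process may run on)."""
    total = os.cpu_count() or 1
    try:
        # Linux only: the set of cores the current process is allowed to use.
        usable = len(os.sched_getaffinity(0))
    except AttributeError:
        # Platforms without sched_getaffinity: assume no restriction.
        usable = total
    return total, usable


total_cores, usable_cores = cpu_visibility()
print(f"machine cores: {total_cores}, usable by this process: {usable_cores}")
# Thread counts are sometimes capped through environment variables such as this one.
print("OMP_NUM_THREADS =", os.environ.get("OMP_NUM_THREADS", "<not set>"))
```

If the second number is far below 40, or a thread-count variable is set to a small value, the limit comes from the environment rather than from the training code.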

I have tried run_eagerly=False, but it made no difference. I have read several posts about why TF2 is slower than TF1, but none of them explain why most CPU cores sit idle.

What am I missing here?
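For reference, a common first suspect in this kind of setup is the input pipeline: if the gzip files are read and decompressed one after another, a single thread does most of the work. Note also that, per the Keras documentation, `workers` and `use_multiprocessing` apply only to generator or `keras.utils.Sequence` inputs, not to a `tf.data.Dataset`. Below is a hedged sketch of a `tf.data` pipeline that reads several gzip files concurrently. The file format is not stated in the question, so the code assumes line-oriented text with one `feature,label` record per line; the temporary files, the `parse_line` helper, and the values for `cycle_length` and batch size are all made up for illustration.

```python
import gzip
import os
import tempfile

import tensorflow as tf

# Hypothetical stand-in for the real data: a few small gzip text files,
# one "feature,label" record per line. The real format is an assumption.
data_dir = tempfile.mkdtemp()
for i in range(4):
    with gzip.open(os.path.join(data_dir, f"part-{i}.gz"), "wt") as f:
        for j in range(8):
            f.write(f"{i + j / 10:.1f},{j % 2}\n")


def parse_line(line):
    """Split one 'feature,label' line into (features, label) tensors."""
    feature, label = tf.io.decode_csv(line, record_defaults=[[0.0], [0.0]])
    return tf.expand_dims(feature, -1), label


files = tf.data.Dataset.list_files(os.path.join(data_dir, "*.gz"), shuffle=True)
train_data = (
    files.interleave(
        lambda path: tf.data.TextLineDataset(path, compression_type="GZIP"),
        cycle_length=4,                       # number of files read concurrently
        num_parallel_calls=tf.data.AUTOTUNE,  # read and decompress in parallel
        deterministic=False,                  # allow out-of-order records
    )
    .map(parse_line, num_parallel_calls=tf.data.AUTOTUNE)  # parse in parallel
    .batch(8)
    .prefetch(tf.data.AUTOTUNE)  # overlap input processing with training
)

# Count the records to confirm that every file was read.
num_examples = sum(int(labels.shape[0]) for _, labels in train_data)
print("examples read:", num_examples)
```

A dataset built this way can be passed to `model.fit` in place of `train_data`. Whether it helps depends on where the time actually goes; with `run_eagerly=True` the training step itself also runs op by op from Python, which can limit CPU utilization regardless of the input pipeline.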
