TensorFlow 2 Keras model trains very slowly on CPU, and most CPU cores (> 95%) are idle
I am trying to train a neural network model (TensorFlow 2.8) on a CPU (EC2 instance m4.10 with about 160 GB of memory and 40 CPU cores) from a Jupyter notebook. The training data is loaded from 300+ gzip files (each file is 200+ MB) and processed as a dataset. However, training is very slow: each epoch takes 75 minutes.
The code:
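import tensorflow as tf
import tensorflow.keras as keras
from tensorflow.keras.optimizers import Adam
# Force every tf.function to execute eagerly (op by op, without graph compilation)
tf.config.run_functions_eagerly(True)
# model, learning_rate and train_data are defined earlier in the notebook
model.compile(optimizer=Adam(learning_rate),
              loss=keras.losses.BinaryCrossentropy(),
              run_eagerly=True,
              metrics=[keras.metrics.BinaryAccuracy()])
model.fit(train_data, epochs=10, steps_per_epoch=1000, validation_steps=100,
          workers=16, use_multiprocessing=True)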
When the model is being trained, only 1 or 2 CPU cores are busy and all other 38 cores are idle.
I have tried eager=False, but it made no difference. I have read several posts about why tf2 is slower than tf1, but none of them explain why most CPU cores are idle.
What am I missing here?
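For anyone hitting the same symptom, here is a minimal sketch of two things that may be worth checking. It is not taken from the original notebook. It assumes the shards are gzip-compressed text with one record per line, which the question does not state, and `make_dataset` and `report_settings` are hypothetical helper names. A single gzip stream has to be decompressed sequentially, so any parallelism must come from reading several files at once. The sketch does that with `interleave`. It also reports whether functions are forced to run eagerly and how the thread pools are configured, since eager execution may limit how much work TensorFlow can spread across cores.

```python
import gzip
import os
import tempfile

import tensorflow as tf


def make_dataset(file_pattern, batch_size=256, cycle_length=16):
    """Read gzip-compressed text shards in parallel (hypothetical helper)."""
    files = tf.data.Dataset.list_files(file_pattern, shuffle=True)
    lines = files.interleave(
        # Assumption: each shard is gzip-compressed text, one record per line.
        lambda path: tf.data.TextLineDataset(path, compression_type="GZIP"),
        cycle_length=cycle_length,            # number of shards read concurrently
        num_parallel_calls=tf.data.AUTOTUNE,  # let tf.data choose the parallelism
        deterministic=False,                  # allow records to arrive out of order
    )
    # Record parsing would go here, for example:
    # lines = lines.map(parse_fn, num_parallel_calls=tf.data.AUTOTUNE)
    return lines.batch(batch_size).prefetch(tf.data.AUTOTUNE)


def report_settings():
    """Collect settings that influence how many cores TensorFlow can use."""
    return {
        # True after tf.config.run_functions_eagerly(True) has been called.
        "functions_run_eagerly": tf.config.functions_run_eagerly(),
        # 0 means TensorFlow picks the thread count itself.
        "intra_op_threads": tf.config.threading.get_intra_op_parallelism_threads(),
        "inter_op_threads": tf.config.threading.get_inter_op_parallelism_threads(),
    }


if __name__ == "__main__":
    # Tiny synthetic shards so the sketch can run without the real data.
    tmp_dir = tempfile.mkdtemp()
    for i in range(4):
        with gzip.open(os.path.join(tmp_dir, f"part-{i}.gz"), "wt") as f:
            f.write("".join(f"{i},{j}\n" for j in range(100)))

    dataset = make_dataset(os.path.join(tmp_dir, "part-*.gz"), batch_size=32)
    print("records read:", sum(int(batch.shape[0]) for batch in dataset))
    print(report_settings())
```

If `functions_run_eagerly` is reported as True, removing both `tf.config.run_functions_eagerly(True)` and `run_eagerly=True` before comparing epoch times would be a reasonable first experiment. The `cycle_length` of 16 is only a starting guess for a 40-core machine.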