Epochs vs. Rounds in Federated Learning
I am applying federated averaging to my federated learning model. After running the model for thousands of rounds, it still has not converged.
How can I increase the number of epochs in training, and how does that differ from the number of rounds?
And how can I reach convergence? I tried increasing the number of rounds, but training takes too long (I am using Google Colab, where execution time cannot exceed 24 hours; I also subscribed to Google Colab Pro to use a GPU, but it did not help much).
The code and the training results are provided below.
import tensorflow as tf
import tensorflow_federated as tff

# Build one tf.data.Dataset per client: 2 local epochs per round, batches of 50.
train_data = [
    train.create_tf_dataset_for_client(c)
         .repeat(2)
         .map(reshape_data)
         .batch(batch_size=50, num_parallel_calls=50)
    for c in train_client_ids
]

iterative_process = tff.learning.build_federated_averaging_process(
    model_fn,
    client_optimizer_fn=lambda: tf.keras.optimizers.SGD(learning_rate=0.0001),
    server_optimizer_fn=lambda: tf.keras.optimizers.SGD(learning_rate=0.9))

NUM_ROUNDS = 50000
state = iterative_process.initialize()

# Log training metrics for TensorBoard.
logdir = "/tmp/logs/scalars/training/"
summary_writer = tf.summary.create_file_writer(logdir)

with summary_writer.as_default():
    for round_num in range(NUM_ROUNDS):
        state, metrics = iterative_process.next(state, train_data)
        if round_num % 1000 == 0:
            print('round {:2d}, metrics={}'.format(round_num, metrics))
        for name, value in metrics['train'].items():
            tf.summary.scalar(name, value, step=round_num)
1 Answer
See this tutorial for how to increase epochs (basically, increase the number passed to .repeat()). The number of epochs is the number of passes each client makes over its local dataset within a single round. The number of rounds is the total number of federated computations, i.e., how many times client updates are aggregated on the server.
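If more local computation per round is what you want, the only change needed is the argument to .repeat(). A minimal sketch, reusing train, reshape_data, and train_client_ids from the question (NUM_EPOCHS and the reduced NUM_ROUNDS are illustrative values, not tuned recommendations):

NUM_EPOCHS = 10  # epochs: local passes each client makes over its data per round

train_data = [
    train.create_tf_dataset_for_client(c)
         .repeat(NUM_EPOCHS)  # was .repeat(2); raises local epochs per round
         .map(reshape_data)
         .batch(batch_size=50, num_parallel_calls=50)
    for c in train_client_ids
]

NUM_ROUNDS = 5000  # rounds: federated computations, i.e. server aggregation steps

Each call to iterative_process.next() still performs exactly one round of federated averaging; increasing .repeat() only increases how much each client trains locally before its update is aggregated, which often lets you use fewer (expensive) communication rounds.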