在实际集群中使用 MultiWorkerMirroredStrategy 时，Tensorflow 分布式学习不起作用

发布于 2025-01-09 02:39:56 字数 558 浏览 0 评论 0原文

我只是想遵循tensorflow文档中的 MultiWorkerMirroredStrategy 的示例。我在具有单个节点的本地主机中成功进行了训练。但是，我在有两个节点的集群中训练失败。我尝试过禁用防火墙，但没有解决问题。

这是main.py。（我在节点 1 和节点 2 中运行相同的代码，除了 tf_config 变量。我将 node1 的 tf_config 设置为 tf_config['task']['index']=0，将 node2 的 tf_config 设置为 tf_config['任务']['索引']=1)

任何帮助表示赞赏。谢谢。

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

小清晰的声音 2025-01-16 02:39:56

我发现您没有错误代码，但我想我可以推断问题可能出现在哪里，因为您的代码应该可以工作。一旦有机会，我将在我的 kubernetes 上进行测试（我有一个 atm 节点）。

最有可能出现的问题。您正在使用 json.dumps() 设置环境变量。在许多设置中，您应该使用：
tf_config=json.loads(os.environ.get(TF_CONFIG) 或 '{}'),
TASK_INDEX=tf_config['任务']['索引']

这应该可以解决暴露端口和 IP 配置的任何问题。

- 听起来你使用的方法是在笔记本上？因为您没有为 main.py 运行相同的代码。就像在一个 main.py 中一样，您设置 1，另一个设置 0。无论哪种方式，这都不是您在这里所做的。您将索引设置为 1 和 0，但您不会仅返回索引，您将返回完整的集群规范以及您设置的索引。如果您的集群未设置环境变量，您将需要取回已设置的 TF_CONFIG，然后使用 load 将其设置为您的 tf_config，现在您将仅获取该节点的副本索引。

如果您使用的是笔记本电脑，则需要将其连接到集群环境，否则您将向您的计算机设置本地环境变量，而不是集群上的容器。考虑使用 Kubeflow 来管理此问题。
您可以在设置集群后从笔记本启动
配置操作，或将 TF_job 规范构建为定义节点规范的 YAML，然后使用该规范启动 pod。
无论哪种方式，集群实际上都需要具有该配置，您应该能够加载集群中的环境，以便为每个节点分配一个索引，并且您可以从您在设置时设置的节点副本 ID 中获取该索引启动节点并使用 YAML 或 json 字典指定。如果 kubernetes 上的副本索引：{num} 与容器上的环境变量（启动 pod 时分配的环境变量）不匹配，则在本地容器中运行的本地设置环境对实际集群没有任何意义。

-尝试创建一个函数，该函数将返回每个工作线程的索引，以测试它是否在 kubernetes 仪表板或 kubectl 上设置为相同的副本索引。确保让函数打印出来，以便您可以在 pod 日志中看到它。这将有助于调试。

- 查看 Pod 日志并查看 Pod 是否正在连接到服务器，以及是否正在使用与您的集群兼容的任何通信规范：grcp/等。您没有设置通信策略，但在大多数情况下它应该能够自动为您找到它（只需检查以防万一）。

-如果您能够启动 Pod，请确保在重试之前将其终止。一旦你掌握了 python pipeline skd 的窍门，kubeflow 会让你的事情变得更加容易。您可以将函数作为容器启动。您可以通过终止旧的 Pod 来构建一个清理操作。

-您应该考虑将 main.py 和任何其他支持模块加载到存储库（例如 dockerhub）中的映像上，以便容器可以加载映像。使用多工作策略，每台机器都需要具有相同的数据才能正确分片。再次检查您的 pod 日志，看看它是否无法对数据进行分片。

- 您是否在具有不同 GPU 的本地计算机上运行？如果是这样，您应该使用镜像策略而不是多人工作。

I see that you don't have an error code, but I think I can infer where the issue could be arising, since your code should work. I will test on my kubernetes once I get a chance (I have a node down atm).

The most likely issue. You are using json.dumps() to set the environment variable. In many setting the you should be using:
tf_config=json.loads(os.environ.get(TF_CONFIG) or '{}'),
TASK_INDEX=tf_config['task']['index']

That should clear up any issues with expose ports and ip configurations.

-It sounds like the method you are using is in a notebook? Since you are not running the the same code for main.py. As in one main.py you set 1 and the other 0. Either way that is not what you are doing here. You are setting the index to 1 and 0 but you are not getting back only the index, you are getting back the full cluster spec with the index you set it to. If the environment variable is not set from your cluster, you will need to get back the TF_CONFIG that was set, and then use loads to set that as your tf_config, now you will be getting ONLY the replica index for that node.

If you are using a notebook it needs to be connected to the cluster environment, otherwise you are setting a local environment variable to your machine, and not to containers on the cluster. Consider using Kubeflow, to manage this.
You can either launch from the notebook after setting up your cluster
configuration op, or build a TF_job spec as a YAML that defined the node specs, then launch the pods using that spec.
Either way, the cluster needs to actually have that configuration, you should be able to load the environment in the cluster such that each node is ASSIGNED an index and you are getting that index from THAT nodes replica ID that you set when you launched the nodes and specified with a YAML or json dictionary. A locally set environment running within the local container means nothing to the actual cluster, if the replica-index:{num} on kubernetes does not match the environment variable on the container —That is assigned when the pod is launched.

-Try making a function that will return what the index of each worker is to test if it is set to the same replica-index on your kubernetes dashboard or from kubectl. Make sure to have the function print that out so you can see it in the pod logs. This will help with debugging.

-Look at the pod logs and see if the pods are connecting to the server, and are using whatever communication spec is compatible with your cluster: grcp/etc. You are not setting a communication strategy, but it should be able to automatically find it for you in most cases (just check in case).

-If you are able to launch pods, make sure you are terminating them before trying again. Again kubeflow is going to make things so much easier for you once you get the hang of their python pipeline skd. You can launch functions as containers. You can build an op that clean up, by terminating old pods.

-You should consider having your main.py and any other supporting modules loaded on an image in a repository, such as dockerhub, so that the containers can load the image. With Multiworker Strategy, each machine needs to have the same data for it to be sharded properly. Again check your pod logs to see if it cannot shard the data.

-Are you running on a local machine with different GPUs? If so you should be using Mirrored Strategy NOT multiworker.

回复收藏 0 原文

断桥再见 2025-01-16 02:39:56

所以有很多可能出问题的地方都与你的配置有关，你运行的是什么？您使用的是 minikube，还是拆分了 CPU 核心？我可以提出一般性改进建议，帮助您的代码处理更多通信选项。此外，您似乎没有数据分发选项。您放置在策略上下文管理器范围内的唯一内容是模型定义和编译。看起来这两个节点共享同一端口。最好将 TF_config 设置为传递到

'tf_config = json.loads(os.environ.get('TF_CONFIG') or '{}')' 的

环境变量，这是我在过去它既可以通过 mini-kube 在多个 CPU 线程上运行，也可以在 Kubernetes 框架上运行。

模型替换为简单的 mnist 模型。

from __future__ import absolute_import, division, print_function

import argparse
import json
import os

import tensorflow_datasets as tfds
import tensorflow as tf
from tensorflow.keras import layers, models


def make_datasets_unbatched():
  BUFFER_SIZE = 10000

  # Scaling MNIST data from (0, 255] to (0., 1.]
  def scale(image, label):
    image = tf.cast(image, tf.float32)
    image /= 255
    return image, label

  datasets, _ = tfds.load(name='mnist', with_info=True, as_supervised=True)

  return datasets['train'].map(scale).cache().shuffle(BUFFER_SIZE)


def build_and_compile_cnn_model():
  model = models.Sequential()
  model.add(
      layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
  model.add(layers.MaxPooling2D((2, 2)))
  model.add(layers.Conv2D(64, (3, 3), activation='relu'))
  model.add(layers.MaxPooling2D((2, 2)))
  model.add(layers.Conv2D(64, (3, 3), activation='relu'))
  model.add(layers.Flatten())
  model.add(layers.Dense(64, activation='relu'))
  model.add(layers.Dense(10, activation='softmax'))

  model.summary()

  model.compile(optimizer='adam',
                loss='sparse_categorical_crossentropy',
                metrics=['accuracy'])

  return model


def decay(epoch):
  if epoch < 3: #pylint: disable=no-else-return
    return 1e-3
  if 3 <= epoch < 7:
    return 1e-4
  return 1e-5


def main(args):

  # MultiWorkerMirroredStrategy creates copies of all variables in the model's
  # layers on each device across all workers
  # if your GPUs don't support NCCL, replace "communication" with another
  strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
      communication=tf.distribute.experimental.CollectiveCommunication.NCCL)

  BATCH_SIZE_PER_REPLICA = 64
  BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync

  with strategy.scope():
    ds_train = make_datasets_unbatched().batch(BATCH_SIZE).repeat()
    options = tf.data.Options()
    options.experimental_distribute.auto_shard_policy = \
        tf.data.experimental.AutoShardPolicy.DATA
    ds_train = ds_train.with_options(options)
    # Model building/compiling need to be within `strategy.scope()`.
    multi_worker_model = build_and_compile_cnn_model()

  # Define the checkpoint directory to store the checkpoints
  checkpoint_dir = args.checkpoint_dir

  # Name of the checkpoint files
  checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

  # Function for decaying the learning rate.
  # You can define any decay function you need.
  # Callback for printing the LR at the end of each epoch.
  class PrintLR(tf.keras.callbacks.Callback):

    def on_epoch_end(self, epoch, logs=None): #pylint: disable=no-self-use
      print('\nLearning rate for epoch {} is {}'.format(
        epoch + 1, multi_worker_model.optimizer.lr.numpy()))

  callbacks = [
      tf.keras.callbacks.TensorBoard(log_dir='./logs'),
      tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_prefix,
                                         save_weights_only=True),
      tf.keras.callbacks.LearningRateScheduler(decay),
      PrintLR()
  ]

  # Keras' `model.fit()` trains the model with specified number of epochs and
  # number of steps per epoch. Note that the numbers here are for demonstration
  # purposes only and may not sufficiently produce a model with good quality.
  multi_worker_model.fit(ds_train,
                         epochs=10,
                         steps_per_epoch=70,
                         callbacks=callbacks)

  # Saving a model
  # Let `is_chief` be a utility function that inspects the cluster spec and
  # current task type and returns True if the worker is the chief and False
  # otherwise.
  def is_chief():
    return TASK_INDEX == 0

  if is_chief():
    model_path = args.saved_model_dir

  else:
    # Save to a path that is unique across workers.
    model_path = args.saved_model_dir + '/worker_tmp_' + str(TASK_INDEX)

  multi_worker_model.save(model_path)


if __name__ == '__main__':
  os.environ['NCCL_DEBUG'] = 'INFO'

  tfds.disable_progress_bar()

  # to decide if a worker is chief, get TASK_INDEX in Cluster info
  tf_config = json.loads(os.environ.get('TF_CONFIG') or '{}')
  TASK_INDEX = tf_config['task']['index']

  parser = argparse.ArgumentParser()
  parser.add_argument('--saved_model_dir',
                      type=str,
                      required=True,
                      help='Tensorflow export directory.')

  parser.add_argument('--checkpoint_dir',
                      type=str,
                      required=True,
                      help='Tensorflow checkpoint directory.')

  parsed_args = parser.parse_args()
  main(parsed_args)

So there are a lot of things that could be going wrong that are related to your configuration, what are you running on? Are you using minikube, or are you splitting your CPU cores? I can suggest general improvements that could help with your code to handle more communication options. Also you don't seem to have data distribution options. The only thing you place within the scope of the strategy context manager is the model definition and compile. It also looks like those two nodes are sharing the same port. It's better to set TF_config as an environment variable that is passed into

'tf_config = json.loads(os.environ.get('TF_CONFIG') or '{}')'

Here is an example of set-ups I have used in the past it worked both running on multiple CPU threads through mini-kube, and on a Kubernetes framework.

Model replaced with a simple mnist model.

from __future__ import absolute_import, division, print_function

import argparse
import json
import os

import tensorflow_datasets as tfds
import tensorflow as tf
from tensorflow.keras import layers, models


def make_datasets_unbatched():
  BUFFER_SIZE = 10000

  # Scaling MNIST data from (0, 255] to (0., 1.]
  def scale(image, label):
    image = tf.cast(image, tf.float32)
    image /= 255
    return image, label

  datasets, _ = tfds.load(name='mnist', with_info=True, as_supervised=True)

  return datasets['train'].map(scale).cache().shuffle(BUFFER_SIZE)


def build_and_compile_cnn_model():
  model = models.Sequential()
  model.add(
      layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
  model.add(layers.MaxPooling2D((2, 2)))
  model.add(layers.Conv2D(64, (3, 3), activation='relu'))
  model.add(layers.MaxPooling2D((2, 2)))
  model.add(layers.Conv2D(64, (3, 3), activation='relu'))
  model.add(layers.Flatten())
  model.add(layers.Dense(64, activation='relu'))
  model.add(layers.Dense(10, activation='softmax'))

  model.summary()

  model.compile(optimizer='adam',
                loss='sparse_categorical_crossentropy',
                metrics=['accuracy'])

  return model


def decay(epoch):
  if epoch < 3: #pylint: disable=no-else-return
    return 1e-3
  if 3 <= epoch < 7:
    return 1e-4
  return 1e-5


def main(args):

  # MultiWorkerMirroredStrategy creates copies of all variables in the model's
  # layers on each device across all workers
  # if your GPUs don't support NCCL, replace "communication" with another
  strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
      communication=tf.distribute.experimental.CollectiveCommunication.NCCL)

  BATCH_SIZE_PER_REPLICA = 64
  BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync

  with strategy.scope():
    ds_train = make_datasets_unbatched().batch(BATCH_SIZE).repeat()
    options = tf.data.Options()
    options.experimental_distribute.auto_shard_policy = \
        tf.data.experimental.AutoShardPolicy.DATA
    ds_train = ds_train.with_options(options)
    # Model building/compiling need to be within `strategy.scope()`.
    multi_worker_model = build_and_compile_cnn_model()

  # Define the checkpoint directory to store the checkpoints
  checkpoint_dir = args.checkpoint_dir

  # Name of the checkpoint files
  checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

  # Function for decaying the learning rate.
  # You can define any decay function you need.
  # Callback for printing the LR at the end of each epoch.
  class PrintLR(tf.keras.callbacks.Callback):

    def on_epoch_end(self, epoch, logs=None): #pylint: disable=no-self-use
      print('\nLearning rate for epoch {} is {}'.format(
        epoch + 1, multi_worker_model.optimizer.lr.numpy()))

  callbacks = [
      tf.keras.callbacks.TensorBoard(log_dir='./logs'),
      tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_prefix,
                                         save_weights_only=True),
      tf.keras.callbacks.LearningRateScheduler(decay),
      PrintLR()
  ]

  # Keras' `model.fit()` trains the model with specified number of epochs and
  # number of steps per epoch. Note that the numbers here are for demonstration
  # purposes only and may not sufficiently produce a model with good quality.
  multi_worker_model.fit(ds_train,
                         epochs=10,
                         steps_per_epoch=70,
                         callbacks=callbacks)

  # Saving a model
  # Let `is_chief` be a utility function that inspects the cluster spec and
  # current task type and returns True if the worker is the chief and False
  # otherwise.
  def is_chief():
    return TASK_INDEX == 0

  if is_chief():
    model_path = args.saved_model_dir

  else:
    # Save to a path that is unique across workers.
    model_path = args.saved_model_dir + '/worker_tmp_' + str(TASK_INDEX)

  multi_worker_model.save(model_path)


if __name__ == '__main__':
  os.environ['NCCL_DEBUG'] = 'INFO'

  tfds.disable_progress_bar()

  # to decide if a worker is chief, get TASK_INDEX in Cluster info
  tf_config = json.loads(os.environ.get('TF_CONFIG') or '{}')
  TASK_INDEX = tf_config['task']['index']

  parser = argparse.ArgumentParser()
  parser.add_argument('--saved_model_dir',
                      type=str,
                      required=True,
                      help='Tensorflow export directory.')

  parser.add_argument('--checkpoint_dir',
                      type=str,
                      required=True,
                      help='Tensorflow checkpoint directory.')

  parsed_args = parser.parse_args()
  main(parsed_args)

回复收藏 0 原文

~没有更多了~