TensorFlow: Memory growth cannot differ between GPU devices | How to use multiple GPUs with TensorFlow

Published 2025-01-11 09:07:07


I am trying to run Keras code on a GPU node within a cluster. Each GPU node has 4 GPUs. I made sure all 4 GPUs on the node are available for my use. I run the code below to let TensorFlow use the GPUs:

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
            logical_gpus = tf.config.list_logical_devices('GPU')
            print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        print(e)
        

The 4 available GPUs are listed in the output. However, I get the following error when running the code:

Traceback (most recent call last):
  File "/BayesOptimization.py", line 20, in <module>
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
  File "/.conda/envs/thesis/lib/python3.9/site-packages/tensorflow/python/framework/config.py", line 439, in list_logical_devices
    return context.context().list_logical_devices(device_type=device_type)
  File "/.conda/envs/thesis/lib/python3.9/site-packages/tensorflow/python/eager/context.py", line 1368, in list_logical_devices
    self.ensure_initialized()
  File "/.conda/envs/thesis/lib/python3.9/site-packages/tensorflow/python/eager/context.py", line 511, in ensure_initialized
    config_str = self.config.SerializeToString()
  File "/.conda/envs/thesis/lib/python3.9/site-packages/tensorflow/python/eager/context.py", line 1015, in config
    gpu_options = self._compute_gpu_options()
  File "/.conda/envs/thesis/lib/python3.9/site-packages/tensorflow/python/eager/context.py", line 1074, in _compute_gpu_options
    raise ValueError("Memory growth cannot differ between GPU devices")
ValueError: Memory growth cannot differ between GPU devices

Shouldn't the code list all the available GPUs and set memory growth to true for each one?

I am currently using the following TensorFlow libraries with Python 3.9.7:

tensorflow                2.4.1           gpu_py39h8236f22_0
tensorflow-base           2.4.1           gpu_py39h29c2da4_0
tensorflow-estimator      2.4.1              pyheb71bc4_0
tensorflow-gpu            2.4.1                h30adc30_0

Any idea what the problem is and how to solve it? Thanks in advance!

Comments (3)

岁月无声 2025-01-18 09:07:07


Try setting only os.environ["CUDA_VISIBLE_DEVICES"] = "0"
instead of calling tf.config.experimental.set_memory_growth.
This worked for me.
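One caveat with this approach: CUDA_VISIBLE_DEVICES only takes effect if it is set before TensorFlow initializes the CUDA runtime, so it must be set (or exported in the shell) before the tensorflow import. A minimal sketch:

```python
import os
import sys

# CUDA_VISIBLE_DEVICES must be set BEFORE importing TensorFlow:
# TF reads it once, when it first initializes the CUDA runtime.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"   # expose only the first GPU
# "0,1,2,3" would expose all four; "" would hide every GPU.

# Guard against setting it too late in this process:
assert "tensorflow" not in sys.modules, \
    "set CUDA_VISIBLE_DEVICES before importing TensorFlow"

# import tensorflow as tf   # safe to import now
```

Note that "0" restricts the process to a single GPU, which sidesteps the error but gives up the other three GPUs; for multi-GPU use the loop-based fix below is needed instead.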

我喜欢麦丽素 2025-01-18 09:07:07


On multi-GPU machines, memory growth must be the same across all available GPUs: either set it to true for every GPU or leave it false for all of them.

gpus = tf.config.list_physical_devices('GPU')
if gpus:
  try:
    # Currently, memory growth needs to be the same across GPUs
    for gpu in gpus:
      tf.config.experimental.set_memory_growth(gpu, True)
    logical_gpus = tf.config.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Memory growth must be set before GPUs have been initialized
    print(e)

TensorFlow GPU documentation

自在安然 2025-01-18 09:07:07


I think it's just a matter of being careful with indentation. For me, your code does not work:

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
            logical_gpus = tf.config.list_logical_devices('GPU')
            print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        print(e)

But if you wait and call logical_gpus = tf.config.list_logical_devices('GPU') only at the end, after the loop, then it works. So change it to:

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        print(e)

I think what is happening is that you need to set memory growth on ALL visible GPUs before calling any other accessor on the config. So your other option is to call tf.config.set_visible_devices first. If you limit TensorFlow to one GPU, you don't need the loop; if you use two or more, you need to be careful to call set_memory_growth on all of them before accessing the config or running any TF code.
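A sketch of that alternative (assuming TensorFlow 2.x; on a machine with no GPUs both lists simply come back empty and nothing is configured):

```python
import tensorflow as tf

# All device configuration must happen before the runtime context is
# initialized, i.e. before the first op or list_logical_devices call.
gpus = tf.config.list_physical_devices('GPU')

if gpus:
    # Option A: restrict TF to a single GPU -- then no loop is needed,
    # and memory growth cannot differ because only one GPU is visible.
    tf.config.set_visible_devices(gpus[0], 'GPU')
    tf.config.experimental.set_memory_growth(gpus[0], True)
    # Option B (multi-GPU): keep all GPUs visible, but set memory
    # growth on every one of them before touching the config again.

# Only now is it safe to initialize the context:
logical_gpus = tf.config.list_logical_devices('GPU')
print(len(gpus), "physical GPU(s),", len(logical_gpus), "logical GPU(s)")
```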
