Memory usage on the node keeps increasing when training a model with Ray Tune
This is the first time I am using Ray Tune to look for the best hyperparameters for a DL model, and I am running into some problems related to memory usage.
The "Memory usage on this node" value keeps increasing, which eventually causes the trial runs to fail. Below is what I see while the script is running.
== Status ==
Current time: 2022-06-16 13:27:43 (running for 00:09:14.60)
Memory usage on this node: 26.0/62.8 GiB
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 4.000: None | Iter 2.000: None | Iter 1.000: None
Resources requested: 8.0/8 CPUs, 1.0/1 GPUs, 0.0/51.37 GiB heap, 0.0/0.93 GiB objects (0.0/1.0 accelerator_type:A40)
Result logdir: /app/ray_experiment_0
Number of trials: 3/20 (1 PENDING, 2 RUNNING)
+--------------+----------+----------------+-----------------+--------------+
| Trial name | status | loc | learning_rate | batch_size |
|--------------+----------+----------------+-----------------+--------------|
| run_cf921dd8 | RUNNING | 172.17.0.3:402 | 0.0374603 | 64 |
| run_d20c6f50 | RUNNING | 172.17.0.3:437 | 0.0950719 | 64 |
| run_d20e37cc | PENDING | | 0.0732021 | 64 |
+--------------+----------+----------------+-----------------+--------------+
I am not sure I completely understand what Ray is accumulating here or how to avoid this accumulation.
I have found a few similar issues (this one and this one, for instance) but so far, setting
ray.init(object_store_memory = 10**9)
did not help.
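For context, this is roughly where that call sits in my driver script (a minimal sketch only; as far as I understand, if ray.init() is not called explicitly, tune.run() initialises Ray with its defaults instead):

import ray

# Cap the plasma object store at roughly 1 GB before Tune starts
ray.init(object_store_memory=10**9)

# ... the Tune experiment (grid_search(config) below) is launched after this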
The code I am using (copied below) is taken almost verbatim from the documentation. I am basically using Bayesian optimization to sample the hyperparameters in a smarter way, and an ASHA scheduler to stop trials early if they are not promising enough.
from ray import tune
from ray.tune import CLIReporter
from ray.tune.schedulers import ASHAScheduler
from ray.tune.suggest.bayesopt import BayesOptSearch


def grid_search(config):
    # For stopping non-promising trials early
    scheduler = ASHAScheduler(
        max_t=5,
        grace_period=1,
        reduction_factor=2)

    # Bayesian optimisation to sample hyperparameters in a smarter way
    algo = BayesOptSearch(random_search_steps=4, mode="min")

    reporter = CLIReporter(
        parameter_columns=["learning_rate", "batch_size"],
        metric_columns=["loss", "mean_accuracy", "training_iteration"])

    resources_per_trial = {"cpu": config["n_cpu_per_trials"],
                           "gpu": config["n_gpu_per_trials"]}

    # `run` is the training function defined elsewhere in the script
    trainable = tune.with_parameters(run)

    analysis = tune.run(trainable,
                        resources_per_trial=resources_per_trial,
                        metric="loss",
                        mode="min",
                        config=config,
                        num_samples=config["n_sampling"],  # number of samples from the hyperparameter space
                        scheduler=scheduler,
                        progress_reporter=reporter,
                        name=config["name_experiment"],
                        local_dir="/app/.",
                        search_alg=algo)

    print("Best hyperparameters found were: ", analysis.best_config)
I would really appreciate it if any of you who have managed to solve this issue could share how.