Memory usage on the node keeps increasing while training a model with Ray Tune

This is the first time I am using Ray Tune to search for the best hyperparameters for a DL model, and I am running into some problems related to memory usage.

The memory usage on the node keeps increasing, which eventually causes the trial runs to error out. Below is what I get while the script is running.

== Status ==
Current time: 2022-06-16 13:27:43 (running for 00:09:14.60)
Memory usage on this node: 26.0/62.8 GiB
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 4.000: None | Iter 2.000: None | Iter 1.000: None
Resources requested: 8.0/8 CPUs, 1.0/1 GPUs, 0.0/51.37 GiB heap, 0.0/0.93 GiB objects (0.0/1.0 accelerator_type:A40)
Result logdir: /app/ray_experiment_0
Number of trials: 3/20 (1 PENDING, 2 RUNNING)
+--------------+----------+----------------+-----------------+--------------+
| Trial name   | status   | loc            |   learning_rate |   batch_size |
|--------------+----------+----------------+-----------------+--------------|
| run_cf921dd8 | RUNNING  | 172.17.0.3:402 |       0.0374603 |           64 |
| run_d20c6f50 | RUNNING  | 172.17.0.3:437 |       0.0950719 |           64 |
| run_d20e37cc | PENDING  |                |       0.0732021 |           64 |
+--------------+----------+----------------+-----------------+--------------+

I am not sure I completely understand what Ray is accumulating here and how to avoid this accumulation.
I have found a few similar issues (this one and this one, for instance), but so far, setting

ray.init(object_store_memory=10**9)

did not help.
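(For reference, a minimal sketch of that attempt, assuming ray.init is called once before launching the experiment with the grid_search helper shown below. Note that this value, roughly 1 GB, only caps the shared Plasma object store, not the per-worker Python heap.)

import ray

# Cap the Plasma object store at roughly 1 GB before any Tune code runs.
# This limits only the shared object store; memory allocated inside each
# trial's training loop is accounted for separately.
ray.init(object_store_memory=10**9)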

The code I am using (copied below) is taken pretty much verbatim from the documentation. I am basically using Bayesian optimization to sample the hyperparameters in a smart way, and an ASHA scheduler to stop trials early if they are not promising enough.

from ray import tune
from ray.tune import CLIReporter
from ray.tune.schedulers import ASHAScheduler
from ray.tune.suggest.bayesopt import BayesOptSearch  # ray.tune.search.bayesopt in Ray >= 2.0

def grid_search(config):

    # For stopping non-promising trials early
    scheduler = ASHAScheduler(
        max_t=5,
        grace_period=1,
        reduction_factor=2)

    # Bayesian optimisation to sample hyperparameters in a smarter way
    algo = BayesOptSearch(random_search_steps=4, mode="min")

    reporter = CLIReporter(
        parameter_columns=["learning_rate", "batch_size"],
        metric_columns=["loss", "mean_accuracy", "training_iteration"])

    resources_per_trial = {"cpu": config["n_cpu_per_trials"], "gpu": config["n_gpu_per_trials"]}

    # `run` is the training function (defined elsewhere) that reports "loss" to Tune
    trainable = tune.with_parameters(run)

    analysis = tune.run(trainable,
        resources_per_trial=resources_per_trial,
        metric="loss",
        mode="min",
        config=config,
        num_samples=config["n_sampling"],  # Number of times to sample from the hyperparameter space
        scheduler=scheduler,
        progress_reporter=reporter,
        name=config["name_experiment"],
        local_dir="/app/.",
        search_alg=algo)

    print("Best hyperparameters found were: ", analysis.best_config)

I would really appreciate it if any of you who have managed to solve this issue could share how.
