Memory usage on the node keeps increasing when training a model with Ray Tune
This is the first time I am using Ray Tune to look for the best hyperparameters for a DL model, and I am running into some problems related to memory usage.
The "Memory usage on this node" value keeps increasing, which eventually causes the trial runs to fail. Below is what I see while the script is running.
== Status ==
Current time: 2022-06-16 13:27:43 (running for 00:09:14.60)
Memory usage on this node: 26.0/62.8 GiB
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 4.000: None | Iter 2.000: None | Iter 1.000: None
Resources requested: 8.0/8 CPUs, 1.0/1 GPUs, 0.0/51.37 GiB heap, 0.0/0.93 GiB objects (0.0/1.0 accelerator_type:A40)
Result logdir: /app/ray_experiment_0
Number of trials: 3/20 (1 PENDING, 2 RUNNING)
+--------------+----------+----------------+-----------------+--------------+
| Trial name | status | loc | learning_rate | batch_size |
|--------------+----------+----------------+-----------------+--------------|
| run_cf921dd8 | RUNNING | 172.17.0.3:402 | 0.0374603 | 64 |
| run_d20c6f50 | RUNNING | 172.17.0.3:437 | 0.0950719 | 64 |
| run_d20e37cc | PENDING | | 0.0732021 | 64 |
+--------------+----------+----------------+-----------------+--------------+
I am not sure I completely understand what Ray is accumulating here or how to avoid this accumulation.
I have found a few similar issues (this one and this one, for instance) but so far, setting
ray.init(object_store_memory = 10**9)
did not help.
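For context, this is roughly where that call sits in my driver script (a minimal sketch only; as far as I understand, if ray.init() is not called explicitly, tune.run() initialises Ray with its defaults instead):

import ray

# Cap the plasma object store at roughly 1 GB before Tune starts
ray.init(object_store_memory=10**9)

# ... the Tune experiment (grid_search(config) below) is launched after this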
The code I am using (copied below) is taken almost verbatim from the documentation. I am basically using Bayesian optimization to sample the hyperparameters in a smarter way, and an ASHA scheduler to stop trials early if they are not promising enough.
from ray import tune
from ray.tune import CLIReporter
from ray.tune.schedulers import ASHAScheduler
from ray.tune.suggest.bayesopt import BayesOptSearch


def grid_search(config):
    # For stopping non-promising trials early
    scheduler = ASHAScheduler(
        max_t=5,
        grace_period=1,
        reduction_factor=2)

    # Bayesian optimisation to sample hyperparameters in a smarter way
    algo = BayesOptSearch(random_search_steps=4, mode="min")

    reporter = CLIReporter(
        parameter_columns=["learning_rate", "batch_size"],
        metric_columns=["loss", "mean_accuracy", "training_iteration"])

    resources_per_trial = {"cpu": config["n_cpu_per_trials"],
                           "gpu": config["n_gpu_per_trials"]}

    # `run` is the training function defined elsewhere in the script
    trainable = tune.with_parameters(run)

    analysis = tune.run(trainable,
                        resources_per_trial=resources_per_trial,
                        metric="loss",
                        mode="min",
                        config=config,
                        num_samples=config["n_sampling"],  # number of samples from the hyperparameter space
                        scheduler=scheduler,
                        progress_reporter=reporter,
                        name=config["name_experiment"],
                        local_dir="/app/.",
                        search_alg=algo)

    print("Best hyperparameters found were: ", analysis.best_config)
I would really appreciate it if any of you who have managed to solve this issue could share how.