Does the Hugging Face Trainer's resume_from_checkpoint work?

Posted 2025-02-08 09:46:02 · 1,521 words · 2 views · 0 comments


I currently have my trainer set up as:

training_args = TrainingArguments(
    output_dir=f"./results_{model_checkpoint}",
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=2,
    weight_decay=0.01,
    push_to_hub=True,
    save_total_limit=1,
    resume_from_checkpoint=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_qa["train"],
    eval_dataset=tokenized_qa["validation"],
    tokenizer=tokenizer,
    data_collator=DataCollatorForMultipleChoice(tokenizer=tokenizer),
    compute_metrics=compute_metrics
)

After training, in my output_dir I have several files that the trainer saved:

['README.md',
 'tokenizer.json',
 'training_args.bin',
 '.git',
 '.gitignore',
 'vocab.txt',
 'config.json',
 'checkpoint-5000',
 'pytorch_model.bin',
 'tokenizer_config.json',
 'special_tokens_map.json',
 '.gitattributes']

From the documentation it seems that resume_from_checkpoint will continue training the model from the last checkpoint:

resume_from_checkpoint (str or bool, optional) — If a str, local path to a saved checkpoint as saved by a previous instance of Trainer. If a bool and equals True, load the last checkpoint in args.output_dir as saved by a previous instance of Trainer. If present, training will resume from the model/optimizer/scheduler states loaded here.

But when I call trainer.train() it seems to delete the last checkpoint and start a new one:

Saving model checkpoint to ./results_distilbert-base-uncased/checkpoint-500
...
Deleting older checkpoint [results_distilbert-base-uncased/checkpoint-5000] due to args.save_total_limit

Does it really continue training from the last checkpoint (i.e., 5000) and just starts the count of the new checkpoint at 0 (saves the first after 500 steps -- "checkpoint-500"), or does it simply not continue the training? I haven't found a way to test it and the documentation is not clear on that.


Comments (3)

嘦怹 2025-02-15 09:46:02


Yes, it works! When you call trainer.train() with no arguments, you're implicitly telling it to ignore the saved checkpoints and start from scratch. Setting resume_from_checkpoint=True in TrainingArguments is not enough: you should call trainer.train(resume_from_checkpoint=True), or pass resume_from_checkpoint a string pointing to the checkpoint path.
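As an illustration, here is a minimal standalone sketch of how the Trainer locates the newest checkpoint in output_dir when you pass resume_from_checkpoint=True (this mirrors the behavior of transformers.trainer_utils.get_last_checkpoint, re-implemented here for clarity rather than copied from the library):

```python
import os
import re

# Checkpoint directories are named "checkpoint-<global_step>".
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")

def get_last_checkpoint(output_dir: str):
    """Return the path of the checkpoint with the highest global step,
    or None if output_dir contains no checkpoint directories."""
    checkpoints = []
    for name in os.listdir(output_dir):
        match = _CHECKPOINT_RE.match(name)
        if match and os.path.isdir(os.path.join(output_dir, name)):
            checkpoints.append((int(match.group(1)), name))
    if not checkpoints:
        return None
    # Pick the numerically largest step, e.g. checkpoint-5000 over checkpoint-500.
    _, latest = max(checkpoints)
    return os.path.join(output_dir, latest)
```

With trainer.train(resume_from_checkpoint=True), the Trainer performs an equivalent lookup and then restores the model, optimizer, and scheduler states from that directory.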

无戏配角 2025-02-15 09:46:02


Looking at the code, it first loads the checkpoint state, updates how many epochs and steps have already been run, and continues training from there up to the total number of epochs configured for the job (there is no reset to 0).

To see it continue training, increase num_train_epochs before calling trainer.train(resume_from_checkpoint=...) on your checkpoint.

夜访吸血鬼 2025-02-15 09:46:02


You should also pass the resume_from_checkpoint parameter to trainer.train() with the path to the checkpoint:

trainer.train(resume_from_checkpoint="<path-where-checkpoint-were_stored>/checkpoint-0000")

where 0000 is an example checkpoint number.

Don't forget to keep your drive mounted for the whole process.
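Before passing an explicit path, it can help to sanity-check that the directory actually looks like a Trainer checkpoint. A minimal sketch (the trainer_state.json file name is based on what recent transformers versions save inside each checkpoint directory; older versions may differ):

```python
import os

def looks_like_checkpoint(path: str) -> bool:
    """Heuristic check that `path` is a Trainer checkpoint directory:
    it must exist and contain the trainer state file that resuming reads."""
    return (os.path.isdir(path)
            and os.path.isfile(os.path.join(path, "trainer_state.json")))

# Example usage with a hypothetical path:
# ckpt = "/content/drive/MyDrive/results/checkpoint-5000"
# if looks_like_checkpoint(ckpt):
#     trainer.train(resume_from_checkpoint=ckpt)
```

A check like this catches the common failure mode where the drive was unmounted (or the path mistyped) and the resume silently falls back to training from scratch.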
