Does the Hugging Face Trainer's resume_from_checkpoint work?

Posted 2025-02-08 09:46:02 · 1,521 words · 2 views · 0 comments


I currently have my trainer set up as:

training_args = TrainingArguments(
    output_dir=f"./results_{model_checkpoint}",
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=2,
    weight_decay=0.01,
    push_to_hub=True,
    save_total_limit=1,
    resume_from_checkpoint=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_qa["train"],
    eval_dataset=tokenized_qa["validation"],
    tokenizer=tokenizer,
    data_collator=DataCollatorForMultipleChoice(tokenizer=tokenizer),
    compute_metrics=compute_metrics
)

After training, in my output_dir I have several files that the trainer saved:

['README.md',
 'tokenizer.json',
 'training_args.bin',
 '.git',
 '.gitignore',
 'vocab.txt',
 'config.json',
 'checkpoint-5000',
 'pytorch_model.bin',
 'tokenizer_config.json',
 'special_tokens_map.json',
 '.gitattributes']

From the documentation it seems that resume_from_checkpoint will continue training the model from the last checkpoint:

resume_from_checkpoint (str or bool, optional) — If a str, local path to a saved checkpoint as saved by a previous instance of Trainer. If a bool and equals True, load the last checkpoint in args.output_dir as saved by a previous instance of Trainer. If present, training will resume from the model/optimizer/scheduler states loaded here.

But when I call trainer.train() it seems to delete the last checkpoint and start a new one:

Saving model checkpoint to ./results_distilbert-base-uncased/checkpoint-500
...
Deleting older checkpoint [results_distilbert-base-uncased/checkpoint-5000] due to args.save_total_limit

Does it really continue training from the last checkpoint (i.e., 5000) and just starts the count of the new checkpoint at 0 (saves the first after 500 steps -- "checkpoint-500"), or does it simply not continue the training? I haven't found a way to test it and the documentation is not clear on that.


Comments (3)

嘦怹 2025-02-15 09:46:02


Yes, it works! When you call trainer.train() with no arguments, you're implicitly telling it to ignore the saved checkpoints and start from scratch. Setting resume_from_checkpoint=True in TrainingArguments is not enough: you should call trainer.train(resume_from_checkpoint=True), or pass resume_from_checkpoint a string pointing to the checkpoint path.
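As an illustration, here is a minimal standalone sketch of how the Trainer locates the newest checkpoint in output_dir when you pass resume_from_checkpoint=True (this mirrors the behavior of transformers.trainer_utils.get_last_checkpoint, re-implemented here for clarity rather than copied from the library):

```python
import os
import re

# Checkpoint directories are named "checkpoint-<global_step>".
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")

def get_last_checkpoint(output_dir: str):
    """Return the path of the checkpoint with the highest global step,
    or None if output_dir contains no checkpoint directories."""
    checkpoints = []
    for name in os.listdir(output_dir):
        match = _CHECKPOINT_RE.match(name)
        if match and os.path.isdir(os.path.join(output_dir, name)):
            checkpoints.append((int(match.group(1)), name))
    if not checkpoints:
        return None
    # Pick the numerically largest step, e.g. checkpoint-5000 over checkpoint-500.
    _, latest = max(checkpoints)
    return os.path.join(output_dir, latest)
```

With trainer.train(resume_from_checkpoint=True), the Trainer performs an equivalent lookup and then restores the model, optimizer, and scheduler states from that directory.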

无戏配角 2025-02-15 09:46:02


Looking at the code, it first loads the checkpoint state, updates how many epochs and steps have already been run, and continues training from there up to the total number of epochs configured for the job (there is no reset to 0).

To see it continue training, increase num_train_epochs before calling trainer.train(resume_from_checkpoint=...) on your checkpoint.

夜访吸血鬼 2025-02-15 09:46:02


You should also pass the resume_from_checkpoint parameter to trainer.train() with the path to the checkpoint:

trainer.train(resume_from_checkpoint="<path-where-checkpoint-were_stored>/checkpoint-0000")

where 0000 is an example checkpoint number.

Don't forget to keep your drive mounted for the whole process.
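Before passing an explicit path, it can help to sanity-check that the directory actually looks like a Trainer checkpoint. A minimal sketch (the trainer_state.json file name is based on what recent transformers versions save inside each checkpoint directory; older versions may differ):

```python
import os

def looks_like_checkpoint(path: str) -> bool:
    """Heuristic check that `path` is a Trainer checkpoint directory:
    it must exist and contain the trainer state file that resuming reads."""
    return (os.path.isdir(path)
            and os.path.isfile(os.path.join(path, "trainer_state.json")))

# Example usage with a hypothetical path:
# ckpt = "/content/drive/MyDrive/results/checkpoint-5000"
# if looks_like_checkpoint(ckpt):
#     trainer.train(resume_from_checkpoint=ckpt)
```

A check like this catches the common failure mode where the drive was unmounted (or the path mistyped) and the resume silently falls back to training from scratch.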
