How do I make the reported steps be multiples of PyTorch Lightning's logging frequency, rather than the logging frequency minus 1?

Posted on 2025-02-13 18:51:51

[Warning!! pedantry inside]

I'm using PyTorch Lightning to wrap my PyTorch model, but because I'm pedantic, I find it frustrating that the logger reports steps at the frequency I've asked for, minus 1:

  1. When I set log_every_n_steps=100 in Trainer, my TensorBoard output shows my metrics at steps 99, 199, 299, etc. Why not at 100, 200, 300?
  2. When I set check_val_every_n_epoch=30 in Trainer, the progress bar in my console output goes up to epoch 29, then runs validation, leaving a trail of console output that reports metrics after epochs 29, 59, 89, etc. Like this:
Epoch 29: 100%|█████████████████████████████| 449/449 [00:26<00:00, 17.01it/s, loss=0.642, v_num=logs]
[validation] {'roc_auc': 0.663, 'bacc': 0.662, 'f1': 0.568, 'loss': 0.633}
Epoch 59: 100%|█████████████████████████████| 449/449 [00:26<00:00, 16.94it/s, loss=0.626, v_num=logs]
[validation] {'roc_auc': 0.665, 'bacc': 0.652, 'f1': 0.548, 'loss': 0.630}
Epoch 89: 100%|█████████████████████████████| 449/449 [00:27<00:00, 16.29it/s, loss=0.624, v_num=logs]
[validation] {'roc_auc': 0.665, 'bacc': 0.652, 'f1': 0.548, 'loss': 0.627}

Am I doing something wrong? Should I simply submit a PR to PL to fix this?
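
For reference, the setup being described is roughly the following (a sketch only; LitModel, train_loader, and val_loader are placeholder names, not my actual code):

import pytorch_lightning as pl

# Placeholder model and dataloaders standing in for my actual ones
trainer = pl.Trainer(
    max_epochs=90,               # enough epochs to produce the output above
    log_every_n_steps=100,       # metrics land in TensorBoard at steps 99, 199, 299, ...
    check_val_every_n_epoch=30,  # validation is reported after epochs shown as 29, 59, 89
)
trainer.fit(LitModel(), train_loader, val_loader)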

Comments (1)

绮烟 2025-02-20 18:51:51

You are not doing anything wrong. Python uses zero-based indexing, so epoch counting starts at zero as well. If you want to change what is displayed, you will need to override the default TQDMProgressBar and modify on_train_epoch_start to display an offset value. You can achieve this as follows:

from pytorch_lightning.callbacks import TQDMProgressBar
from pytorch_lightning.callbacks.progress.tqdm_progress import convert_inf

class LitProgressBar(TQDMProgressBar):
    def init_validation_tqdm(self):
        bar = super().init_validation_tqdm()
        bar.set_description("running validation...")
        return bar

    def on_train_epoch_start(self, trainer, *_) -> None:
        total_train_batches = self.total_train_batches
        total_val_batches = self.total_val_batches
        if total_train_batches != float("inf") and total_val_batches != float("inf"):
            # val can be checked multiple times per epoch
            val_checks_per_epoch = total_train_batches // trainer.val_check_batch
            total_val_batches = total_val_batches * val_checks_per_epoch
        total_batches = total_train_batches + total_val_batches
        self.main_progress_bar.reset(convert_inf(total_batches))
        # show a 1-based epoch number instead of the default 0-based one
        self.main_progress_bar.set_description(f"Epoch {trainer.current_epoch + 1}")

Notice the +1 in the last line of code. This will offset the epoch displayed in the progress bar. Then pass your custom bar to your trainer:

import torch
from pytorch_lightning import Trainer

# Initialize a trainer with the custom progress bar
trainer = Trainer(
    accelerator="auto",
    devices=1 if torch.cuda.is_available() else None,  # limit devices for notebook runs
    max_epochs=3,
    callbacks=[LitProgressBar()],
    log_every_n_steps=100,
)
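
The mnist_model and train_loader passed to fit() below are not defined in this answer; assume something like the standard Lightning MNIST example, e.g. a single nn.Linear(28 * 28, 10) layer, which is consistent with the 7.9 K parameter summary and the 938 batches per epoch in the output further down. A minimal sketch under that assumption:

import torch
from torch import nn
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
import pytorch_lightning as pl

class MNISTModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.l1 = nn.Linear(28 * 28, 10)  # 7,850 parameters, i.e. the "7.9 K" in the summary

    def forward(self, x):
        return self.l1(x.view(x.size(0), -1))

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self(x), y)
        self.log("loss", loss, prog_bar=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

mnist_model = MNISTModel()
train_loader = DataLoader(
    datasets.MNIST(".", train=True, download=True, transform=transforms.ToTensor()),
    batch_size=64,  # 60,000 / 64 ≈ 938 batches per epoch, as in the progress bar below
)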

Finally:

trainer.fit(mnist_model, train_loader)

For the first epoch this will display:

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs

  | Name | Type   | Params
--------------------------------
0 | l1   | Linear | 7.9 K 
--------------------------------
7.9 K     Trainable params
0         Non-trainable params
7.9 K     Total params
0.031     Total estimated model params size (MB)

Epoch 1: 17%                        160/938 [00:02<00:11, 68.93it/s, loss=1.05, v_num=4]

and not the default

Epoch 0: 17%                        160/938 [00:02<00:11, 68.93it/s, loss=1.05, v_num=4]
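
Note that this only changes the label shown in the progress bar: trainer.current_epoch and the global step used for logging remain zero-based, so with log_every_n_steps=100 the TensorBoard entries will still appear at steps 99, 199, 299, etc., unless you also offset the step you log against.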