Why is the loss I calculate different from the loss logged by pytorch_lightning?

Posted 2025-01-13 03:39:40

I am training a model and want to create a confusion matrix every time the validation loss improves. So, in validation_epoch_end I check whether the loss of the current epoch is better than any previous one. I realized that sometimes the loss I calculate (the mean of all validation_step losses) is not equal to the loss logged to TensorBoard.

I created a small toy example below (it reproduces the issue with the training loss, but the question applies to validation as well).
Could this just be a rounding error? Is there a better way to get the exact loss in validation_epoch_end?

import os
import torch
from pytorch_lightning import seed_everything
from pytorch_lightning.loggers import TensorBoardLogger
from torch import nn
import torch.nn.functional as F
from torchvision.datasets import MNIST
from torch.utils.data import DataLoader, random_split
from torchvision import transforms
import pytorch_lightning as pl


class LitAutoEncoder(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(28 * 28, 1)
        self.decoder = nn.Linear(1, 28 * 28)

    def forward(self, x):
        embedding = self.encoder(x)
        return embedding

    def training_step(self, batch, batch_idx):
        x, y = batch
        x = x.view(x.size(0), -1)

        x *= 42  # to get more extreme loss values in this example

        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = F.mse_loss(x_hat, x)
        self.log("train_loss", loss, on_step=True, on_epoch=True)
        return {'loss': loss, "x": x, "x_hat": x_hat}

    def training_epoch_end(self, train_step_outputs) -> None:
        train_loss = torch.stack([x['loss'] for x in train_step_outputs]).mean()  # different from logged value?
        x_s = torch.cat([x['x'] for x in train_step_outputs])
        x_hat_s = torch.cat([x['x_hat'] for x in train_step_outputs])

        loss_mean = float(train_loss.detach().cpu())
        loss_calc = F.mse_loss(x_hat_s, x_s)  # value as logged in tensorboard
        loss_calc_float = float(loss_calc.detach().cpu())

        print()
        print(train_loss, loss_calc, train_loss - loss_calc)
        print(loss_mean, loss_calc_float, loss_mean - loss_calc_float)

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        return optimizer


if __name__ == '__main__':
    seed_everything(42)

    dataset = MNIST(os.getcwd(), download=True, transform=transforms.ToTensor())
    train, val = random_split(dataset, [1000, 59000])

    autoencoder = LitAutoEncoder()
    trainer = pl.Trainer(
        max_epochs=42,
        logger=TensorBoardLogger(save_dir='lightning_logs')
    )
    trainer.fit(autoencoder, DataLoader(train, batch_size=256, shuffle=True))
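
A side note on the comparison in training_epoch_end above: with 1000 samples and batch_size=256 the last batch has only 232 samples, so the unweighted mean of per-batch losses and the MSE over the concatenated outputs are not the same quantity to begin with. A minimal, Lightning-independent sketch of that arithmetic (the tensors here are random placeholders, not the model's outputs):

import torch
import torch.nn.functional as F

torch.manual_seed(0)

# 1000 "samples", split like the toy example: 256 + 256 + 256 + 232
x = torch.randn(1000, 28 * 28)
x_hat = torch.randn(1000, 28 * 28)
batches = torch.split(torch.arange(1000), 256)

per_batch = [F.mse_loss(x_hat[idx], x[idx]) for idx in batches]
mean_of_batch_losses = torch.stack(per_batch).mean()  # every batch weighted equally
per_sample_mse = F.mse_loss(x_hat, x)                 # every element weighted equally

# The two numbers differ slightly: the 232-sample batch contributes 1/4 of the
# first value but only 232/1000 of the second.
print(mean_of_batch_losses.item(), per_sample_mse.item())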

Comments (1)

半步萧音过轻尘 2025-01-20 03:39:40


I noticed this strange behaviour as well.

The answer lies in Lightning's code that applies a moving (running) average to the loss shown in the progress bar. See this github issue.

So the recommendation is to treat the progress-bar value as a qualitative indicator only, and to rely on the logged values on the plots, which are the accurate ones, for anything quantitative :)
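
To the second part of the question (getting an exact epoch-level loss in validation_epoch_end rather than reconstructing it afterwards), one option is to accumulate the summed squared error and the element count in each step and divide once at the end, so the result does not depend on Lightning's aggregation or smoothing. Below is a minimal sketch under the same PL 1.x hooks used in the question; the hook bodies and the best_val_loss attribute are illustrative, not part of the original code:

def validation_step(self, batch, batch_idx):
    x, y = batch
    x = x.view(x.size(0), -1)
    x_hat = self.decoder(self.encoder(x))
    # reduction="sum" so that a smaller final batch is weighted correctly later
    sse = F.mse_loss(x_hat, x, reduction="sum")
    return {"sse": sse.detach(), "numel": x.numel()}

def validation_epoch_end(self, outputs):
    total_sse = torch.stack([o["sse"] for o in outputs]).sum()
    total_numel = sum(o["numel"] for o in outputs)
    exact_val_loss = total_sse / total_numel  # exact per-element epoch MSE
    self.log("val_loss_exact", exact_val_loss)
    if exact_val_loss < getattr(self, "best_val_loss", float("inf")):
        self.best_val_loss = exact_val_loss
        # build the confusion matrix here, now that the loss has improved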
