Why is the loss I calculate different from the loss logged by pytorch_lightning?
I am training a model and want to create a confusion matrix every time the validation loss improves. So, in validation_epoch_end I check whether this epoch's loss is better than any previous loss. I realized that sometimes the loss I calculate (the mean of all losses from validation_step) is not equal to the loss logged in TensorBoard.
I created a little toy example below.
Could this just be a rounding error? Is there a better way to know the exact loss in validation_epoch_end?
import os
import torch
from pytorch_lightning import seed_everything
from pytorch_lightning.loggers import TensorBoardLogger
from torch import nn
import torch.nn.functional as F
from torchvision.datasets import MNIST
from torch.utils.data import DataLoader, random_split
from torchvision import transforms
import pytorch_lightning as pl
class LitAutoEncoder(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(28 * 28, 1)
        self.decoder = nn.Linear(1, 28 * 28)

    def forward(self, x):
        embedding = self.encoder(x)
        return embedding

    def training_step(self, batch, batch_idx):
        x, y = batch
        x = x.view(x.size(0), -1)
        x *= 42  # to get more extreme loss values in this example
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = F.mse_loss(x_hat, x)
        self.log("train_loss", loss, on_step=True, on_epoch=True)
        return {'loss': loss, "x": x, "x_hat": x_hat}

    def training_epoch_end(self, train_step_outputs) -> None:
        train_loss = torch.stack([x['loss'] for x in train_step_outputs]).mean()  # different from logged value?
        x_s = torch.cat([x['x'] for x in train_step_outputs])
        x_hat_s = torch.cat([x['x_hat'] for x in train_step_outputs])
        loss_mean = float(train_loss.detach().cpu())
        loss_calc = F.mse_loss(x_hat_s, x_s)  # value as logged in tensorboard
        loss_calc_float = float(loss_calc.detach().cpu())
        print()
        print(train_loss, loss_calc, train_loss - loss_calc)
        print(loss_mean, loss_calc_float, loss_mean - loss_calc_float)

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        return optimizer


if __name__ == '__main__':
    seed_everything(42)
    dataset = MNIST(os.getcwd(), download=True, transform=transforms.ToTensor())
    train, val = random_split(dataset, [1000, 59000])
    autoencoder = LitAutoEncoder()
    trainer = pl.Trainer(
        max_epochs=42,
        logger=TensorBoardLogger(save_dir='lightning_logs')
    )
    trainer.fit(autoencoder, DataLoader(train, batch_size=256, shuffle=True))
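One detail worth noting about the toy example (a hedged side note, not from the original post): torch.stack(losses).mean() weights every batch equally, while F.mse_loss over the concatenated tensors weights every sample equally. With 1000 samples and batch_size=256 the final batch holds only 232 samples, so the two numbers differ slightly even before any logger smoothing. A minimal sketch of a sample-weighted mean that matches the concatenated computation:

    def training_epoch_end(self, train_step_outputs) -> None:
        # Weight each per-batch loss by the number of samples in that batch,
        # so smaller final batches no longer skew the epoch mean.
        sizes = torch.tensor([o['x'].size(0) for o in train_step_outputs],
                             dtype=torch.float32)
        losses = torch.stack([o['loss'] for o in train_step_outputs])
        epoch_loss = (losses * sizes).sum() / sizes.sum()
        # epoch_loss now matches F.mse_loss over the concatenated tensors,
        # up to floating-point rounding.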
Comments (1)
I noticed this strange behaviour as well.
The answer lies in the code that applies a moving average to the loss shown in the progress bar! See this github issue.
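To make the effect concrete (an illustrative sketch of my own, not Lightning's actual internals; the window size is an assumed value chosen so the effect is visible):

    from collections import deque

    # Illustrative only: a windowed running mean like the kind a progress bar
    # applies for display, which is why the shown loss can differ from the
    # true mean over the whole epoch.
    class RunningMean:
        def __init__(self, window=2):  # tiny window for demonstration
            self.values = deque(maxlen=window)

        def update(self, value):
            self.values.append(value)
            return sum(self.values) / len(self.values)

    smoother = RunningMean()
    step_losses = [5.0, 4.0, 3.0, 2.0, 1.0]
    for loss in step_losses:
        shown = smoother.update(loss)
    print(shown)                                # 1.5 -- the smoothed display value
    print(sum(step_losses) / len(step_losses))  # 3.0 -- the true epoch mean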
So the recommendation is to treat the progress-bar value as a qualitative indicator, and to rely on the plotted values for quantitative comparisons, since the logs are where the numbers are actually accurate :)
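For the original use case (only building a confusion matrix when validation loss improves), a hedged alternative to recomputing the mean by hand is to read back the value Lightning itself reduced, via trainer.callback_metrics. This is a sketch under assumptions: the class name ConfusionMatrixOnImprovement and the "val_loss" key are my own (it presumes the module calls self.log("val_loss", loss, on_epoch=True) in validation_step), and whether the reduced epoch value is already present at this hook can vary between Lightning versions, so verify with yours:

    import pytorch_lightning as pl

    class ConfusionMatrixOnImprovement(pl.Callback):
        def __init__(self):
            self.best_loss = float("inf")

        def on_validation_epoch_end(self, trainer, pl_module):
            # callback_metrics holds the metrics Lightning reduced from
            # self.log(...) calls, i.e. the same values that reach TensorBoard.
            val_loss = trainer.callback_metrics.get("val_loss")
            if val_loss is not None and val_loss.item() < self.best_loss:
                self.best_loss = val_loss.item()
                # build and log the confusion matrix here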