为什么验证损失是恒定的?

发布于 2025-01-13 06:04:54 字数 3223 浏览 0 评论 0原文

我正在尝试在我制作的自定义数据集上使用 Aladdin Persson 的unet 模型。问题是“在训练期间,训练损失不断减少,而验证损失却保持不变”。我就是不明白问题出在哪里。我的训练集中有 368 张图片,验证集中有 51 张图片。 [橙色是验证损失,蓝色是训练][1] 我还发布了我的训练代码以及检查验证集准确性的部分。

这部分是train_fn。

for batch_idx, (data, targets) in enumerate(loop):
    #img = data.cpu().squeeze(0).permute(1,2,0).numpy()
    #plt.imshow(img)
    data = data.to(device=DEVICE)
    targets = targets.float().unsqueeze(1).to(device=DEVICE)
    
    # forward
    with torch.cuda.amp.autocast():
        predictions = model(data)
        loss = loss_fn(predictions, targets)
    # backward
    optimizer.zero_grad()
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    
    running_loss += loss.item()

    # update tqdm loop
    loop.set_postfix(loss=loss.item())
    
train_loss = running_loss/len(loader)
train_losses.append(train_loss)

epochs.append(epoch)
scheduler.step()

还有培训部分

for epoch in range(1,NUM_EPOCHS):
    train_fn(train_loader, model, optimizer, loss_fn, scaler, epoch, scheduler)

    #save model
    checkpoint = {
        "state_dict": model.state_dict(),
        "optimizer":optimizer.state_dict(),
    }
    save_checkpoint(checkpoint)

    # check accuracy
    val_loss = check_accuracy(epoch, val_loader, model, loss_fn, device=DEVICE)
    val_losses.append(val_loss)
    # print some examples to a folder
    save_predictions_as_imgs(
        val_loader, model, folder="saved_images/", device=DEVICE
    )

    plt.plot(epochs, train_losses)
    plt.plot(epochs, val_losses)
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.title('Loss function')
    plt.show()
    

和 check_accuracy

def check_accuracy(epoch ,loader, model, loss_fn, device="cuda"):
try:
    val_losses
except NameError:
    val_losses = []
num_correct = 0
num_pixels = 0
dice_score = 0
running_loss = 0
idx = 1
model.eval()

with torch.no_grad():
    for x, y in loader:
        # if idx <= 10:
        #     grid_data = make_grid(x)
        #     grid_mask = make_grid(y)
        #     f, axarr_val = plt.subplots(2,1)
        #     plt.title('Validation transform')
        #     axarr_val[0].imshow(grid_data.permute(1,2,0).numpy())
        #     axarr_val[1].imshow(grid_mask.permute(1,2,0).numpy())
        #     plt.savefig("transformacije/validation/fig" + str(epoch+1) + str(idx) + ".png")
        #     plt.close(f)
        #     idx = idx+1
        x = x.to(device)
        y = y.to(device).unsqueeze(1)
        preds = torch.sigmoid(model(x))
        preds = (preds > 0.5).float()
        num_correct += (preds == y).sum()
        num_pixels += torch.numel(preds)
        dice_score += (2 * (preds * y).sum()) / (
            (preds + y).sum() + 1e-8
        )
        loss = loss_fn(preds, y)
        running_loss += loss.item()
    val_loss = running_loss/len(loader)
print(
    f"Got {num_correct}/{num_pixels} with acc {num_correct/num_pixels*100:.2f}"
)
print(f"Dice score: {dice_score/len(loader)}")
print(f"Validation Loss: {val_loss}")
model.train()
return val_loss

如果您能提供帮助,我将不胜感激。谢谢。 [1]: https://i.sstatic.net/tRh89.png

I am trying to use the unet model from Aladdin Persson on a custom dataset i made. The problem is 'during the training the training loss is decreasing while the validation loss is constant. And i just can't figure out what the problem is. I have 368 pictures in the training set and 51 in the validation set.
[Orange is validation loss and blue training][1]
I am also posting my training code and the part where i check the accuracy on the validation set.

This part is the train_fn.

for batch_idx, (data, targets) in enumerate(loop):
    #img = data.cpu().squeeze(0).permute(1,2,0).numpy()
    #plt.imshow(img)
    data = data.to(device=DEVICE)
    targets = targets.float().unsqueeze(1).to(device=DEVICE)
    
    # forward
    with torch.cuda.amp.autocast():
        predictions = model(data)
        loss = loss_fn(predictions, targets)
    # backward
    optimizer.zero_grad()
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    
    running_loss += loss.item()

    # update tqdm loop
    loop.set_postfix(loss=loss.item())
    
train_loss = running_loss/len(loader)
train_losses.append(train_loss)

epochs.append(epoch)
scheduler.step()

And the training part

for epoch in range(1,NUM_EPOCHS):
    train_fn(train_loader, model, optimizer, loss_fn, scaler, epoch, scheduler)

    #save model
    checkpoint = {
        "state_dict": model.state_dict(),
        "optimizer":optimizer.state_dict(),
    }
    save_checkpoint(checkpoint)

    # check accuracy
    val_loss = check_accuracy(epoch, val_loader, model, loss_fn, device=DEVICE)
    val_losses.append(val_loss)
    # print some examples to a folder
    save_predictions_as_imgs(
        val_loader, model, folder="saved_images/", device=DEVICE
    )

    plt.plot(epochs, train_losses)
    plt.plot(epochs, val_losses)
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.title('Loss function')
    plt.show()
    

And the check_accuracy

def check_accuracy(epoch ,loader, model, loss_fn, device="cuda"):
try:
    val_losses
except NameError:
    val_losses = []
num_correct = 0
num_pixels = 0
dice_score = 0
running_loss = 0
idx = 1
model.eval()

with torch.no_grad():
    for x, y in loader:
        # if idx <= 10:
        #     grid_data = make_grid(x)
        #     grid_mask = make_grid(y)
        #     f, axarr_val = plt.subplots(2,1)
        #     plt.title('Validation transform')
        #     axarr_val[0].imshow(grid_data.permute(1,2,0).numpy())
        #     axarr_val[1].imshow(grid_mask.permute(1,2,0).numpy())
        #     plt.savefig("transformacije/validation/fig" + str(epoch+1) + str(idx) + ".png")
        #     plt.close(f)
        #     idx = idx+1
        x = x.to(device)
        y = y.to(device).unsqueeze(1)
        preds = torch.sigmoid(model(x))
        preds = (preds > 0.5).float()
        num_correct += (preds == y).sum()
        num_pixels += torch.numel(preds)
        dice_score += (2 * (preds * y).sum()) / (
            (preds + y).sum() + 1e-8
        )
        loss = loss_fn(preds, y)
        running_loss += loss.item()
    val_loss = running_loss/len(loader)
print(
    f"Got {num_correct}/{num_pixels} with acc {num_correct/num_pixels*100:.2f}"
)
print(f"Dice score: {dice_score/len(loader)}")
print(f"Validation Loss: {val_loss}")
model.train()
return val_loss

I would be grateful if you could help anyway possible. Thank you.
[1]: https://i.sstatic.net/tRh89.png

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

不一样的天空 2025-01-20 06:04:54

您的模型可能对数据过度拟合,特别是考虑到数据集的大小与数据的大小相比。您的代码很可能没有任何问题,但您应该使用较小的模型或增加数据集的大小,或者更有可能两者兼而有之。

编辑:在论文中,作者使用了大量的数据增强,通过提高输入的方差(有效地从原始 30 个带注释的图像创建更大的数据集)来使他们的模型更加稳健。据我所知,您目前在训练循环中没有使用任何增强功能。

It is likely that your model is overfitting to the data, especially given the size of your dataset compared to the size of your data. There is most likely nothing wrong with your code, but instead you should be using either a smaller model or increase the size of your dataset, or more likely both.

Edit: In the paper, the authors use a lot of data augmentation to make their model more robust by boosting the variance of inputs (effectively creating a larger dataset from the original 30 annotated images). As far as I can see, you currently are not using any augmentations in your training loop.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文