How to handle an imbalanced multi-label dataset?

I am currently trying to train an image classification model using a PyTorch DenseNet121 with 4 labels (A, B, C, D). I have 224000 images, and each image is labeled in the form [1, 0, 0, 1] (labels A and D are present in the image). I have replaced the last dense layer of DenseNet121. The model is trained with the Adam optimizer, an LR of 0.0001 (decayed by a factor of 10 per epoch), for 4 epochs. I will try more epochs once I am confident that the class imbalance issue is resolved.
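
For reference, a minimal sketch of this setup (assuming pretrained weights and using StepLR as one way to get the 10x per-epoch decay):

import torch
import torch.nn as nn
from torchvision import models

# DenseNet121 backbone with the final classifier replaced by a 4-output layer (labels A-D)
model = models.densenet121(pretrained=True)  # pretrained weights assumed
model.classifier = nn.Linear(model.classifier.in_features, 4)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# One way to decay the learning rate by a factor of 10 after every epoch
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.1)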

The estimated numbers of positive samples are [19000, 65000, 38000, 105000] for A, B, C and D respectively. When I trained the model without class balancing or weights (with BCELoss), I got very low recall for labels A and C (in fact, the true positives (TP) and false positives (FP) were both less than 20).

I have tried 3 approaches to deal with the class imbalance after an extensive search on Google and Stack Overflow.

Approach 1: Class weights
I have tried to implement class weights by using the ratio of negative samples to positive samples.

y = train_df[CLASSES]

# Weight for the positive class of each label = (# negative samples) / (# positive samples)
pos_weight = (y == 0).sum() / (y == 1).sum()

pos_weight = torch.tensor(pos_weight.values, dtype=torch.float)
if torch.cuda.is_available():
    pos_weight = pos_weight.cuda()
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

The resultant class weights are [10.79, 2.45, 4.90, 1.13]. I am getting the opposite effect: too many positive predictions, which results in low precision.
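
As a quick sanity check, these values roughly match the negative-to-positive ratios implied by the counts above (assuming 224000 images in total; the positive counts are estimates):

(224000 - 19000) / 19000   ≈ 10.79
(224000 - 65000) / 65000   ≈ 2.45
(224000 - 38000) / 38000   ≈ 4.89
(224000 - 105000) / 105000 ≈ 1.13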

Approach 2: Changing logic for class weights

I have also tried to get class weights by taking each class's proportion of the positive samples in the dataset and inverting it. The resultant class weights are [11.95, 3.49, 5.97, 2.16]. I am still getting too many positive predictions.

# Count of 0s and 1s per label column
class_dist = y.apply(pd.Series.value_counts)

# Each label's share of all positive labels, then invert to get its weight
class_dist_norm = class_dist.loc[1.0] / class_dist.loc[1.0].sum()
pos_weight = 1 / class_dist_norm
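
Again as a sanity check, these weights are roughly the inverse of each class's share of the 227000 (= 19000 + 65000 + 38000 + 105000) positive labels:

227000 / 19000  ≈ 11.95
227000 / 65000  ≈ 3.49
227000 / 38000  ≈ 5.97
227000 / 105000 ≈ 2.16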

Approach 3: Focal Loss

I have also tried focal loss with the following implementation (but I am still getting too many positive predictions). I used the class weights for the alpha parameter. The implementation is adapted from https://gist.github.com/f1recracker/0f564fd48f15a58f4b92b3eb3879149b, with some modifications to better suit my use case.

class FocalLoss(nn.CrossEntropyLoss):
    ''' Focal loss for classification tasks on imbalanced datasets '''

    def __init__(self, alpha=None, gamma=1.5, ignore_index=-100, reduction='mean', epsilon=1e-6):
        super().__init__(weight=alpha, ignore_index=ignore_index, reduction='mean')
        self.reduction = reduction
        self.gamma = gamma
        self.epsilon = epsilon
        self.alpha = alpha

    def forward(self, input_, target):
        # cross_entropy = super().forward(input_, target)
        # Temporarily mask out ignore index to '0' for valid gather-indices input.
        # This won't contribute final loss as the cross_entropy contribution
        # for these would be zero.
        target = target * (target != self.ignore_index).long()

        # p_t = p if target = 1, p_t = (1-p) if target = 0, where p is the probability of predicting target = 1

        p_t = input_ * target + (1 - input_) * (1 - target)

        # Loss = -alpha * (1 - p_t)^gamma * log(p_t), where -log(p_t) is the cross entropy,
        # i.e. loss = alpha * (1 - p_t)^gamma * cross_entropy.
        # Epsilon is added to prevent log(0) when the predicted class probability is 0.
        if self.alpha is not None:
            loss = -1 * self.alpha * torch.pow(1 - p_t, self.gamma) * torch.log(p_t + self.epsilon)
        else:
            loss = -1 * torch.pow(1 - p_t, self.gamma) * torch.log(p_t + self.epsilon)

        if self.reduction == 'mean':
            return torch.mean(loss)
        elif self.reduction == 'sum':
            return torch.sum(loss)
        else:
            return loss
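
For completeness, a minimal sketch of how this loss is wired into the training loop (assuming a CUDA device, that the model outputs raw logits which are passed through a sigmoid since p_t is computed from probabilities, and that alpha is built from the Approach 2 class weights):

alpha = torch.tensor(pos_weight.values, dtype=torch.float).cuda()  # class weights from Approach 2
criterion = FocalLoss(alpha=alpha, gamma=1.5)

model = model.cuda()
for images, targets in train_loader:
    images, targets = images.cuda(), targets.float().cuda()
    logits = model(images)             # shape: (batch_size, 4)
    probs = torch.sigmoid(logits)      # the loss expects probabilities, not logits
    loss = criterion(probs, targets)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()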

One thing to note is that the loss was stagnant after the first epoch, but the metrics still varied between epochs.

I have considered undersampling and oversampling, but I am unsure how to proceed since each image can have more than 1 label. One possible method is to oversample images with only 1 label by replicating them, but I am concerned that the model would then generalize only to images with 1 label and perform poorly on images with multiple labels.

Therefore I would like to ask whether there are methods I should try, or whether I made any mistakes in my approaches.

Any advice will be greatly appreciated.

Thank you!


Comments (1)

深者入戏 2025-02-15 12:48:37


Try torch.utils.data.WeightedRandomSampler. It assigns a weight to each sample and balances the data at the batch level.
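
For example, a rough sketch for the multi-label case (reusing train_df, CLASSES and train_dataset from the question; weighting each sample by the inverse frequency of its rarest positive label is just one possible heuristic):

import numpy as np
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

y = train_df[CLASSES].values.astype(np.float64)   # (num_samples, 4) multi-hot labels

class_freq = y.sum(axis=0) / len(y)               # fraction of images containing each label
inv_freq = 1.0 / class_freq

# Per-sample weight = inverse frequency of the rarest label present in the image,
# so images containing A or C are drawn more often (assumes every image has >= 1 label).
sample_weights = (y * inv_freq).max(axis=1)

sampler = WeightedRandomSampler(
    weights=torch.as_tensor(sample_weights, dtype=torch.double),
    num_samples=len(sample_weights),
    replacement=True,
)

# The sampler replaces shuffle=True in the DataLoader.
train_loader = DataLoader(train_dataset, batch_size=32, sampler=sampler)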
