Focal loss for NLP/text data in PyTorch - improving results

Posted on 2025-01-16 07:57:33

I have an NLP/text classification problem with a very skewed class distribution: class 0 - 98%, class 1 - 2%.
For my training and validation data I oversample, which gives a class distribution of class 0 - 55%, class 1 - 45%.
The test data keeps the skewed distribution.

I built a model using nn.BCEWithLogitsLoss(pos_weight=tensor(1.2579, device='cuda:0')). pos_weight was calculated from the 55/45 class distribution in the training data.
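For reference, pos_weight in nn.BCEWithLogitsLoss weights the positive class in the loss and is typically set to the negative-to-positive count ratio of the training data. A minimal sketch, where the class counts are placeholders rather than the actual counts from the post:

import torch
import torch.nn as nn

# Placeholder counts reflecting the oversampled 55% / 45% training split
num_neg, num_pos = 55_000, 45_000

pos_weight = torch.tensor([num_neg / num_pos])          # ~1.22; move to 'cuda:0' if training on GPU
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

# logits: raw outputs of the final linear layer; targets: float labels in {0, 1}
# loss = criterion(logits, targets)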

On class 1 of the test data I got an F1 score of 0.07, with
true negatives, false positives, false negatives, true positives = (28809, 13258, 537, 495)
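For what it's worth, that F1 value follows directly from the reported counts: recall is moderate but precision is very low, which is what pulls F1 down to about 0.07.

tn, fp, fn, tp = 28809, 13258, 537, 495

precision = tp / (tp + fp)                           # ~0.036
recall = tp / (tp + fn)                              # ~0.48
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))                                  # 0.07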

I switched to focal loss using the code below, but performance did not improve much. F1 on class 1 of the test data is still the same, with
true negatives, false positives, false negatives, true positives = (32527, 9540, 640, 392)

kornia.losses.binary_focal_loss_with_logits(probssss, labelsss, alpha=0.25, gamma=2.0, reduction='mean')
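One thing worth double-checking (the variable name probssss above suggests probabilities may be passed in): the _with_logits variant applies the sigmoid internally, so it should receive the raw outputs of the final linear layer. A rough sketch of the call on dummy data, with the tensor names and shapes being assumptions:

import torch
import kornia

# Dummy stand-ins for a real batch: logits are raw model outputs, labels are float {0, 1} targets
logits = torch.randn(8, 1, requires_grad=True)
labels = torch.randint(0, 2, (8, 1)).float()

loss = kornia.losses.binary_focal_loss_with_logits(
    logits, labels, alpha=0.25, gamma=2.0, reduction='mean'
)
loss.backward()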

  1. Are my alpha and gamma parameters wrong? Are there specific values I should try? I could tune them myself, but that might take a lot of time and resources, so I am looking for recommendations.
  2. For nn.BCEWithLogitsLoss(pos_weight=tensor(1.2579, device='cuda:0')), should I use a different value for pos_weight? Keep in mind that my goal is to maximize F1 on class 1 of the test data.

Update:

I am building a CNN on top of GloVe embeddings: I take my texts and look up their GloVe embeddings. I remove all punctuation; apart from that there is no other major data cleaning. I am interested in tuning the focal loss parameters, alpha and gamma.
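As a side note, the pretrained_embedding tensor consumed by the model below would typically be built by stacking the GloVe vectors for the training vocabulary. A rough sketch, where the file name, special tokens, and dimension are assumptions:

import numpy as np
import torch

embed_dim = 300                               # assumed GloVe dimension
word2idx = {"<pad>": 0, "<unk>": 1}           # index 0 for padding, 1 for unknown words
vectors = [np.zeros(embed_dim, dtype=np.float32),
           np.random.normal(scale=0.1, size=embed_dim).astype(np.float32)]

# File name is an assumption; any GloVe text file with one "word v1 v2 ..." line per word works
with open("glove.6B.300d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        word2idx[parts[0]] = len(vectors)
        vectors.append(np.asarray(parts[1:], dtype=np.float32))

pretrained_embedding = torch.tensor(np.vstack(vectors))   # passed to CNN(...) below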

My model is as below

import numpy as np
import torch
import torch.nn as nn


class CNN(nn.Module):

    def __init__(self,
                 pretrained_embedding,
                 embed_dim,
                 filter_sizes,
                 num_filters,
                 fc1_neurons,
                 fc2_neurons,
                 dropout):

        super(CNN, self).__init__()

        # Embedding layer initialized from the pretrained GloVe matrix (kept frozen)
        self.vocab_size, self.embed_dim = pretrained_embedding.shape
        self.embedding = nn.Embedding.from_pretrained(pretrained_embedding,
                                                      freeze=True)

        # Conv network: one Conv1d per filter size over the embedding channels
        self.conv1d_list = nn.ModuleList([
            nn.Conv1d(in_channels=self.embed_dim,
                      out_channels=num_filters[i],
                      kernel_size=filter_sizes[i])
            for i in range(len(filter_sizes))
        ])

        # Batchnorm
        self.batch_norm1 = nn.BatchNorm1d(num_filters[0] * len(filter_sizes))

        # Dropout layer
        self.dropout = nn.Dropout(p=dropout)

        # ReLU activation function
        self.relu = nn.ReLU()

        # Fully-connected layers
        # self.fc1 = nn.Linear(np.sum(num_filters), fc1_neurons)

        # num_filters is a list, so BatchNorm1d needs the total feature count, not the list itself
        self.batch_norm2 = nn.BatchNorm1d(int(np.sum(num_filters)))

        self.fc2 = nn.Linear(np.sum(num_filters), fc2_neurons)

        self.batch_norm3 = nn.BatchNorm1d(fc2_neurons)

        self.fc3 = nn.Linear(fc2_neurons, 1)
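The posted snippet stops at __init__. A minimal forward pass consistent with these layers might look like the sketch below; the wiring is an assumption, not the original author's code, and batch_norm1 and the commented-out fc1 are left unused:

    def forward(self, input_ids):
        # input_ids: (batch, seq_len) token indices
        x = self.embedding(input_ids).permute(0, 2, 1)           # (batch, embed_dim, seq_len)

        # Convolve with each filter size, then max-pool over the time dimension
        conv_outs = [self.relu(conv(x)) for conv in self.conv1d_list]
        pooled = [torch.max(out, dim=2).values for out in conv_outs]

        x = torch.cat(pooled, dim=1)                             # (batch, sum(num_filters))
        x = self.dropout(self.batch_norm2(x))
        x = self.relu(self.fc2(x))
        x = self.batch_norm3(x)
        return self.fc3(x)                                       # raw logit, shape (batch, 1)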


Comments (3)

哽咽笑 2025-01-23 07:57:33

I think more important than these parameter values is how the features are trained. Do you train the NLP model from scratch, i.e. only on your own texts, or do you use a (partially) pretrained model? I suggest the latter, given your sample sizes.

爱给你人给你 2025-01-23 07:57:33

I think you should try an LSTM-, GRU-, or transformer-based approach.
I would recommend transformer models such as BERT, DistilBERT, RoBERTa, etc. You can train from scratch or fine-tune a pretrained model. It will give you a better F1 score than a CNN-based approach.

Also, you can try adding class weights. It might help to improve the accuracy.
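A rough sketch of the kind of fine-tuning suggested here, using Hugging Face transformers; the model name, class-weight values, and example inputs are placeholders, not something from the original answer:

import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased",
                                                           num_labels=2)

# Class weights for the 98% / 2% split (inverse-frequency style); tune as needed
criterion = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 49.0]))

texts = ["an example document", "another example document"]   # placeholder inputs
labels = torch.tensor([0, 1])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
logits = model(**batch).logits                                 # shape (batch, 2)
loss = criterion(logits, labels)
loss.backward()                                                # then step an optimizer as usual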

你げ笑在眉眼 2025-01-23 07:57:33

I encourage you to check sklearn.utils.class_weight.compute_sample_weight to calculate per-sample weights and sklearn.utils.class_weight.compute_class_weight for individual class weights.
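For the class-weight side of that suggestion, a small sketch; the label array here is a placeholder with the 98/2 split from the question:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 98 + [1] * 2)                 # placeholder labels with the 98% / 2% split
class_weights = compute_class_weight(class_weight='balanced',
                                     classes=np.unique(y), y=y)
print(class_weights)                             # ~[0.51, 25.0], one weight per class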

Since you have a 2% / 98% class repartition, focal loss is definitely a good call! I think your params are OK.

Next, have you tried using a sampler in your dataloader? I think torch.utils.data.WeightedRandomSampler is what you need.

Here is a little example:

import torch
from torch.utils.data import WeightedRandomSampler, DataLoader
from sklearn.utils.class_weight import compute_sample_weight

torch.manual_seed(0)

# 'balanced' sample weights are inversely proportional to class frequency
weights = compute_sample_weight('balanced', dataset.classes)
sampler = WeightedRandomSampler(weights, len(weights))
loader = DataLoader(dataset, sampler=sampler)

# Count how many samples of each class the sampler actually yields
counts_train = [0 for _ in range(2)]
for x, y in loader:
    counts_train[y] += 1

print(counts_train)
# [499, 501]

I have generated a dataset of 1000 examples with your repartition; dataset.classes is an array of size 1000 containing all the labels.

I would keep the same label repartition in all my subsets. Doing so will increase model stability.

Don't hesitate to use data augmentation, but only on your training set. It will increase the number of examples and make the model more robust.
You can check this repo.

Hope it helps!
