Huggingface Transformers classification using num_labels 1 vs 2

Posted on 2025-01-19 13:00:06

Question 1)

An answer to this question suggested that for a binary classification problem I could set num_labels to 1 (positive or not) or 2 (positive and negative). Is there any guideline on which setting is better? It seems that with 1 the probability would be calculated with a sigmoid function, and with 2 the probabilities would be calculated with a softmax function.

Question 2)

In both cases, are my y labels going to be the same? Will each data point have 0 or 1 rather than a one-hot encoding? For example, if I have 2 data points, would y be 0,1 and not [1,0],[0,1]?

I have a very unbalanced classification problem where class 1 is present only 2% of the time. I am oversampling in my training data.

Question 3)

My data is in a pandas DataFrame; I am converting it to a Dataset and creating the y variable as shown below. How should I cast my y column (label) if I plan to use num_labels=1?

`train_dataset=Dataset.from_pandas(train_df).cast_column("label", ClassLabel(num_classes=2, names=['neg', 'pos'], names_file=None, id=None))`

Comments (2)

不爱素颜 2025-01-26 13:00:06

This is probably a bit late, but I want to point out one thing: according to the Hugging Face code, if you set num_labels = 1, it actually triggers regression modeling, and the loss function is set to MSELoss(). You can find the code here.

Also, in their own tutorial, for a binary classification problem (IMDB, positive vs. negative), they set num_labels = 2.

from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

Here is the link.
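The behaviour described in this answer can be sketched in plain Python. This is only an illustration of the default dispatch as I understand it from the library's sequence-classification heads, not the library code itself: `infer_problem_type` and `LOSS_FOR` are hypothetical names, and the exact logic may differ between transformers versions. An explicitly set `problem_type` on the model config is assumed to take precedence over the inference.

```python
from typing import Optional


def infer_problem_type(
    num_labels: int,
    labels_are_integers: bool,
    problem_type: Optional[str] = None,
) -> str:
    """Hypothetical helper mirroring the default dispatch described above."""
    if problem_type is not None:
        # An explicit setting wins over any inference.
        return problem_type
    if num_labels == 1:
        # A single output is treated as a regression target.
        return "regression"
    if labels_are_integers:
        # Several outputs with integer class ids: ordinary classification.
        return "single_label_classification"
    # Several outputs with float targets: one independent sigmoid per label.
    return "multi_label_classification"


# Loss associated with each problem type (names of the PyTorch loss classes).
LOSS_FOR = {
    "regression": "MSELoss",
    "single_label_classification": "CrossEntropyLoss",
    "multi_label_classification": "BCEWithLogitsLoss",
}

print(LOSS_FOR[infer_problem_type(1, labels_are_integers=True)])  # MSELoss
print(LOSS_FOR[infer_problem_type(2, labels_are_integers=True)])  # CrossEntropyLoss
```

Under these assumptions, num_labels = 2 with integer labels 0/1 is the setting that yields a cross-entropy classification loss by default, which matches the tutorial's choice.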

A君 2025-01-26 13:00:06
  1. As answered here, the sigmoid activation function is just a special case of the 2-class softmax activation function: with some weights set to zero, the second output is always zero. So for performance reasons, such as faster updates and fewer parameters, you should use sigmoid.

  2. When your output dimension is one, the encoding simply assigns 0 to one class and 1 to the other. So for 2 data points, your y would be 0,1.

  3. ClassLabel is used to give names to the integer labels that represent classes. To use it, your y column should consist of zeros and ones. You can see in the example below that a ClassLabel column with two values is represented by a single column consisting of 0 and 1.

PyTorch example:

from datasets import Dataset,ClassLabel
import pandas as pd
import torch

train_df = pd.DataFrame({'column':[1,2,3,4,5],'label':[0,1,0,1,0]})
train_dataset=Dataset.from_pandas(train_df).cast_column("label", ClassLabel(num_classes=2, names=['neg', 'pos']))
train_dataset.set_format(type='torch', columns=['column', 'label'])
dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=5)
print(next(iter(dataloader)))

output:

{'column': tensor([1, 2, 3, 4, 5]), 'label': tensor([0, 1, 0, 1, 0])}
  • If your y column consists of neg and pos values, pandas can do the conversion as below:
label_mapping = {'neg':0,'pos':1}
# look up each value x in the mapping (the variable x, not the string 'x')
train_df['label'] = train_df['label'].apply(lambda x:label_mapping[x])
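
The relationship stated in point 1 of this answer can be checked numerically. The sketch below uses only the standard library; `sigmoid` and `softmax2` are helper names introduced here for illustration. It shows that a 2-class softmax whose first logit is fixed at zero gives the same positive-class probability as a sigmoid on the remaining logit, and more generally that the softmax probability depends only on the difference of the two logits.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax2(z0: float, z1: float) -> tuple:
    """Softmax over two logits, shifted by the max for numerical stability."""
    m = max(z0, z1)
    e0, e1 = math.exp(z0 - m), math.exp(z1 - m)
    total = e0 + e1
    return e0 / total, e1 / total


for z in (-3.0, 0.0, 1.5):
    # First logit pinned to zero: softmax reduces to a sigmoid on z.
    _, p_pos = softmax2(0.0, z)
    assert math.isclose(p_pos, sigmoid(z), rel_tol=1e-12)

# General case: only the difference between the two logits matters.
_, p_pos = softmax2(0.7, 2.1)
assert math.isclose(p_pos, sigmoid(2.1 - 0.7), rel_tol=1e-12)
```

This only demonstrates the mathematical equivalence of the two activations. Which loss a given library attaches to a one-output head is a separate matter, as the other answer points out.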