Using num_labels 1 vs 2 for Hugging Face Transformers classification
Question 1)
The answer to this question suggested that for a binary classification problem I could use `num_labels` as 1 (positive or not) or 2 (positive and negative). Is there any guideline regarding which setting is better? It seems that if we use 1, the probability would be calculated using the sigmoid function, and if we use 2, the probabilities would be calculated using the softmax function.
Question 2)
In both cases, are my y labels going to be the same? Will each data point have 0 or 1 rather than a one-hot encoding? For example, if I have 2 data points, then y would be 0,1 and not [0,0],[0,1].
I have a very unbalanced classification problem where class 1 is present only 2% of the time. In my training data I am oversampling.
Question 3)
My data is in a pandas dataframe, and I am converting it to a dataset and creating the y variable using the code below. How should I cast my y column, `label`, if I am planning to use `num_labels=1`?
`train_dataset=Dataset.from_pandas(train_df).cast_column("label", ClassLabel(num_classes=2, names=['neg', 'pos'], names_file=None, id=None))`
Comments (2)
Well, this is probably kind of late, but I want to point out one thing: according to the Hugging Face code, if you set `num_labels = 1`, it will actually trigger regression modeling, and the loss function will be set to `MSELoss()`. You can find the code here.
Also, in their own tutorial, for a binary classification problem (IMDB, positive vs. negative), they set num_labels = 2.
Here is the link.
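To illustrate the point about `num_labels = 1`, here is a minimal pure-Python sketch of how the sequence-classification heads in `transformers` appear to pick a loss when `problem_type` is not set in the config. The helper `infer_problem_type` and the `LOSS_BY_PROBLEM_TYPE` table are hypothetical names written for this illustration. It is a paraphrase, not the library's actual code, so check the source of the version you have installed.

```python
# Hypothetical paraphrase of the loss-selection branch; not the library's code.
LOSS_BY_PROBLEM_TYPE = {
    "regression": "MSELoss",
    "single_label_classification": "CrossEntropyLoss",
    "multi_label_classification": "BCEWithLogitsLoss",
}


def infer_problem_type(num_labels, labels_are_integer, problem_type=None):
    """Return the problem type a classification head would settle on."""
    if problem_type is not None:
        # An explicit config value overrides the inference below.
        return problem_type
    if num_labels == 1:
        return "regression"
    if num_labels > 1 and labels_are_integer:
        return "single_label_classification"
    return "multi_label_classification"


for n in (1, 2):
    kind = infer_problem_type(n, labels_are_integer=True)
    print(n, kind, LOSS_BY_PROBLEM_TYPE[kind])
```

Under this reading, `num_labels=2` with integer labels 0/1 gives the usual cross-entropy setup, while `num_labels=1` is treated as regression unless `problem_type` is set explicitly.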
As answered here, the sigmoid activation function is just a special case of the 2-class softmax activation function. With some weights set to zero, the second output is always zero. Thus, for performance reasons like faster updates and fewer parameters, you should use sigmoid.
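The claim that sigmoid is a special case of 2-class softmax can be checked numerically. The sketch below uses only the standard library; `sigmoid` and `softmax2` are helper names introduced for this illustration. With one logit pinned to zero, the softmax probability of the other class equals the sigmoid of its logit, and more generally softmax over `(a, b)` gives `sigmoid(b - a)` for the second class.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax2(a, b):
    """Softmax over two logits, returned as (p_first, p_second)."""
    m = max(a, b)  # subtract the max for numerical stability
    ea, eb = math.exp(a - m), math.exp(b - m)
    total = ea + eb
    return ea / total, eb / total


for z in (-3.0, -0.5, 0.0, 1.2, 4.0):
    # Second logit pinned to zero: first-class probability equals sigmoid(z).
    p_first, _ = softmax2(z, 0.0)
    print(f"z={z:+.1f}  softmax={p_first:.6f}  sigmoid={sigmoid(z):.6f}")
```

The two parameterisations describe the same family of probabilities; the difference in practice lies in which loss function and label format the training code expects.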
When your output dimension is one, one-hot encoding means assigning 0 to one class and 1 to the other. So for 2 data points, your y would be 0,1.
`ClassLabel` is used to give names to integer labels that represent classes. So to use that, your `y` column should consist of zeros and ones. A `ClassLabel` column with two values is represented by one column consisting of `0` and `1`.
If your `y` column consists of `neg` and `pos` values, pandas can map them to `0` and `1`.
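A minimal sketch of converting `neg`/`pos` strings to 0/1 with pandas, assuming a `label` column as in the question. The toy `train_df` and the `label_to_id` dict are made up for illustration.

```python
import pandas as pd

# Toy frame standing in for the question's train_df (contents are made up).
train_df = pd.DataFrame({
    "text": ["bad", "great", "fine", "awful"],
    "label": ["neg", "pos", "pos", "neg"],
})

# Map class names to integer ids, matching the order names=['neg', 'pos'].
label_to_id = {"neg": 0, "pos": 1}
train_df["label"] = train_df["label"].map(label_to_id)

# map() yields NaN for any value missing from the dict, so check before casting.
assert train_df["label"].notna().all(), "unexpected label value"
train_df["label"] = train_df["label"].astype("int64")

print(train_df["label"].tolist())
```

After this step the `label` column holds integers, which is the form the `cast_column(..., ClassLabel(...))` call in the question works with.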