Can't use LogisticRegression for multilabel classification
I'm using LogisticRegression from the sklearn library along with MultiOutputClassifier in order to use LR for multilabel classification. Unfortunately, I'm getting an error when running this code:
res = MultiOutputClassifier(estimator=LogisticRegression()).fit(x_train, y_train)
Error:
ValueError: This solver needs samples of at least 2 classes in the data,
but the data contains only one class: 0
This does not make sense to me because x_train has shape (2210, 2000) while y_train has shape (2210, 58), which means it has 58 classes.
y_train = array([[0, 0, 1, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 1, 0, 1],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0]])
x_train is an array of bag-of-words TF-IDF (term frequency-inverse document frequency) features:
x_train = array([[0. , 0. , 0.02571182, ..., 0. , 0. ,
0. ],
[0. , 0. , 0. , ..., 0. , 0. ,
0. ],
[0. , 0.11247333, 0. , ..., 0.09392727, 0. ,
0. ],
...,
[0. , 0. , 0. , ..., 0. , 0. ,
0. ],
[0. , 0.07308953, 0. , ..., 0.09155637, 0. ,
0. ],
[0. , 0.07492016, 0. , ..., 0. , 0. ,
0. ]])
Comments (1)
Your MultiOutputClassifier is training a new Logistic Regression for each class, in such a way that the i-th logistic regressor is trained on y_train[:, i] (a column vector). The error is raised by Logistic Regression and indicates that one of these column vectors (y_train[:, i]) has only zeroes (i.e., only the class 0). Logistic Regression (and many other binary classifiers) can't be trained on a dataset in which only a single class exists. You can check the class distribution of each class by counting the ones in each column of y_train (for example, with a column-wise sum) and, consequently, check which class has only zeroes.
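As a rough illustration of that check, here is a minimal sketch on a small made-up label matrix (the toy y_train and the names positives_per_label and bad_columns are assumptions for the example, not taken from the question's data). It counts the positives per column and lists the columns that contain a single class:

```python
import numpy as np

# Hypothetical stand-in for y_train: 6 samples, 4 labels.
# Column 1 is all zeroes, mimicking the situation behind the error.
y_train = np.array([
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [1, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
])

# Number of positive samples per label (column-wise sum).
positives_per_label = y_train.sum(axis=0)

# A binary classifier can only be fitted on a column that
# contains both zeroes and ones.
n_samples = y_train.shape[0]
single_class = (positives_per_label == 0) | (positives_per_label == n_samples)
bad_columns = np.flatnonzero(single_class)

print(positives_per_label)  # [2 0 3 2]
print(bad_columns)          # [1]
```

On the real data, any index reported in bad_columns would be a label for which LogisticRegression cannot be fitted.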
Approach 1
In supervised learning, if a class has only negative samples (only zeroes), it is reasonable to only output zero, since there is nothing that can be learned from the positive class.
In scikit-learn, some supervised learning algorithms will always output zero for a class that has only zeroes. Some examples are KNeighborsClassifier and DecisionTreeClassifier.
Or you can just discard the column vectors that have only zeroes before training.
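One possible way to discard such columns is sketched below on made-up data (the toy arrays and the keep mask are assumptions for illustration, and this is not necessarily the code the answer originally showed). Single-class columns are dropped before fitting, and the predictions are then mapped back to the original label layout:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)

# Hypothetical toy data standing in for the TF-IDF features and labels.
x_train = rng.random((40, 5))
y_train = np.zeros((40, 3), dtype=int)
y_train[:20, 0] = 1   # label 0: both classes present
y_train[::2, 2] = 1   # label 2: both classes present
# Label 1 stays all zeroes, which would trigger the ValueError.

# Keep only the label columns that contain both 0 and 1.
positives = y_train.sum(axis=0)
keep = (positives > 0) & (positives < y_train.shape[0])

clf = MultiOutputClassifier(estimator=LogisticRegression())
clf.fit(x_train, y_train[:, keep])

# Rebuild predictions in the original label layout. Dropped columns
# are filled with the only value they ever take in the training data.
pred = np.zeros((x_train.shape[0], y_train.shape[1]), dtype=int)
pred[:, keep] = clf.predict(x_train)
pred[:, ~keep] = y_train[0, ~keep]
```

Filling the dropped columns with their constant training value matches the reasoning of Approach 1: for a label with no positive samples, always predicting zero is the only thing the data supports.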
Approach 2
But if your testing data actually has zeroes and ones for all classes, you may want to regenerate your training data so that every class has both negative and positive samples in it. Search for stratification for multi-label data, such as iterative-stratification.
Approach 3
If obtaining such training data is not possible for some reason, you may want to take a look at one-class classification algorithms or novelty detection. These algorithms can learn from training data that only has "normal" samples, that is, all samples belong to the same class.
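To give a feel for what novelty detection looks like in scikit-learn, here is a minimal sketch using OneClassSVM on made-up two-dimensional data. The data, the choice of OneClassSVM, and the nu value are assumptions for illustration; the answer does not name a specific estimator:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Hypothetical "normal" training samples clustered near the origin.
x_normal = rng.normal(loc=0.0, scale=0.5, size=(200, 2))

# Fit on a single class only. nu is an upper bound on the fraction of
# training samples treated as outliers (value chosen arbitrarily here).
detector = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
detector.fit(x_normal)

# predict returns +1 for samples judged similar to the training data
# and -1 for samples judged novel.
x_new = np.array([[0.1, -0.2], [6.0, 6.0]])
labels = detector.predict(x_new)
print(labels)
```

Note that this answers a different question than multilabel classification: it only says whether a new sample resembles the single class seen during training.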