Unable to use LogisticRegression for multilabel classification

Posted on 2025-01-21 06:03:19


I'm using LogisticRegression from the sklearn library along with MultiOutputClassifier in order to use LR for multilabel classification. Unfortunately I'm getting an error when running this code:

res = MultiOutputClassifier(estimator=LogisticRegression()).fit(x_train, y_train)

Error:

ValueError: This solver needs samples of at least 2 classes in the data,
but the data contains only one class: 0

This does not make sense to me because x_train has shape (2210, 2000) while y_train has shape (2210, 58), which means it has 58 classes.

y_train = array([[0, 0, 1, ..., 0, 0, 0],
   [0, 0, 0, ..., 0, 0, 0],
   [0, 0, 0, ..., 0, 0, 0],
   ...,
   [0, 0, 0, ..., 1, 0, 1],
   [0, 0, 0, ..., 0, 0, 0],
   [0, 0, 0, ..., 0, 0, 0]])

x_train is an array of bag-of-words features weighted by TF-IDF (term frequency / inverse document frequency):

x_train = array([[0.        , 0.        , 0.02571182, ..., 0.        , 0.        ,
    0.        ],
   [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
    0.        ],
   [0.        , 0.11247333, 0.        , ..., 0.09392727, 0.        ,
    0.        ],
   ...,
   [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
    0.        ],
   [0.        , 0.07308953, 0.        , ..., 0.09155637, 0.        ,
    0.        ],
   [0.        , 0.07492016, 0.        , ..., 0.        , 0.        ,
    0.        ]])


千紇 2025-01-28 06:03:19


Your MultiOutputClassifier trains a separate Logistic Regression for each class, so that the i-th logistic regressor is trained on y_train[:,i] (a column vector). The error

ValueError: This solver needs samples of at least 2 classes in the data,
but the data contains only one class: 0

is raised by Logistic Regression and indicates that one of these column vectors (y_train[:,i]) contains only zeroes (i.e., only class 0). Logistic Regression (and many other binary classifiers) cannot be trained on a dataset where only a single class is present.
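The per-column behavior can be reproduced directly: fitting one LogisticRegression per label column on a toy dataset (the data below is made up for illustration) fails on exactly the columns that contain a single class, which is what MultiOutputClassifier does internally.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((20, 5))

# Two toy label columns: column 0 has both classes, column 1 is all zeroes.
y = np.zeros((20, 2), dtype=int)
y[:10, 0] = 1

# Mimic MultiOutputClassifier: one binary classifier per label column.
for i in range(y.shape[1]):
    try:
        LogisticRegression().fit(X, y[:, i])
        print(f"column {i}: fitted OK")
    except ValueError as e:
        print(f"column {i}: {e}")
```

Column 0 fits fine; column 1 raises the same "at least 2 classes" ValueError quoted above.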
You can check the class distribution of each label by running

np.mean(y_train,axis=0)

and, consequently, check which class has only zeroes.
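On a small made-up y_train, this check looks as follows; np.flatnonzero then names the offending columns directly:

```python
import numpy as np

# Toy y_train: 6 samples, 4 labels; column 2 has no positive samples.
y_train = np.array([[1, 0, 0, 1],
                    [0, 1, 0, 0],
                    [1, 0, 0, 1],
                    [0, 0, 0, 0],
                    [1, 1, 0, 0],
                    [0, 0, 0, 1]])

pos_rate = y_train.mean(axis=0)        # fraction of positives per label
empty = np.flatnonzero(pos_rate == 0)  # labels with no positive sample
print(empty)  # -> [2]
```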

Approach 1

In supervised learning, if a class has only negative samples (only zeroes), it is reasonable to always output zero, since there is nothing to learn about the positive class.
In scikit-learn, some supervised-learning algorithms will always output zero for a class that only has zeroes. Some examples are KNeighborsClassifier and DecisionTreeClassifier.
Alternatively, you can simply discard the column vectors that contain only zeroes:

new_y = y_train[:,y_train.sum(axis=0)!=0]
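If you go this route, keep the boolean mask around so predictions on the reduced label set can be mapped back to the original 58 columns. A minimal sketch on toy data (pred_kept here is a hypothetical prediction array, not real model output):

```python
import numpy as np

y_train = np.array([[0, 1, 0],
                    [0, 0, 1],
                    [0, 1, 1]])

keep = y_train.sum(axis=0) != 0  # boolean mask of usable labels
new_y = y_train[:, keep]         # drop the all-zero column(s)
print(new_y.shape)               # (3, 2)

# Later, expand predictions for the kept labels back to the full label set:
pred_kept = np.array([[1, 0], [0, 1], [1, 1]])  # hypothetical predictions
full_pred = np.zeros((pred_kept.shape[0], y_train.shape[1]), dtype=int)
full_pred[:, keep] = pred_kept
print(full_pred)
```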

Approach 2

But if your test data actually contains both zeroes and ones for all classes, you may want to regenerate your training data so that every class has both negative and positive samples in the training set. Search for stratification for multi-label data, such as iterative-stratification.
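Before reaching for a dedicated stratifier, a quick sanity check shows why a naive split can strand a rare label with a single class; the helper below (a sketch on made-up data) reports which label columns end up single-class for a given training index set:

```python
import numpy as np

def labels_with_one_class(y, train_idx):
    """Return indices of label columns whose training fold has only one class."""
    pos = y[train_idx].sum(axis=0)
    return np.flatnonzero((pos == 0) | (pos == len(train_idx)))

# A rare label (column 2) that is positive only in samples 8 and 9.
y = np.zeros((10, 3), dtype=int)
y[::2, 0] = 1
y[1::2, 1] = 1
y[[8, 9], 2] = 1

print(labels_with_one_class(y, np.arange(8)))   # first 8 samples -> [2]
print(labels_with_one_class(y, np.arange(10)))  # full data -> []
```

A multi-label stratifier such as MultilabelStratifiedKFold from the iterative-stratification package is designed to produce folds where this check comes back empty whenever the data allows it.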

Approach 3

If this training data is not possible for some reason, you may want to take a look at one-class classification algorithms or novelty detection. These algorithms can learn from training data that only has "normal" samples, that is, all samples belong to the same class.
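As a minimal illustration of the one-class idea (a sketch on synthetic 2-D data, not tied to the TF-IDF features above), scikit-learn's OneClassSVM learns a boundary around "normal" samples and flags points far from them as novelties:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Training data: only "normal" samples (a single class), clustered near the origin.
X_train = rng.normal(0.0, 0.5, size=(200, 2))

clf = OneClassSVM(nu=0.05, gamma="scale").fit(X_train)

# predict() returns +1 for samples deemed "normal" and -1 for novelties.
print(clf.predict([[0.0, 0.0]]))  # near the training cloud: expected +1
print(clf.predict([[5.0, 5.0]]))  # far from the training cloud: expected -1
```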
