Unable to use LogisticRegression for multilabel classification

Posted on 2025-01-21 06:03:19


I'm using LogisticRegression from the sklearn library along with MultiOutputClassifier in order to use LR for multilabel classification. Unfortunately I'm getting an error when running this code:

res = MultiOutputClassifier(estimator=LogisticRegression()).fit(x_train, y_train)

Error:

ValueError: This solver needs samples of at least 2 classes in the data,
but the data contains only one class: 0

This does not make sense to me because x_train has shape (2210, 2000) while y_train has shape (2210, 58), which means it has 58 classes.

y_train = array([[0, 0, 1, ..., 0, 0, 0],
   [0, 0, 0, ..., 0, 0, 0],
   [0, 0, 0, ..., 0, 0, 0],
   ...,
   [0, 0, 0, ..., 1, 0, 1],
   [0, 0, 0, ..., 0, 0, 0],
   [0, 0, 0, ..., 0, 0, 0]])

x_train is an array of bag-of-words features weighted by TF-IDF (term frequency / inverse document frequency):

x_train = array([[0.        , 0.        , 0.02571182, ..., 0.        , 0.        ,
    0.        ],
   [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
    0.        ],
   [0.        , 0.11247333, 0.        , ..., 0.09392727, 0.        ,
    0.        ],
   ...,
   [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
    0.        ],
   [0.        , 0.07308953, 0.        , ..., 0.09155637, 0.        ,
    0.        ],
   [0.        , 0.07492016, 0.        , ..., 0.        , 0.        ,
    0.        ]])


千紇 2025-01-28 06:03:19


Your MultiOutputClassifier trains a separate Logistic Regression for each class, so that the i-th logistic regressor is trained on y_train[:,i] (a column vector). The error

ValueError: This solver needs samples of at least 2 classes in the data,
but the data contains only one class: 0

is raised by Logistic Regression and indicates that one of these column vectors (y_train[:,i]) contains only zeroes (i.e., only class 0). Logistic Regression (and many other binary classifiers) cannot be trained on a dataset where only a single class is present.
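The per-column behavior can be reproduced directly: fitting one LogisticRegression per label column on a toy dataset (the data below is made up for illustration) fails on exactly the columns that contain a single class, which is what MultiOutputClassifier does internally.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((20, 5))

# Two toy label columns: column 0 has both classes, column 1 is all zeroes.
y = np.zeros((20, 2), dtype=int)
y[:10, 0] = 1

# Mimic MultiOutputClassifier: one binary classifier per label column.
for i in range(y.shape[1]):
    try:
        LogisticRegression().fit(X, y[:, i])
        print(f"column {i}: fitted OK")
    except ValueError as e:
        print(f"column {i}: {e}")
```

Column 0 fits fine; column 1 raises the same "at least 2 classes" ValueError quoted above.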
You can check the class distribution of each label by running

np.mean(y_train,axis=0)

and, consequently, check which class has only zeroes.
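On a small made-up y_train, this check looks as follows; np.flatnonzero then names the offending columns directly:

```python
import numpy as np

# Toy y_train: 6 samples, 4 labels; column 2 has no positive samples.
y_train = np.array([[1, 0, 0, 1],
                    [0, 1, 0, 0],
                    [1, 0, 0, 1],
                    [0, 0, 0, 0],
                    [1, 1, 0, 0],
                    [0, 0, 0, 1]])

pos_rate = y_train.mean(axis=0)        # fraction of positives per label
empty = np.flatnonzero(pos_rate == 0)  # labels with no positive sample
print(empty)  # -> [2]
```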

Approach 1

In supervised learning, if a class has only negative samples (only zeroes), it is reasonable to always output zero, since there is nothing to learn about the positive class.
In scikit-learn, some supervised-learning algorithms will always output zero for a class that only has zeroes. Some examples are KNeighborsClassifier and DecisionTreeClassifier.
Alternatively, you can simply discard the column vectors that contain only zeroes:

new_y = y_train[:,y_train.sum(axis=0)!=0]
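If you go this route, keep the boolean mask around so predictions on the reduced label set can be mapped back to the original 58 columns. A minimal sketch on toy data (pred_kept here is a hypothetical prediction array, not real model output):

```python
import numpy as np

y_train = np.array([[0, 1, 0],
                    [0, 0, 1],
                    [0, 1, 1]])

keep = y_train.sum(axis=0) != 0  # boolean mask of usable labels
new_y = y_train[:, keep]         # drop the all-zero column(s)
print(new_y.shape)               # (3, 2)

# Later, expand predictions for the kept labels back to the full label set:
pred_kept = np.array([[1, 0], [0, 1], [1, 1]])  # hypothetical predictions
full_pred = np.zeros((pred_kept.shape[0], y_train.shape[1]), dtype=int)
full_pred[:, keep] = pred_kept
print(full_pred)
```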

Approach 2

But if your test data actually contains both zeroes and ones for all classes, you may want to regenerate your training data so that every class has both negative and positive samples in the training set. Search for stratification for multi-label data, such as iterative-stratification.
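Before reaching for a dedicated stratifier, a quick sanity check shows why a naive split can strand a rare label with a single class; the helper below (a sketch on made-up data) reports which label columns end up single-class for a given training index set:

```python
import numpy as np

def labels_with_one_class(y, train_idx):
    """Return indices of label columns whose training fold has only one class."""
    pos = y[train_idx].sum(axis=0)
    return np.flatnonzero((pos == 0) | (pos == len(train_idx)))

# A rare label (column 2) that is positive only in samples 8 and 9.
y = np.zeros((10, 3), dtype=int)
y[::2, 0] = 1
y[1::2, 1] = 1
y[[8, 9], 2] = 1

print(labels_with_one_class(y, np.arange(8)))   # first 8 samples -> [2]
print(labels_with_one_class(y, np.arange(10)))  # full data -> []
```

A multi-label stratifier such as MultilabelStratifiedKFold from the iterative-stratification package is designed to produce folds where this check comes back empty whenever the data allows it.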

Approach 3

If this training data is not possible for some reason, you may want to take a look at one-class classification algorithms or novelty detection. These algorithms can learn from training data that only has "normal" samples, that is, all samples belong to the same class.
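As a minimal illustration of the one-class idea (a sketch on synthetic 2-D data, not tied to the TF-IDF features above), scikit-learn's OneClassSVM learns a boundary around "normal" samples and flags points far from them as novelties:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Training data: only "normal" samples (a single class), clustered near the origin.
X_train = rng.normal(0.0, 0.5, size=(200, 2))

clf = OneClassSVM(nu=0.05, gamma="scale").fit(X_train)

# predict() returns +1 for samples deemed "normal" and -1 for novelties.
print(clf.predict([[0.0, 0.0]]))  # near the training cloud: expected +1
print(clf.predict([[5.0, 5.0]]))  # far from the training cloud: expected -1
```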
