UserWarning: unknown class(es) when using MultiLabelBinarizer
I am using MultiLabelBinarizer for a multiclass classification problem. When I transform the test data, I get the following warning:
/local/Anaconda/lib/python3.8/site-packages/sklearn/preprocessing/_label.py:994: UserWarning: unknown class(es) ['235', '256', '546', '425'] will be ignored
warnings.warn('unknown class(es) {0} will be ignored'.
Is there a way to avoid this warning? Will it impact the performance of my classifier?
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
mlb.fit(df_train['outcome'])
y_train = mlb.transform(df_train['outcome'])
y_test = mlb.transform(df_test['outcome'])
print(y_train)
print(y_test)
Comments (2)
That means there are classes in the test data that aren't in the training data. I suggest fitting the mlb to a combined list of both the training and test outcomes.
The performance of the classifier is affected since you have outcomes in your test data that have no instances in your training data for your model to train on.
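As a rough sketch of what fitting on the combined outcomes could look like: the lists `train_labels` and `test_labels` below are made-up stand-ins for `df_train['outcome']` and `df_test['outcome']`, and the label values are hypothetical. Note that this only gives every label a column; the model still has no training examples for labels that occur only in the test split.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy labels standing in for df_train['outcome'] / df_test['outcome'].
train_labels = [["101", "102"], ["102"], ["103"]]
test_labels = [["101", "235"], ["103"]]

mlb = MultiLabelBinarizer()
# Fit on the labels of both splits so that every class gets its own column.
mlb.fit(train_labels + test_labels)

y_train = mlb.transform(train_labels)
y_test = mlb.transform(test_labels)

print(mlb.classes_)                 # includes "235", which only occurs in the test labels
print(y_train.shape, y_test.shape)  # both matrices have one column per known class
```

If the full set of labels is known up front, passing it through the `classes` parameter of `MultiLabelBinarizer` should give the same columns without fitting on the test labels themselves.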
It will impact the performance of the classifier but in a good way, given that fitting the Binarizer on training AND test data would be a form of data leakage.
I would argue against fitting on all data. The test set is a simulation of NEW data that your pipeline/model has never seen before, and therefore you would not include it.
The warning is telling you that the object is performing exactly as expected. It builds the vocabulary of the features (.classes_) and then transforms new data with that same vocabulary. If there are new features it hasn't seen, it will ignore them in the conversion and then raise the warning.
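To illustrate that behaviour, here is a minimal sketch that fits on training labels only and then transforms test labels containing an unseen class. The label lists are hypothetical stand-ins for the question's data. It also shows one possible way to silence the warning locally with the standard `warnings` module, assuming you have decided that dropping unseen labels is acceptable.

```python
import warnings
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy labels standing in for df_train['outcome'] / df_test['outcome'].
train_labels = [["101", "102"], ["102"], ["103"]]
test_labels = [["101", "235"], ["103"]]  # "235" never appears in training

mlb = MultiLabelBinarizer()
mlb.fit(train_labels)  # the vocabulary (mlb.classes_) comes from training data only

# Record the warning instead of printing it, so it can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    y_test = mlb.transform(test_labels)

print(mlb.classes_)  # only the labels seen during fit
print(y_test)        # the unseen label "235" has no column and is dropped
print([str(w.message) for w in caught])

# Optional: suppress just this warning around the transform call.
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", message="unknown class", category=UserWarning)
    y_test_quiet = mlb.transform(test_labels)
```

Suppressing the warning does not change the output matrix; it only hides the message, so it is probably worth checking first how many test labels are being dropped.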