Low accuracy of K-NN model due to OneHotEncoding
I tried building a K-Nearest Neighbors model for a dataset in which the dependent variable can take 3 different categorical values.
I built two different models: one where I one-hot encoded the dependent variable, and one where I didn't use any encoding.
x_3class = class3.iloc[:, :-1].values
y_3class = class3.iloc[:, -1:].values

from sklearn.preprocessing import OneHotEncoder, StandardScaler
ohe = OneHotEncoder(categories="auto")
y_3class_ohencoded = ohe.fit_transform(y_3class).toarray()

from sklearn.model_selection import train_test_split, cross_val_score
# non-encoded split
x3c_train, x3c_test, y3c_train, y3c_test = train_test_split(x_3class, y_3class, test_size=0.2, random_state=1)
# one-hot-encoded split
x_train3, x_test3, y_train3, y_test3 = train_test_split(x_3class, y_3class_ohencoded, test_size=0.2, random_state=1)

# Feature scaling
sc_3class = StandardScaler()
x3c_train = sc_3class.fit_transform(x3c_train)
x3c_test = sc_3class.transform(x3c_test)
sc_3class_ohe = StandardScaler()
x_train3 = sc_3class_ohe.fit_transform(x_train3)
x_test3 = sc_3class_ohe.transform(x_test3)

# Model building
from sklearn.neighbors import KNeighborsClassifier
knn_classifier_3class = KNeighborsClassifier(n_neighbors=18)
knn_classifier_ohe = KNeighborsClassifier(n_neighbors=18)
knn_classifier_3class.fit(x3c_train, y3c_train)
knn_classifier_ohe.fit(x_train3, y_train3)

# Accuracy evaluation
nonencoded_accuracy = cross_val_score(knn_classifier_3class, x3c_test, y3c_test, cv=10)
onehotencoded_accuracy = cross_val_score(knn_classifier_ohe, x_test3, y_test3, cv=10)
print("NonEncoded Model Accuracy: %0.2f" % nonencoded_accuracy.mean(), "\n",
      "OHEncoded Model Accuracy: %0.2f" % onehotencoded_accuracy.mean())
The accuracy score of the non-encoded model was 13 percentage points higher than that of the one-hot-encoded model:
NonEncoded Model Accuracy: 0.63
OHEncoded Model Accuracy: 0.50
What would be the reason for such a big difference?
When you one-hot encode the target, sklearn sees multiple columns and assumes you have a multilabel problem; that is, that each row can have more than one (or even no) label.
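For illustration, here is a minimal sketch on toy random data (not the asker's dataset) showing that a 2-D one-hot target makes KNeighborsClassifier return per-column 0/1 indicator rows rather than a single class label:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import OneHotEncoder

rng = np.random.RandomState(0)
X = rng.randn(100, 2)                 # toy features
y = rng.randint(0, 3, size=(100, 1))  # 3 classes, shape (100, 1)
Y = OneHotEncoder(categories="auto").fit_transform(y).toarray()  # shape (100, 3)

knn = KNeighborsClassifier(n_neighbors=18).fit(X, Y)
print(knn.predict(X[:5]))  # 0/1 indicator rows; an all-zero row means
                           # no class was predicted for that point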
In kNN, this likely results in some points receiving no label: with k=18 as in your case, consider a point whose nearest neighbors comprise 8, 6, and 4 examples of classes 0, 1, and 2 respectively. Without encoding, it gets label 0. With encoding, we get separate kNN models in a one-vs-rest fashion (the first label after encoding is 1 for class 0 and 0 for either class 1 or 2, etc.). So the first model sees 8 versus 6+4 examples and predicts "not class 0". Similarly, the other two models predict zero, and the output is all zeros, i.e. no class was predicted. If you use cross_val_predict instead, I expect you'll see this.
The default scoring for multilabel problems is also pretty harsh, but in this case it doesn't matter: your model will only ever predict no class or exactly one class (erm, except maybe for ties?).
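Reusing the variable names from the question, here is a quick way to check this (a hedged sketch, assuming the question's code has already been run):

from sklearn.model_selection import cross_val_predict

# Look at the raw multilabel predictions instead of the score:
# a row summing to 0 received no label; a row summing to 1 received exactly one.
ohe_preds = cross_val_predict(knn_classifier_ohe, x_test3, y_test3, cv=10)
print("points with no predicted class:", int((ohe_preds.sum(axis=1) == 0).sum()))
print("points with exactly one predicted class:", int((ohe_preds.sum(axis=1) == 1).sum()))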