Low accuracy of K-NN model due to OneHotEncoding
I tried building a K-Nearest Neighbors model for a dataset in which the dependent variable can take 3 different categorical values.
I built two different models: one where I one-hot encoded the dependent variable, and one where I didn't use any encoding.
x_3class = class3.iloc[:, :-1].values
y_3class = class3.iloc[:, -1:].values

from sklearn.preprocessing import OneHotEncoder, StandardScaler
ohe = OneHotEncoder(categories="auto")
y_3class_ohencoded = ohe.fit_transform(y_3class).toarray()

from sklearn.model_selection import train_test_split, cross_val_score
# non-encoded split
x3c_train, x3c_test, y3c_train, y3c_test = train_test_split(x_3class, y_3class, test_size=0.2, random_state=1)
# one-hot-encoded split
x_train3, x_test3, y_train3, y_test3 = train_test_split(x_3class, y_3class_ohencoded, test_size=0.2, random_state=1)

# Feature scaling
sc_3class = StandardScaler()
x3c_train = sc_3class.fit_transform(x3c_train)
x3c_test = sc_3class.transform(x3c_test)
sc_3class_ohe = StandardScaler()
x_train3 = sc_3class_ohe.fit_transform(x_train3)
x_test3 = sc_3class_ohe.transform(x_test3)

# Model building
from sklearn.neighbors import KNeighborsClassifier
knn_classifier_3class = KNeighborsClassifier(n_neighbors=18)
knn_classifier_ohe = KNeighborsClassifier(n_neighbors=18)
knn_classifier_3class.fit(x3c_train, y3c_train)
knn_classifier_ohe.fit(x_train3, y_train3)

# Accuracy evaluation
nonencoded_accuracy = cross_val_score(knn_classifier_3class, x3c_test, y3c_test, cv=10)
onehotencoded_accuracy = cross_val_score(knn_classifier_ohe, x_test3, y_test3, cv=10)
print("NonEncoded Model Accuracy: %0.2f" % nonencoded_accuracy.mean(), "\n",
      "OHEncoded Model Accuracy: %0.2f" % onehotencoded_accuracy.mean())
The accuracy score of the non-encoded model was 13 percentage points higher than that of the one-hot-encoded model:
NonEncoded Model Accuracy: 0.63
OHEncoded Model Accuracy: 0.50
What would be the reason for such a big difference?
When you one-hot encode the target, sklearn sees multiple columns and assumes you have a multilabel problem; that is, that each row can have more than one (or even no) label.
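For illustration, here is a minimal sketch on toy random data (not the asker's dataset) showing that a 2-D one-hot target makes KNeighborsClassifier return per-column 0/1 indicator rows rather than a single class label:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import OneHotEncoder

rng = np.random.RandomState(0)
X = rng.randn(100, 2)                 # toy features
y = rng.randint(0, 3, size=(100, 1))  # 3 classes, shape (100, 1)
Y = OneHotEncoder(categories="auto").fit_transform(y).toarray()  # shape (100, 3)

knn = KNeighborsClassifier(n_neighbors=18).fit(X, Y)
print(knn.predict(X[:5]))  # 0/1 indicator rows; an all-zero row means
                           # no class was predicted for that point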
In kNN, this likely results in some points receiving no label: with k=18 as in your case, consider a point whose nearest neighbors comprise 8, 6, and 4 examples of classes 0, 1, and 2 respectively. Without encoding, it gets label 0. With encoding, we get separate kNN models in a one-vs-rest fashion (the first label after encoding is 1 for class 0 and 0 for either class 1 or 2, etc.). So the first model sees 8 versus 6+4 examples and predicts "not class 0". Similarly, the other two models predict zero, and the output is all zeros, i.e. no class was predicted. If you use cross_val_predict instead, I expect you'll see this.
The default scoring for multilabel problems is also pretty harsh, but in this case it doesn't matter: your model will only ever predict no class or exactly one class (erm, except maybe for ties?).
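Reusing the variable names from the question, here is a quick way to check this (a hedged sketch, assuming the question's code has already been run):

from sklearn.model_selection import cross_val_predict

# Look at the raw multilabel predictions instead of the score:
# a row summing to 0 received no label; a row summing to 1 received exactly one.
ohe_preds = cross_val_predict(knn_classifier_ohe, x_test3, y_test3, cv=10)
print("points with no predicted class:", int((ohe_preds.sum(axis=1) == 0).sum()))
print("points with exactly one predicted class:", int((ohe_preds.sum(axis=1) == 1).sum()))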