Lower K-NN model accuracy due to OneHotEncoding

Posted 2025-01-17 07:41:30


I tried building a K-Nearest Neighbor model for a dataset in which the dependent variable can take 3 different categorical values.

I built 2 different models, one where I OneHotEncoded the dependent variable and one where I didn't use any encoding.

x_3class = class3.iloc[:,:-1].values
y_3class = class3.iloc[:,-1:].values 

from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(categories="auto")
y_3class_ohencoded = ohe.fit_transform(y_3class).toarray() 

from sklearn.model_selection import train_test_split
#non-encoded split
x3c_train,x3c_test,y3c_train,y3c_test = train_test_split(x_3class,y_3class,test_size=0.2,random_state=1)

#onehotencoded split
x_train3,x_test3,y_train3,y_test3 = train_test_split(x_3class,y_3class_ohencoded,test_size=0.2,random_state=1)

#Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_3class = StandardScaler()
x3c_train = sc_3class.fit_transform(x3c_train)
x3c_test = sc_3class.transform(x3c_test)
sc_3class_ohe = StandardScaler()
x_train3 = sc_3class_ohe.fit_transform(x_train3)
x_test3 = sc_3class_ohe.transform(x_test3)  # use the scaler that was fit on x_train3

#Model Building 
from sklearn.neighbors import KNeighborsClassifier 
knn_classifier_3class = KNeighborsClassifier(n_neighbors=18)
knn_classifier_ohe = KNeighborsClassifier(n_neighbors=18)

knn_classifier_3class.fit(x3c_train,y3c_train.ravel())  # ravel() passes 1-D labels, avoiding sklearn's column-vector warning
knn_classifier_ohe.fit(x_train3,y_train3)

#Accuracy Evaluation
from sklearn.model_selection import cross_val_score
nonencoded_accuracy = cross_val_score(knn_classifier_3class,x3c_test,y3c_test.ravel(),cv=10)
onehotencoded_accuracy = cross_val_score(knn_classifier_ohe,x_test3,y_test3,cv=10)
onehotencoded_accuracy=cross_val_score(knn_classifier_ohe,x_test3,y_test3,cv=10)

print("NonEncoded Model Accuracy: %0.2f" %(nonencoded_accuracy.mean()),"\n",
"OHEncoded Model Accuracy: %0.2f"%(onehotencoded_accuracy.mean()))

The accuracy of the non-encoded model was 13 percentage points higher than that of the OneHotEncoded model.

NonEncoded Model Accuracy: 0.63 
 OHEncoded Model Accuracy: 0.50

What would be the reason for such a big difference?


Comments (1)

原来分手还会想你 2025-01-24 07:41:30

When you one-hot encode the target, sklearn sees multiple columns and assumes you have a multilabel problem; that is, that each row can have more than one (or even no) label.
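
You can check what sklearn infers from each target shape with its type_of_target utility. A minimal sketch with made-up arrays (type_of_target is real sklearn API; the data is illustrative):

import numpy as np
from sklearn.utils.multiclass import type_of_target

y = np.array([0, 1, 2, 1])            # plain integer labels
y_ohe = np.array([[1, 0, 0],          # the same labels, one-hot encoded
                  [0, 1, 0],
                  [0, 0, 1],
                  [0, 1, 0]])

print(type_of_target(y))      # 'multiclass'
print(type_of_target(y_ohe))  # 'multilabel-indicator'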

In kNN, this likely results in some points receiving no label at all. With k=18 as in your case, consider a point whose 18 nearest neighbors split 8, 6, 4 across classes 0, 1, 2. Without encoding, it gets label 0 by plurality vote. With encoding, each label column is effectively voted on separately, in a one-vs-rest fashion. (The first column after encoding is 1 for class 0 and 0 for class 1 or 2, etc.) So the first vote sees 8 positives against 6+4 negatives and predicts "not class 0"; similarly the other two columns come out 0, and the output row is all zeros, i.e. no class was predicted. If you use cross_val_predict instead of cross_val_score, I expect you'll see this.
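
To make the vote-splitting concrete, here is a small synthetic sketch (random data and hypothetical variable names, not your dataset). With k=18 and three roughly balanced classes, most prediction rows come out all zeros, because no single label column reaches a majority among the neighbors:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = rng.integers(0, 3, size=300)  # 3 roughly balanced classes

y_ohe = OneHotEncoder(categories="auto").fit_transform(y.reshape(-1, 1)).toarray()

knn_plain = KNeighborsClassifier(n_neighbors=18).fit(X, y)
knn_multi = KNeighborsClassifier(n_neighbors=18).fit(X, y_ohe)

pred_plain = knn_plain.predict(X)  # always exactly one class per row
pred_multi = knn_multi.predict(X)  # indicator matrix; rows can be all zeros

print("rows with no predicted class: %0.2f" % (pred_multi.sum(axis=1) == 0).mean())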

The default scoring for multilabel problems is also pretty harsh: accuracy here is subset accuracy, so a row only counts as correct when every label matches exactly. In this case it doesn't matter much, though: your model will only ever predict no class or exactly one (erm, except maybe for ties?).
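
For reference, a tiny check of that scoring rule (values made up): accuracy_score on multilabel indicators is subset accuracy, so an all-zero row counts as fully wrong even when two of its three label columns match:

import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([[1, 0, 0],
                   [0, 1, 0]])
y_pred = np.array([[0, 0, 0],   # no class predicted -- scores 0 despite 2/3 columns matching
                   [0, 1, 0]])  # exact match -- scores 1

print(accuracy_score(y_true, y_pred))  # 0.5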
