Is it correct to view a PCA scatter plot colored by K-means predicted labels?
For the data given here:
feat_1 feat_2 Label
4.818919448 -8.88997718 0
2.239877125 -7.142062835 0
2.715454379 -9.392740116 0
1.457970779 -9.295304121 0
3.396769719 -4.696564243 0
-0.251264375 -3.11639814 0
1.553138885 -2.56360423 0
2.556077961 -1.639727669 0
3.264100784 -5.353501855 0
5.54079929 -2.810777111 0
-2.063969924 0.127805678 1
-1.691797179 0.835738844 1
-1.350084344 0.469993022 1
-1.672611658 0.873301506 1
-1.956488821 0.804911876 1
-1.529121941 1.112561558 1
-2.091905556 0.72908025 1
-1.835806179 0.801126086 1
-1.963433251 0.558394092 1
-2.576833733 -0.148751731 1
5.262121279 -0.291153029 2
4.150999653 4.60229228 2
2.538967939 5.642889255 2
9.908816157 2.380103599 2
9.876931469 2.29522071 2
6.691577612 -2.214740473 2
11.75361142 9.650193692 2
4.099660592 5.048216039 2
8.49165607 2.47194124 2
8.243607045 2.831411268 2
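where x holds the features (the first two columns of the table) and the labels y are given by the third column.
I am applying PCA and then K-means clustering.
Code: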
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# df is the table above, loaded as a pandas DataFrame
X = df.drop(columns=['Label']).values  # features: feat_1, feat_2
y = df['Label'].values                 # original class labels
pca = PCA()
x_pca = pca.fit_transform(X)  # fit and project in a single step
from sklearn.cluster import KMeans
k_means = KMeans(n_clusters=3, random_state=42)
k_means = k_means.fit(x_pca)
kmeans_labels = k_means.predict(x_pca)
kmeans_labels
target_names = ['class_0', 'class_1', 'class_2']
plt.figure(figsize=(8,6))
plot = plt.scatter(x_pca[:,0],x_pca[:,1],c=y,s=20, cmap=plt.cm.jet, linewidths=0, alpha=0.5)
plt.scatter(k_means.cluster_centers_[:,0], k_means.cluster_centers_[:,1], marker="x", color='k', s=40)
plt.legend(handles=plot.legend_elements()[0], labels=list(target_names))
plt.xlabel('PC 1')  # the axes are principal components, not the raw features
plt.ylabel('PC 2')
plt.title('KMeans')
plt.show()
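If I use c=y in plt.scatter, I get the first plot.
If I use c=kmeans_labels, I get a second plot, which separates the classes nicely.
Is this the correct view?
Also, is it OK to use this separation of the data to train a model, like this: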
X_train, X_test, y_train, y_test = train_test_split(x_pca, kmeans_labels, test_size=0.3, random_state=42)
Or do I have to stick with the original labels, like this:
X_train, X_test, y_train, y_test = train_test_split(x_pca, y, test_size=0.3, random_state=42)
where y = df['Label'].values?
Thanks for your help and time!
Comments (1)
When you use the KMeans labels for visualization, you are showing how the data clusters while ignoring the original labels. But your data is already labeled, so clustering it to obtain labels doesn't make sense. The first visualization is therefore the correct one, and in the same way you should use only the original labels to train any model.
Based on the first visualization, your classes appear to be heavily intertwined, and it would probably be too hard for simple models to predict them. If possible, I would recommend additional feature engineering before fitting models. For any further recommendations, we would need more information about your data.
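To make the point above concrete, here is a minimal sketch (not the asker's exact workflow) that measures how well the K-means clusters agree with the original labels and then trains a model on the original labels only. The data is copied from the question. The choice of `LogisticRegression`, the `stratify=y` argument, and `n_init=10` are assumptions added for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import adjusted_rand_score, confusion_matrix
from sklearn.model_selection import train_test_split

# The 30 rows from the question: feat_1, feat_2, Label
DATA = """
4.818919448 -8.88997718 0
2.239877125 -7.142062835 0
2.715454379 -9.392740116 0
1.457970779 -9.295304121 0
3.396769719 -4.696564243 0
-0.251264375 -3.11639814 0
1.553138885 -2.56360423 0
2.556077961 -1.639727669 0
3.264100784 -5.353501855 0
5.54079929 -2.810777111 0
-2.063969924 0.127805678 1
-1.691797179 0.835738844 1
-1.350084344 0.469993022 1
-1.672611658 0.873301506 1
-1.956488821 0.804911876 1
-1.529121941 1.112561558 1
-2.091905556 0.72908025 1
-1.835806179 0.801126086 1
-1.963433251 0.558394092 1
-2.576833733 -0.148751731 1
5.262121279 -0.291153029 2
4.150999653 4.60229228 2
2.538967939 5.642889255 2
9.908816157 2.380103599 2
9.876931469 2.29522071 2
6.691577612 -2.214740473 2
11.75361142 9.650193692 2
4.099660592 5.048216039 2
8.49165607 2.47194124 2
8.243607045 2.831411268 2
"""
table = np.array([line.split() for line in DATA.strip().splitlines()], dtype=float)
X = table[:, :2]             # feat_1, feat_2
y = table[:, 2].astype(int)  # original labels

# Same preprocessing as in the question: PCA, then K-means with 3 clusters
x_pca = PCA().fit_transform(X)
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(x_pca)

# Cluster ids are arbitrary, so compare with a permutation-invariant score
ari = adjusted_rand_score(y, kmeans_labels)
cm = confusion_matrix(y, kmeans_labels)  # rows: true label, columns: cluster id
print("Adjusted Rand index:", ari)
print(cm)

# Train on the ORIGINAL labels; stratify keeps class proportions in both splits
X_train, X_test, y_train, y_test = train_test_split(
    x_pca, y, test_size=0.3, random_state=42, stratify=y
)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
acc = clf.score(X_test, y_test)
print("Test accuracy on original labels:", acc)
```

An adjusted Rand index close to 1 would mean the clusters largely reproduce the labels, while a low value means the neatly separated second plot reflects the geometry K-means found rather than the real classes. Either way, a model trained on `kmeans_labels` would learn to reproduce the clustering, not the original labels.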