Is it correct to view a PCA scatter plot with K-means-predicted labels?

Published 2025-02-08 12:31:46 · 2,808 characters · 0 views · 0 comments


For the data given here:

feat_1  feat_2  Label
4.818919448 -8.88997718 0
2.239877125 -7.142062835    0
2.715454379 -9.392740116    0
1.457970779 -9.295304121    0
3.396769719 -4.696564243    0
-0.251264375    -3.11639814 0
1.553138885 -2.56360423 0
2.556077961 -1.639727669    0
3.264100784 -5.353501855    0
5.54079929  -2.810777111    0
-2.063969924    0.127805678 1
-1.691797179    0.835738844 1
-1.350084344    0.469993022 1
-1.672611658    0.873301506 1
-1.956488821    0.804911876 1
-1.529121941    1.112561558 1
-2.091905556    0.72908025  1
-1.835806179    0.801126086 1
-1.963433251    0.558394092 1
-2.576833733    -0.148751731    1
5.262121279 -0.291153029    2
4.150999653 4.60229228  2
2.538967939 5.642889255 2
9.908816157 2.380103599 2
9.876931469 2.29522071  2
6.691577612 -2.214740473    2
11.75361142 9.650193692 2
4.099660592 5.048216039 2
8.49165607  2.47194124  2
8.243607045 2.831411268 2

where X is the features (the first two columns of the table) and the labels y are given by the third column.

I am applying PCA and then running k-means clustering.

CODE

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X = df.drop(columns=['Label']).values  # df holds the table above
y = df['Label'].values

# Fit and project in one step (no need to call fit and then fit_transform)
pca = PCA()
x_pca = pca.fit_transform(X)

k_means = KMeans(n_clusters=3, random_state=42)
k_means = k_means.fit(x_pca)
kmeans_labels = k_means.predict(x_pca)
kmeans_labels

target_names = ['class_0', 'class_1', 'class_2']
plt.figure(figsize=(8, 6))
plot = plt.scatter(x_pca[:, 0], x_pca[:, 1], c=y, s=20, cmap=plt.cm.jet, linewidths=0, alpha=0.5)
plt.scatter(k_means.cluster_centers_[:, 0], k_means.cluster_centers_[:, 1], marker="x", color='k', s=40)
plt.legend(handles=plot.legend_elements()[0], labels=list(target_names))
plt.xlabel('PC1')  # the axes are principal components, not the original features
plt.ylabel('PC2')
plt.title('KMeans')
plt.show()

If I use c=y in the plt.scatter plot, I get this:

[scatter plot of x_pca colored by the true labels y]

If I use c=kmeans_labels in the plt.scatter plot, I then get this:

[scatter plot of x_pca colored by kmeans_labels]

The second plot separates the classes nicely.

Is this a correct view?
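One way to answer this quantitatively is to measure how well the k-means clusters agree with the given labels, e.g. with the adjusted Rand index. The sketch below is self-contained, using synthetic blobs from make_blobs as a hypothetical stand-in for the table above:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score

# Hypothetical stand-in for the table above: 30 labeled 2-D points in 3 classes.
X, y = make_blobs(n_samples=30, centers=3, cluster_std=0.8, random_state=42)

x_pca = PCA(n_components=2).fit_transform(X)
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(x_pca)

# ARI is 1.0 when the clusters match the true classes up to relabeling,
# and near 0 for assignments no better than chance.
ari = adjusted_rand_score(y, km.labels_)
print(round(ari, 3))
```

A low ARI on the real data would mean the clusters found by k-means are genuinely different groupings from the labeled classes, not just a relabeled version of them.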

Also, can this data separation be used to train a model like this:

X_train, X_test, y_train, y_test = train_test_split(x_pca, kmeans_labels, test_size=0.3, random_state=42)

or do I have to stick with the original labels like this:

X_train, X_test, y_train, y_test = train_test_split(x_pca, y, test_size=0.3, random_state=42)

where: y = df['Label'].values?
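For the second variant (training on the original labels), a minimal end-to-end sketch might look like the following. It again uses make_blobs as a hypothetical stand-in for df, and LogisticRegression as an arbitrary example classifier:

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real df; replace with your own X and y.
X, y = make_blobs(n_samples=30, centers=3, random_state=42)
x_pca = PCA(n_components=2).fit_transform(X)

# Split on the *original* labels y, not on kmeans_labels;
# stratify keeps the class proportions in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    x_pca, y, test_size=0.3, random_state=42, stratify=y)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
acc = clf.score(X_test, y_test)
print(acc)
```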

Thanks for your help and time!


Comments (1)

隐诗 2025-02-15 12:31:46


When you use the k-means labels for visualization, you are showing how the data clusters while ignoring the original labels. But your data is already labeled, so clustering doesn't make sense here. The first visualization is therefore the correct one, and in the same way you should only use the original labels for training any models.

Based on the first visualization, your classes look heavily intertwined, and a simple model would probably struggle to predict them. If possible, I would recommend additional feature engineering before fitting models. For any more specific recommendations, we would need more information about your data.
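To check how much the class overlap actually hurts before investing in feature engineering, a cheap cross-validated baseline can help. A sketch, again on hypothetical make_blobs data rather than the questioner's real table:

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical stand-in for the questioner's data.
X, y = make_blobs(n_samples=30, centers=3, random_state=0)

# 5-fold cross-validated accuracy of a simple linear classifier:
# if this is already low, that supports doing feature engineering first.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean())
```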
