Is it correct to view a PCA scatter plot with K-means-predicted labels?

Published 2025-02-08 12:31:46 · 2,808 characters · 0 views · 0 comments


For the data given here:

feat_1  feat_2  Label
4.818919448 -8.88997718 0
2.239877125 -7.142062835    0
2.715454379 -9.392740116    0
1.457970779 -9.295304121    0
3.396769719 -4.696564243    0
-0.251264375    -3.11639814 0
1.553138885 -2.56360423 0
2.556077961 -1.639727669    0
3.264100784 -5.353501855    0
5.54079929  -2.810777111    0
-2.063969924    0.127805678 1
-1.691797179    0.835738844 1
-1.350084344    0.469993022 1
-1.672611658    0.873301506 1
-1.956488821    0.804911876 1
-1.529121941    1.112561558 1
-2.091905556    0.72908025  1
-1.835806179    0.801126086 1
-1.963433251    0.558394092 1
-2.576833733    -0.148751731    1
5.262121279 -0.291153029    2
4.150999653 4.60229228  2
2.538967939 5.642889255 2
9.908816157 2.380103599 2
9.876931469 2.29522071  2
6.691577612 -2.214740473    2
11.75361142 9.650193692 2
4.099660592 5.048216039 2
8.49165607  2.47194124  2
8.243607045 2.831411268 2

where X is the features (the first two columns of the table) and the labels y are given by the third column.

I am applying PCA and then running k-means clustering.

CODE

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X = df.drop(columns=['Label']).values  # df holds the table above
y = df['Label'].values

# Fit and project in one step (no need to call fit and then fit_transform)
pca = PCA()
x_pca = pca.fit_transform(X)

k_means = KMeans(n_clusters=3, random_state=42)
k_means = k_means.fit(x_pca)
kmeans_labels = k_means.predict(x_pca)
kmeans_labels

target_names = ['class_0', 'class_1', 'class_2']
plt.figure(figsize=(8, 6))
plot = plt.scatter(x_pca[:, 0], x_pca[:, 1], c=y, s=20, cmap=plt.cm.jet, linewidths=0, alpha=0.5)
plt.scatter(k_means.cluster_centers_[:, 0], k_means.cluster_centers_[:, 1], marker="x", color='k', s=40)
plt.legend(handles=plot.legend_elements()[0], labels=list(target_names))
plt.xlabel('PC1')  # the axes are principal components, not the original features
plt.ylabel('PC2')
plt.title('KMeans')
plt.show()

If I use c=y in the plt.scatter plot, I get this:

[scatter plot of x_pca colored by the true labels y]

If I use c=kmeans_labels in the plt.scatter plot, I then get this:

[scatter plot of x_pca colored by kmeans_labels]

The second plot separates the classes nicely.

Is this a correct view?
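One way to answer this quantitatively is to measure how well the k-means clusters agree with the given labels, e.g. with the adjusted Rand index. The sketch below is self-contained, using synthetic blobs from make_blobs as a hypothetical stand-in for the table above:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score

# Hypothetical stand-in for the table above: 30 labeled 2-D points in 3 classes.
X, y = make_blobs(n_samples=30, centers=3, cluster_std=0.8, random_state=42)

x_pca = PCA(n_components=2).fit_transform(X)
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(x_pca)

# ARI is 1.0 when the clusters match the true classes up to relabeling,
# and near 0 for assignments no better than chance.
ari = adjusted_rand_score(y, km.labels_)
print(round(ari, 3))
```

A low ARI on the real data would mean the clusters found by k-means are genuinely different groupings from the labeled classes, not just a relabeled version of them.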

Also, can this data separation be used to train a model like this:

X_train, X_test, y_train, y_test = train_test_split(x_pca, kmeans_labels, test_size=0.3, random_state=42)

or do I have to stick with the original labels like this:

X_train, X_test, y_train, y_test = train_test_split(x_pca, y, test_size=0.3, random_state=42)

where: y = df['Label'].values?
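For the second variant (training on the original labels), a minimal end-to-end sketch might look like the following. It again uses make_blobs as a hypothetical stand-in for df, and LogisticRegression as an arbitrary example classifier:

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real df; replace with your own X and y.
X, y = make_blobs(n_samples=30, centers=3, random_state=42)
x_pca = PCA(n_components=2).fit_transform(X)

# Split on the *original* labels y, not on kmeans_labels;
# stratify keeps the class proportions in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    x_pca, y, test_size=0.3, random_state=42, stratify=y)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
acc = clf.score(X_test, y_test)
print(acc)
```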

Thanks for your help and time!


Comments (1)

隐诗 2025-02-15 12:31:46


When you use the k-means labels for visualization, you are showing how the data clusters while ignoring the original labels. But your data is already labeled, so clustering doesn't make sense here. The first visualization is therefore the correct one, and in the same way you should only use the original labels for training any models.

Based on the first visualization, your classes look heavily intertwined, and a simple model would probably struggle to predict them. If possible, I would recommend additional feature engineering before fitting models. For any more specific recommendations, we would need more information about your data.
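To check how much the class overlap actually hurts before investing in feature engineering, a cheap cross-validated baseline can help. A sketch, again on hypothetical make_blobs data rather than the questioner's real table:

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical stand-in for the questioner's data.
X, y = make_blobs(n_samples=30, centers=3, random_state=0)

# 5-fold cross-validated accuracy of a simple linear classifier:
# if this is already low, that supports doing feature engineering first.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean())
```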
