How does sklearn KMeans's predict method work, and what is it doing?

Posted 2025-02-13 00:43:53


I have been playing around with sklearn's k-means clustering class and I am confused about its predict method.

I have applied a model on the iris dataset like so:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import classification_report

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

pca = PCA(n_components=2).fit(X_train)

X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

kmeans_pca = KMeans(n_clusters=3).fit(X_train_pca)

And have made predictions:

pred = kmeans_pca.predict(X_test_pca)

print(classification_report(y_test, pred))

          precision    recall  f1-score   support

       0       1.00      1.00      1.00        19
       1       0.76      0.87      0.81        15
       2       0.86      0.75      0.80        16

    accuracy                           0.88        50
   macro avg       0.87      0.87      0.87        50
weighted avg       0.88      0.88      0.88        50

The predictions seem adequate, which has confused me, as I have not passed any labels to the training set. I have read this post What is the use of predict() method in kmeans implementation of scikit learn?, which tells me that the predict method assigns each test sample to the closest cluster centroid. However, I don't know how sklearn correctly assigns the cluster ids during the training stage in the first place (i.e. how kmeans_pca.labels_ ends up matching the respective y_train), as the training stage does not involve labels.

I realise that k-means is not used for classification tasks, but I would like to know how these results were achieved. With this, what purpose could .predict() serve when performing k-means clustering in sklearn?


Comments (3)

心房的律动 2025-02-20 00:43:53


KMeans clustering is an example of unsupervised learning. This means that, indeed, it does not take into account any labels for training.

Instead, examples are clustered entirely from patterns among the features - similar examples are grouped together. In the case of the Iris dataset, different examples of the same flower species tend to have similar sepal and petal lengths and widths (i.e. the 'features' of the flower). That means the features alone give away how to group the flowers - without any need for explicit labels.

To understand how the results are achieved, it might be helpful to understand the algorithm. The following is the most common KMeans algorithm (Lloyd's algorithm) and is based on the following steps:

  1. Initialize K different cluster centroids (possibly randomly, but not necessarily)
  2. Assign each example to the nearest cluster (e.g. based on Euclidean distance between feature vector and cluster centroids)
  3. Recalculate cluster centroids from cluster members found in step 2.

Steps 2 and 3 are repeated until convergence (i.e. when cluster assignments no longer change).

The above algorithm ultimately assigns similar examples to the same clusters and, hence, only cares about similarities between the features and not their labels.

The .predict() method will give you the most likely cluster assignment for any test example (e.g. a 'flower', as above). Indeed, this is done by assigning it to the closest cluster centroid learned above.
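The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration of the loop, not sklearn's implementation: `lloyd_kmeans` is a hypothetical helper, the init simply samples data points, and empty-cluster handling and multiple restarts are omitted.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Minimal sketch of the KMeans (Lloyd) loop described above."""
    rng = np.random.default_rng(seed)
    # Step 1: initialise k centroids by picking random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each example to the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its members
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # converged: assignments stable
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated blobs: the loop recovers one centroid per blob,
# using only the features - no labels anywhere.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, size=(5, 2)),
               rng.normal(10, 0.5, size=(5, 2))])
centroids, labels = lloyd_kmeans(X, k=2)
```

Note that the returned `labels` are arbitrary integer ids: which blob gets id 0 depends only on the initialisation, which is exactly why they need not line up with any external class labels.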

影子是时光的心 2025-02-20 00:43:53


The KMeans clustering code assigns each data point to one of the K clusters that you specified when fitting the model. The integer cluster ids can come out differently on different runs, although points belonging to the same cluster will always share the same id within a run.

E.g., consider that the cluster ids (labels) assigned to your data were [1 1 0 0 2 2 2] for K=3; in the next run they could have been [0 0 2 2 1 1 1]. Note that the cluster ids have changed, even though the points belonging to the same cluster still share one id.

In your case, during prediction, the model happened to assign cluster ids that line up with the true class labels, although with 3 clusters there are 3! = 6 possible permutations of id assignments, and only one of them matches.
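This id-permutation effect is easy to verify on a toy dataset: fitting with two different seeds always produces the same grouping of points, while the integer ids attached to each group may or may not coincide between runs. A small sketch (the data is made up for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Three tight, well-separated pairs of points
X = np.array([[0.0, 0.0], [0.1, 0.0],
              [5.0, 5.0], [5.1, 5.0],
              [0.0, 9.0], [0.1, 9.0]])

labels_a = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
labels_b = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)

# Within each run the grouping is identical: each pair shares an id,
# and the three pairs get three distinct ids. The ids themselves are arbitrary.
for labels in (labels_a, labels_b):
    assert labels[0] == labels[1]
    assert labels[2] == labels[3]
    assert labels[4] == labels[5]
    assert len({labels[0], labels[2], labels[4]}) == 3
```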

This was my output from predicting with a KMeans clustering model trained on the Iris data:

print(classification_report(y_test, pred))
              precision    recall  f1-score   support

           0       0.00      0.00      0.00        19
           1       0.00      0.00      0.00        15
           2       0.92      0.75      0.83        16

    accuracy                           0.24        50
   macro avg       0.31      0.25      0.28        50
weighted avg       0.30      0.24      0.26        50

As you can see, only the points belonging to cluster id 2 were assigned the 'correct' cluster, because that id happened to match the true label during training; the id permutation did not match for the remaining two clusters, which drags the overall accuracy down.
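If you do want to score clusters against known labels, one common fix is to relabel the cluster ids with the permutation that best matches the true labels before calling classification_report. A sketch using the Hungarian algorithm on the confusion matrix; `remap_clusters` is a hypothetical helper, and scipy is an extra assumption here:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix

def remap_clusters(y_true, y_pred):
    """Relabel cluster ids to best match the true labels (Hungarian algorithm)."""
    cm = confusion_matrix(y_true, y_pred)
    # Maximise agreement: rows are true labels, columns are cluster ids
    row_ind, col_ind = linear_sum_assignment(-cm)
    mapping = {col: row for row, col in zip(row_ind, col_ind)}
    return np.array([mapping[c] for c in y_pred])

# Toy example: the cluster ids are a permutation of the true labels
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([2, 2, 0, 0, 1, 1])   # same grouping, permuted ids
remapped = remap_clusters(y_true, y_pred)
print(remapped)   # -> [0 0 1 1 2 2]
```

After remapping, a classification report reflects how well the clustering recovered the class structure rather than which arbitrary ids it happened to pick.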

弱骨蛰伏 2025-02-20 00:43:53


Clustering is an unsupervised learning algorithm, which means that it does not need labels to train.

When you specify KMeans(n_clusters=3), the model will try to create 3 clusters.

In this case the clustering algorithm will find 3 centroids that maximise the inter-cluster distance and minimise the intra-cluster distance.

The cluster ids are attributed randomly, so if you run the same algorithm on the same 4 points without fixing the seed, you can get different results (e.g. Run1: [0,0,1,2], Run2: [1,1,0,2], Run3: [2,2,0,1], ...).

So once the model is trained we can predict (even if the term 'prediction' is not entirely adequate here), which consists of giving each row the label of the closest centroid.
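That nearest-centroid rule can be checked directly against sklearn: recomputing the argmin of the Euclidean distances to the fitted `cluster_centers_` reproduces `.predict()` exactly. A small sketch on made-up data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X_train = rng.normal(size=(60, 2))
X_test = rng.normal(size=(10, 2))

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train)

# Reproduce .predict() by hand: index of the closest learned centroid
dists = np.linalg.norm(X_test[:, None, :] - km.cluster_centers_[None, :, :], axis=2)
manual = dists.argmin(axis=1)

assert np.array_equal(manual, km.predict(X_test))
```

So the practical purpose of `.predict()` is to assign new, unseen points to the clusters discovered during fitting, without refitting the model.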
