How does sklearn KMeans's predict method work, and what is it doing?

Posted 2025-02-13 00:43:53


I have been playing around with sklearn's k-means clustering class and I am confused about its predict method.

I have applied a model on the iris dataset like so:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import classification_report

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

pca = PCA(n_components=2).fit(X_train)

X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

kmeans_pca = KMeans(n_clusters=3).fit(X_train_pca)

And have made predictions:

pred = kmeans_pca.predict(X_test_pca)

print(classification_report(y_test, pred))

          precision    recall  f1-score   support

       0       1.00      1.00      1.00        19
       1       0.76      0.87      0.81        15
       2       0.86      0.75      0.80        16

    accuracy                           0.88        50
   macro avg       0.87      0.87      0.87        50
weighted avg       0.88      0.88      0.88        50

The predictions seem adequate, which has confused me, as I have not passed any labels to the training set. I have read this post What is the use of predict() method in kmeans implementation of scikit learn?, which tells me that the predict method assigns each test sample to the closest cluster centroid. However, I don't know how sklearn correctly assigns the cluster ids during the training stage in the first place (i.e. how kmeans_pca.labels_ ends up matching the respective y_train), as the training stage does not involve labels.

I realise that k-means is not used for classification tasks, but I would like to know how these results were achieved. With this, what purpose could .predict() serve when performing k-means clustering in sklearn?


Comments (3)

心房的律动 2025-02-20 00:43:53


KMeans clustering is an example of unsupervised learning. This means that, indeed, it does not take into account any labels for training.

Instead, examples are clustered entirely from patterns among the features - similar examples are grouped together. In the case of the Iris dataset, different examples of the same flower species tend to have similar sepal and petal lengths and widths (i.e. the 'features' of the flower). That means the features alone give away how to group the flowers - without any need for explicit labels.

To understand how the results are achieved, it might be helpful to understand the algorithm. The following is the most common KMeans algorithm (Lloyd's algorithm) and is based on the following steps:

  1. Initialize K different cluster centroids (possibly randomly, but not necessarily)
  2. Assign each example to the nearest cluster (e.g. based on Euclidean distance between feature vector and cluster centroids)
  3. Recalculate cluster centroids from cluster members found in step 2.

Steps 2 and 3 are repeated until convergence (i.e. when cluster assignments no longer change).

The above algorithm ultimately assigns similar examples to the same clusters and, hence, only cares about similarities between the features and not their labels.

The .predict() method will give you the most likely cluster assignment for any test example (e.g. a 'flower', as above). Indeed, this is done by assigning it to the closest cluster centroid learned above.
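The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration of the loop, not sklearn's implementation: `lloyd_kmeans` is a hypothetical helper, the init simply samples data points, and empty-cluster handling and multiple restarts are omitted.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Minimal sketch of the KMeans (Lloyd) loop described above."""
    rng = np.random.default_rng(seed)
    # Step 1: initialise k centroids by picking random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each example to the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its members
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # converged: assignments stable
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated blobs: the loop recovers one centroid per blob,
# using only the features - no labels anywhere.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, size=(5, 2)),
               rng.normal(10, 0.5, size=(5, 2))])
centroids, labels = lloyd_kmeans(X, k=2)
```

Note that the returned `labels` are arbitrary integer ids: which blob gets id 0 depends only on the initialisation, which is exactly why they need not line up with any external class labels.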

影子是时光的心 2025-02-20 00:43:53


The KMeans clustering code assigns each data point to one of the K clusters that you specified when fitting the model. The integer cluster ids can come out differently on different runs, although points belonging to the same cluster will always share the same id within a run.

E.g., consider that the cluster ids (labels) assigned to your data were [1 1 0 0 2 2 2] for K=3; in the next run they could have been [0 0 2 2 1 1 1]. Note that the cluster ids have changed, even though the points belonging to the same cluster still share one id.

In your case, during prediction, the model happened to assign cluster ids that line up with the true class labels, although with 3 clusters there are 3! = 6 possible permutations of id assignments, and only one of them matches.
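This id-permutation effect is easy to verify on a toy dataset: fitting with two different seeds always produces the same grouping of points, while the integer ids attached to each group may or may not coincide between runs. A small sketch (the data is made up for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Three tight, well-separated pairs of points
X = np.array([[0.0, 0.0], [0.1, 0.0],
              [5.0, 5.0], [5.1, 5.0],
              [0.0, 9.0], [0.1, 9.0]])

labels_a = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
labels_b = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)

# Within each run the grouping is identical: each pair shares an id,
# and the three pairs get three distinct ids. The ids themselves are arbitrary.
for labels in (labels_a, labels_b):
    assert labels[0] == labels[1]
    assert labels[2] == labels[3]
    assert labels[4] == labels[5]
    assert len({labels[0], labels[2], labels[4]}) == 3
```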

This was my output from predicting with a KMeans clustering model trained on the Iris data:

print(classification_report(y_test, pred))
              precision    recall  f1-score   support

           0       0.00      0.00      0.00        19
           1       0.00      0.00      0.00        15
           2       0.92      0.75      0.83        16

    accuracy                           0.24        50
   macro avg       0.31      0.25      0.28        50
weighted avg       0.30      0.24      0.26        50

As you can see, only the points belonging to cluster id 2 were assigned the 'correct' cluster, because that id happened to match the true label during training; the id permutation did not match for the remaining two clusters, which drags the overall accuracy down.
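If you do want to score clusters against known labels, one common fix is to relabel the cluster ids with the permutation that best matches the true labels before calling classification_report. A sketch using the Hungarian algorithm on the confusion matrix; `remap_clusters` is a hypothetical helper, and scipy is an extra assumption here:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix

def remap_clusters(y_true, y_pred):
    """Relabel cluster ids to best match the true labels (Hungarian algorithm)."""
    cm = confusion_matrix(y_true, y_pred)
    # Maximise agreement: rows are true labels, columns are cluster ids
    row_ind, col_ind = linear_sum_assignment(-cm)
    mapping = {col: row for row, col in zip(row_ind, col_ind)}
    return np.array([mapping[c] for c in y_pred])

# Toy example: the cluster ids are a permutation of the true labels
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([2, 2, 0, 0, 1, 1])   # same grouping, permuted ids
remapped = remap_clusters(y_true, y_pred)
print(remapped)   # -> [0 0 1 1 2 2]
```

After remapping, a classification report reflects how well the clustering recovered the class structure rather than which arbitrary ids it happened to pick.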

弱骨蛰伏 2025-02-20 00:43:53


Clustering is an unsupervised learning algorithm, which means that it does not need labels to train.

When you specify KMeans(n_clusters=3), the model will try to create 3 clusters.

In this case the clustering algorithm will find 3 centroids that maximise the inter-cluster distance and minimise the intra-cluster distance.

The cluster ids are attributed randomly, so if you run the same algorithm on the same 4 points without fixing the seed, you can get different results (e.g. Run1: [0,0,1,2], Run2: [1,1,0,2], Run3: [2,2,0,1], ...).

So once the model is trained we can predict (even if the term 'prediction' is not entirely adequate here), which consists of giving each row the label of the closest centroid.
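That nearest-centroid rule can be checked directly against sklearn: recomputing the argmin of the Euclidean distances to the fitted `cluster_centers_` reproduces `.predict()` exactly. A small sketch on made-up data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X_train = rng.normal(size=(60, 2))
X_test = rng.normal(size=(10, 2))

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train)

# Reproduce .predict() by hand: index of the closest learned centroid
dists = np.linalg.norm(X_test[:, None, :] - km.cluster_centers_[None, :, :], axis=2)
manual = dists.argmin(axis=1)

assert np.array_equal(manual, km.predict(X_test))
```

So the practical purpose of `.predict()` is to assign new, unseen points to the clusters discovered during fitting, without refitting the model.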
