K-Means Algorithm



Possible duplicates:
How to find the optimal K in the K-Means Algorithm
How do I determine k when using k-means clustering?

Can we decide on K based on statistical measures such as the standard deviation, mean, variance, etc.?

Or

Is there any simple method to choose K in the k-means algorithm?

Thanks in advance,
Navin


Comments (5)

沧笙踏歌 2024-11-22 09:16:00


If you explicitly want to use k-means, you could study the article describing x-means. When using an implementation of x-means, the only difference compared to k-means is that rather than specifying a single k, you specify a range for k. The "best" choice in that range, with respect to some measure, will be part of the output from x-means. You can also look into the Mean Shift clustering algorithm, sketched below.
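The Mean Shift pointer can be illustrated with a minimal scikit-learn sketch (a toy make_blobs dataset stands in for real data); note that Mean Shift replaces k with a bandwidth parameter and infers the number of clusters itself:

```python
# Mean Shift needs no k; the number of clusters falls out of the
# bandwidth, which we estimate from the data here.
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)  # toy data

bw = estimate_bandwidth(X, quantile=0.2, random_state=0)
labels = MeanShift(bandwidth=bw).fit_predict(X)
print(len(set(labels)), "clusters found")
```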

If it is computationally feasible with your given data (possibly using sampling, as yura suggests), you could cluster with various k's and evaluate the quality of the resulting clusterings using some of the standard cluster validity measures (e.g., the silhouette coefficient or the Dunn and Davies-Bouldin indices).
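A rough sketch of that procedure, assuming Python with scikit-learn and the silhouette coefficient as the validity measure (toy data stands in for yours):

```python
# Cluster with several candidate k's and keep the k with the best
# silhouette score.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)  # toy data

scores = {}
for k in range(2, 11):  # silhouette is only defined for k >= 2
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(f"best k by silhouette: {best_k}")
```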

@doug
It is not correct that k-means++ determines an optimal k for the number of clusters before cluster assignments start. k-means++ differs from k-means only in that, instead of choosing the initial k centroids at random, it chooses one initial centroid randomly and then successively chooses centers until k have been chosen. After the initial, completely random choice, data points are chosen as new centroids with a probability determined by a potential function that depends on each data point's distance to the already-chosen centers. The standard reference for k-means++ is "k-means++: The Advantages of Careful Seeding" by Arthur and Vassilvitskii.
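For illustration, here is a minimal sketch of the seeding step just described, following the D² weighting from the Arthur and Vassilvitskii paper (the helper name kmeanspp_seeds is made up for this example; it is not a library API):

```python
# k-means++ seeding sketch: first center uniform at random, each further
# center drawn with probability proportional to its squared distance to
# the nearest center chosen so far (the "potential" weighting).
import numpy as np

def kmeanspp_seeds(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]  # first center: uniform at random
    while len(centers) < k:
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)
```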

Also, I don't think that choosing k to be the number of principal components will, in general, improve your clustering. Imagine data points in three-dimensional space that all lie in a plane passing through the origin. You will then get 2 principal components, but the "natural" clustering of the points could have any number of clusters.

清晰传感 2024-11-22 09:16:00


Unfortunately not. There is no principled statistical method, simple or complex, that can set the "right" K. There are heuristics and rules of thumb that sometimes work and sometimes don't.

The situation is quite general, since many clustering methods have these kinds of parameters.

孤者何惧 2024-11-22 09:16:00


Well, there are two practical solutions in common use for the problem of intelligently selecting the number of centroids (k).

The first is to run PCA on your data; the output from PCA--the principal components (eigenvectors) and their cumulative contribution to the variation observed in the data--suggests an optimal number of centroids. (E.g., if 95% of the variability in your data is explained by the first three principal components, then k=3 is a wise choice for k-means.)
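A minimal sketch of this PCA heuristic, assuming scikit-learn (and bearing in mind the objection raised in the earlier answer):

```python
# Pick k as the number of principal components needed to explain ~95%
# of the variance, per the heuristic above.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

X, _ = make_blobs(n_samples=500, centers=4, n_features=10, random_state=0)

cumvar = np.cumsum(PCA().fit(X).explained_variance_ratio_)
k = int(np.searchsorted(cumvar, 0.95)) + 1  # components needed for 95%
print(f"k suggested by the 95% variance heuristic: {k}")
```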

The second commonly used practical solution for intelligently estimating k is a revised implementation of the k-means algorithm, called k-means++. In essence, k-means++ differs from the original k-means only by the addition of a pre-processing step, during which the number and initial positions of the centroids are estimated.

The algorithm that k-means++ relies on to do this is straightforward to understand and to implement in code. A good source for both is a 2007 post on the LingPipe Blog, which offers an excellent explanation of k-means++ and includes a citation to the original paper that first introduced the technique.

Aside from providing an optimal choice for k, k-means++ is apparently superior to the original k-means in both performance (roughly half the processing time of k-means in one published comparison) and accuracy (a three-orders-of-magnitude improvement in error in the same comparison study).
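As a usage note beyond the original answer: in scikit-learn, k-means++ seeding is the default initializer for KMeans, but, consistent with the correction in the earlier answer, you still have to choose k yourself:

```python
# k-means++ is only a seeding strategy here; k must still be supplied.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

km = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0).fit(X)
print(km.cluster_centers_.shape)  # (4, n_features)
```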

讽刺将军 2024-11-22 09:16:00


Bayesian k-means may be a solution when you don't know the number of clusters. A related paper is given on the website, and the corresponding MATLAB code is provided as well.

娇女薄笑 2024-11-22 09:16:00


The best solution for an ML problem whose parameters are unknown (i.e., with no statistical parameter model, etc.) is to sample the data, find the parameters that work best on the sub-problem, and then use them on the full problem. In that case, select the best K for 5% of the data.
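A minimal sketch of that idea, assuming scikit-learn and the silhouette score as the selection criterion (the answer itself does not name one):

```python
# Tune k on a 5% subsample, then fit the full dataset with the winner.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=20_000, centers=5, random_state=0)  # toy data

rng = np.random.default_rng(0)
sample = X[rng.choice(len(X), size=len(X) // 20, replace=False)]  # 5%

def score(k):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(sample)
    return silhouette_score(sample, labels)

best_k = max(range(2, 11), key=score)
model = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit(X)  # full data
print(f"best k on the 5% sample: {best_k}")
```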
